Files
2nd/10_Wiki/Topics/Architecture/Real-time-Data-Streaming.md
T
koriweb d8a80f6272 chore(wiki): dangling 링크 canonical 정규화 (768파일/1200건)
이름만 다른(표기 변형) [[위키링크]]를 대상 문서의 canonical 제목으로 치환해
끊겼던 1,200개 링크를 연결. 제목/파일명 정규화 일치만 적용하고 별칭 매칭은
과병합 위험으로 제외(애매성 가드). 원본은 _link_reconcile_backup/ 에 백업.
도구: Datacollect/scripts/link_reconcile_apply.mjs

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 12:24:15 +09:00

6.0 KiB

id, title, category, status, canonical_id, aliases, duplicate_of, source_trust_level, confidence_score, verification_status, tags, raw_sources, last_reinforced, github_commit, tech_stack
id title category status canonical_id aliases duplicate_of source_trust_level confidence_score verification_status tags raw_sources last_reinforced github_commit tech_stack
wiki-2026-0508-real-time-data-streaming Real-time Data Streaming 10_Wiki/Topics verified self
Stream Processing
Event Streaming
Real-time Pipelines
none A 0.95 applied
streaming
kafka
pulsar
flink
data-engineering
2026-05-10 pending
language framework
python kafka-flink-pulsar

Real-time Data Streaming

매 한 줄

"매 batch ETL 의 X — 매 unbounded events 매 milliseconds latency 매 process". Kafka (LinkedIn 2010) → Flink / Spark Structured Streaming / Pulsar / Materialize / RisingWave 매 modern stack. 매 2026 매 sub-second analytics 매 default.

매 핵심

매 layers

  • Ingest: Kafka, Pulsar, Kinesis, Redpanda — 매 durable log.
  • Process: Flink, Spark Streaming, Kafka Streams, Bytewax, Arroyo.
  • Serve: Materialize, RisingWave, Pinot, Druid, ClickHouse.

매 windowing

  • Tumbling: fixed, non-overlapping (1-min buckets).
  • Sliding: overlapping (5-min, slide 1-min).
  • Session: gap-based (user activity until 30s inactivity).
  • Hopping: same as sliding (different name).

매 time semantics

  • Event time: 매 actual occurrence — 매 correctness 위해 default.
  • Processing time: 매 broker arrival — 매 latency 측정.
  • Ingestion time: 매 broker append.
  • Watermarks: 매 lateness threshold — Flink/Beam 핵심.

매 응용

  1. Fraud detection — 매 payment stream + ML inference.
  2. Real-time dashboards — Materialize + Grafana.
  3. CDC pipelines — Debezium → Kafka → warehouse.
  4. IoT telemetry — MQTT → Kafka → anomaly detection.
  5. Personalization — clickstream → feature store → model.

💻 패턴

Kafka Streams (Python via Faust / Bytewax)

import bytewax.operators as op
from bytewax.dataflow import Dataflow
from bytewax.connectors.kafka import KafkaSource, KafkaSink

flow = Dataflow("fraud")
src = op.input("in", flow, KafkaSource(["localhost:9092"], ["payments"]))
parsed = op.map("parse", src, lambda kv: json.loads(kv.value))
flagged = op.filter("flag", parsed, lambda p: p["amount"] > 10000)
op.output("out", flagged, KafkaSink(["localhost:9092"], "alerts"))
CREATE TABLE clicks (
  user_id STRING, ts TIMESTAMP(3),
  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
) WITH ('connector' = 'kafka', ...);

SELECT
  user_id,
  TUMBLE_START(ts, INTERVAL '1' MINUTE) AS w_start,
  COUNT(*) AS clicks
FROM clicks
GROUP BY user_id, TUMBLE(ts, INTERVAL '1' MINUTE);

Materialize (streaming SQL view)

CREATE SOURCE orders FROM KAFKA BROKER 'kafka:9092' TOPIC 'orders'
  FORMAT AVRO USING SCHEMA REGISTRY 'http://sr:8081';

CREATE MATERIALIZED VIEW revenue_5min AS
SELECT
  date_trunc('minute', ts) AS minute,
  SUM(amount) AS revenue
FROM orders
WHERE ts > now() - INTERVAL '5 minutes'
GROUP BY 1;

-- Subscribe to changes
SUBSCRIBE TO revenue_5min;

Spark Structured Streaming

df = (spark.readStream.format("kafka")
        .option("subscribe", "events")
        .load())
agg = (df.selectExpr("CAST(value AS STRING) as json")
         .select(from_json("json", schema).alias("e"))
         .withWatermark("e.ts", "10 minutes")
         .groupBy(window("e.ts", "1 minute"), "e.user")
         .count())
agg.writeStream.format("delta").outputMode("append").start("/lake/agg")

Pulsar Functions (lightweight processing)

from pulsar import Function

class EnrichOrder(Function):
    def process(self, msg, ctx):
        order = json.loads(msg)
        order["region"] = lookup_region(order["zip"])
        ctx.publish("orders.enriched", json.dumps(order))

Exactly-once with Kafka transactions

producer = KafkaProducer(transactional_id="tx-1", enable_idempotence=True)
producer.init_transactions()
try:
    producer.begin_transaction()
    producer.send("out", value=processed)
    producer.send_offsets_to_transaction(offsets, group_id)
    producer.commit_transaction()
except:
    producer.abort_transaction()

CDC with Debezium

# connector config
connector.class: io.debezium.connector.postgresql.PostgresConnector
database.hostname: pg
table.include.list: public.orders
plugin.name: pgoutput
# emits CDC events to Kafka topic "pg.public.orders"

매 결정 기준

상황 Stack
매 simple transform Kafka Streams / Bytewax
매 complex windowing, joins Flink
매 SQL-first analytics Materialize / RisingWave
매 batch+stream unified Spark Structured / Beam
매 lightweight, serverless Pulsar Functions / AWS Lambda
매 OLAP serving Pinot / Druid / ClickHouse

기본값: 매 2026 매 SQL-on-streams (Materialize/RisingWave) 매 default — DX 압도적.

🔗 Graph

🤖 LLM 활용

언제: 매 SQL DDL/query generation, 매 schema evolution analysis, 매 anomaly investigation summarization. 언제 X: 매 latency-critical hot path — LLM inference 매 too slow. 매 trained ML model 사용.

안티패턴

  • Processing time everywhere: 매 out-of-order events 매 wrong results — event time + watermarks 사용.
  • Unbounded state: 매 keyed state 매 grows forever — TTL / windows 필수.
  • Tiny files: 매 1 record / file → S3 explosion. 매 batching + compaction.
  • Sync external calls in pipeline: 매 backpressure 폭발. 매 async + bulkhead.
  • No replay strategy: 매 bad code → poisoned downstream. 매 reset offset + idempotent sinks.

🧪 검증 / 중복

  • Verified (Akidau — "Streaming 101/102"; Kafka docs; Flink docs 1.18+).
  • 신뢰도 A.

🕓 Changelog

날짜 변경
2026-05-08 Phase 1
2026-05-10 Manual cleanup — full streaming entry with Materialize/RisingWave