--- id: wiki-2026-0508-real-time-data-streaming title: Real-time Data Streaming category: 10_Wiki/Topics status: verified canonical_id: self aliases: [Stream Processing, Event Streaming, Real-time Pipelines] duplicate_of: none source_trust_level: A confidence_score: 0.95 verification_status: applied tags: [streaming, kafka, pulsar, flink, data-engineering] raw_sources: [] last_reinforced: 2026-05-10 github_commit: pending tech_stack: language: python framework: kafka-flink-pulsar --- # Real-time Data Streaming ## 매 한 줄 > **"매 batch ETL 의 X — 매 unbounded events 매 milliseconds latency 매 process"**. Kafka (LinkedIn 2010) → Flink / Spark Structured Streaming / Pulsar / Materialize / RisingWave 매 modern stack. 매 2026 매 sub-second analytics 매 default. ## 매 핵심 ### 매 layers - **Ingest**: Kafka, Pulsar, Kinesis, Redpanda — 매 durable log. - **Process**: Flink, Spark Streaming, Kafka Streams, Bytewax, Arroyo. - **Serve**: Materialize, RisingWave, Pinot, Druid, ClickHouse. ### 매 windowing - **Tumbling**: fixed, non-overlapping (1-min buckets). - **Sliding**: overlapping (5-min, slide 1-min). - **Session**: gap-based (user activity until 30s inactivity). - **Hopping**: same as sliding (different name). ### 매 time semantics - **Event time**: 매 actual occurrence — 매 correctness 위해 default. - **Processing time**: 매 broker arrival — 매 latency 측정. - **Ingestion time**: 매 broker append. - **Watermarks**: 매 lateness threshold — Flink/Beam 핵심. ### 매 응용 1. Fraud detection — 매 payment stream + ML inference. 2. Real-time dashboards — Materialize + Grafana. 3. CDC pipelines — Debezium → Kafka → warehouse. 4. IoT telemetry — MQTT → Kafka → anomaly detection. 5. Personalization — clickstream → feature store → model. ## 💻 패턴 ### Kafka Streams (Python via Faust / Bytewax) ```python import bytewax.operators as op from bytewax.dataflow import Dataflow from bytewax.connectors.kafka import KafkaSource, KafkaSink flow = Dataflow("fraud") src = op.input("in", flow, KafkaSource(["localhost:9092"], ["payments"])) parsed = op.map("parse", src, lambda kv: json.loads(kv.value)) flagged = op.filter("flag", parsed, lambda p: p["amount"] > 10000) op.output("out", flagged, KafkaSink(["localhost:9092"], "alerts")) ``` ### Flink SQL (windowed aggregation) ```sql CREATE TABLE clicks ( user_id STRING, ts TIMESTAMP(3), WATERMARK FOR ts AS ts - INTERVAL '5' SECOND ) WITH ('connector' = 'kafka', ...); SELECT user_id, TUMBLE_START(ts, INTERVAL '1' MINUTE) AS w_start, COUNT(*) AS clicks FROM clicks GROUP BY user_id, TUMBLE(ts, INTERVAL '1' MINUTE); ``` ### Materialize (streaming SQL view) ```sql CREATE SOURCE orders FROM KAFKA BROKER 'kafka:9092' TOPIC 'orders' FORMAT AVRO USING SCHEMA REGISTRY 'http://sr:8081'; CREATE MATERIALIZED VIEW revenue_5min AS SELECT date_trunc('minute', ts) AS minute, SUM(amount) AS revenue FROM orders WHERE ts > now() - INTERVAL '5 minutes' GROUP BY 1; -- Subscribe to changes SUBSCRIBE TO revenue_5min; ``` ### Spark Structured Streaming ```python df = (spark.readStream.format("kafka") .option("subscribe", "events") .load()) agg = (df.selectExpr("CAST(value AS STRING) as json") .select(from_json("json", schema).alias("e")) .withWatermark("e.ts", "10 minutes") .groupBy(window("e.ts", "1 minute"), "e.user") .count()) agg.writeStream.format("delta").outputMode("append").start("/lake/agg") ``` ### Pulsar Functions (lightweight processing) ```python from pulsar import Function class EnrichOrder(Function): def process(self, msg, ctx): order = json.loads(msg) order["region"] = lookup_region(order["zip"]) ctx.publish("orders.enriched", json.dumps(order)) ``` ### Exactly-once with Kafka transactions ```python producer = KafkaProducer(transactional_id="tx-1", enable_idempotence=True) producer.init_transactions() try: producer.begin_transaction() producer.send("out", value=processed) producer.send_offsets_to_transaction(offsets, group_id) producer.commit_transaction() except: producer.abort_transaction() ``` ### CDC with Debezium ```yaml # connector config connector.class: io.debezium.connector.postgresql.PostgresConnector database.hostname: pg table.include.list: public.orders plugin.name: pgoutput # emits CDC events to Kafka topic "pg.public.orders" ``` ## 매 결정 기준 | 상황 | Stack | |---|---| | 매 simple transform | Kafka Streams / Bytewax | | 매 complex windowing, joins | Flink | | 매 SQL-first analytics | Materialize / RisingWave | | 매 batch+stream unified | Spark Structured / Beam | | 매 lightweight, serverless | Pulsar Functions / AWS Lambda | | 매 OLAP serving | Pinot / Druid / ClickHouse | **기본값**: 매 2026 매 SQL-on-streams (Materialize/RisingWave) 매 default — DX 압도적. ## 🔗 Graph - 부모: [[Data Engineering]] · [[Event-Driven Architecture]] - 변형: [[Stream-Processing-Architectures|Stream Processing]] · [[CEP]] - 응용: [[CDC]] - Adjacent: [[Kafka]] · [[Materialize]] ## 🤖 LLM 활용 **언제**: 매 SQL DDL/query generation, 매 schema evolution analysis, 매 anomaly investigation summarization. **언제 X**: 매 latency-critical hot path — LLM inference 매 too slow. 매 trained ML model 사용. ## ❌ 안티패턴 - **Processing time everywhere**: 매 out-of-order events 매 wrong results — event time + watermarks 사용. - **Unbounded state**: 매 keyed state 매 grows forever — TTL / windows 필수. - **Tiny files**: 매 1 record / file → S3 explosion. 매 batching + compaction. - **Sync external calls in pipeline**: 매 backpressure 폭발. 매 async + bulkhead. - **No replay strategy**: 매 bad code → poisoned downstream. 매 reset offset + idempotent sinks. ## 🧪 검증 / 중복 - Verified (Akidau — "Streaming 101/102"; Kafka docs; Flink docs 1.18+). - 신뢰도 A. ## 🕓 Changelog | 날짜 | 변경 | |---|---| | 2026-05-08 | Phase 1 | | 2026-05-10 | Manual cleanup — full streaming entry with Materialize/RisingWave |