Files
2nd/10_Wiki/Topics/Architecture/Event stream processing.md
T
2026-05-10 22:08:15 +09:00

6.8 KiB

id, title, category, status, canonical_id, aliases, duplicate_of, source_trust_level, confidence_score, verification_status, tags, raw_sources, last_reinforced, github_commit, tech_stack
id title category status canonical_id aliases duplicate_of source_trust_level confidence_score verification_status tags raw_sources last_reinforced github_commit tech_stack
wiki-2026-0508-event-stream-processing Event Stream Processing 10_Wiki/Topics verified self
Stream Processing
ESP
Streaming Analytics
none A 0.93 applied
streaming
kafka
flink
real-time
data-engineering
2026-05-10 pending
language framework
polyglot kafka-flink

Event Stream Processing

매 한 줄

"매 unbounded data 의 매 record-by-record 의 transform". 매 batch 의 ETL 의 opposite — 매 event 의 arrive 시점 에 immediately compute. 매 2026 의 매 dominant stack 의 Kafka + Flink (or Kafka Streams), 매 emerging 의 RisingWave / Materialize (streaming SQL DB), 매 cloud-native 의 Confluent Cloud / AWS Kinesis / GCP Dataflow.

매 핵심

매 batch vs streaming

  • Batch: 매 hourly/daily, 매 high latency, 매 reprocess easy.
  • Streaming: 매 sub-second, 매 always-on, 매 stateful.
  • Lambda architecture: batch + streaming hybrid (매 deprecated 2026).
  • Kappa architecture: streaming-only, 매 replay 의 reprocess.

매 핵심 개념

  • Event time vs processing time — 매 event 의 produce 시점 vs broker 의 receive.
  • Watermark — 매 "event time T 까지의 모든 event 의 도착 의 expect" 의 signal.
  • Window — tumbling / sliding / session.
  • Exactly-once semantics — Kafka transactions + Flink checkpoints.
  • Stateful operator — 매 RocksDB / state backend.

매 frameworks (2026)

  • Apache Flink 1.20 — 매 most powerful, 매 unified batch+streaming, 매 exactly-once.
  • Kafka Streams 3.7 — 매 JVM 만, 매 library (no cluster).
  • Apache Beam — 매 portable runner (Flink/Dataflow/Spark).
  • RisingWave / Materialize — 매 streaming SQL DB, 매 incremental view maintenance.
  • Bytewax — 매 Python-native, 매 Rust core.

매 응용

  1. Real-time fraud detection.
  2. IoT telemetry aggregation.
  3. Live dashboards (clickstream).
  4. CDC → search index sync.
  5. AI feature stores (real-time).

💻 패턴

Pattern 1: Kafka producer (TS)

import { Kafka } from "kafkajs";
const kafka = new Kafka({ brokers: ["broker:9092"] });
const producer = kafka.producer({ idempotent: true, maxInFlightRequests: 5 });

await producer.connect();
await producer.send({
  topic: "orders",
  messages: [{
    key: order.id,
    value: JSON.stringify(order),
    headers: { "event-time": Date.now().toString() },
  }],
});
DataStream<Order> orders = env.fromSource(
    KafkaSource.<Order>builder()
        .setBootstrapServers("broker:9092")
        .setTopics("orders")
        .setDeserializer(new OrderDeserializer())
        .build(),
    WatermarkStrategy
        .<Order>forBoundedOutOfOrderness(Duration.ofSeconds(5))
        .withTimestampAssigner((o, ts) -> o.eventTime),
    "kafka-source");

orders
    .keyBy(o -> o.userId)
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .aggregate(new OrderCountAgg())
    .sinkTo(new MetricsSink());

Pattern 3: Kafka Streams aggregation

StreamsBuilder builder = new StreamsBuilder();
KStream<String, Order> orders = builder.stream("orders");
orders
    .groupBy((k, v) -> v.userId)
    .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
    .count(Materialized.as("user-order-count"))
    .toStream()
    .to("user-order-counts");
-- 매 RisingWave / Flink SQL 동일
CREATE SOURCE orders (
    order_id VARCHAR, user_id VARCHAR, amount DOUBLE,
    event_time TIMESTAMP, WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH (connector='kafka', topic='orders', properties.bootstrap.servers='broker:9092') FORMAT JSON;

CREATE MATERIALIZED VIEW user_revenue_5m AS
SELECT user_id,
       window_start, window_end,
       SUM(amount) AS revenue
FROM TUMBLE(orders, event_time, INTERVAL '5' MINUTES)
GROUP BY user_id, window_start, window_end;

Pattern 5: Bytewax (Python streaming)

from bytewax.dataflow import Dataflow
from bytewax.connectors.kafka import KafkaSource, KafkaSink
from bytewax import operators as op

flow = Dataflow("fraud-detect")
src = op.input("kafka-in", flow, KafkaSource(brokers=["broker:9092"], topics=["tx"]))
parsed = op.map("parse", src, lambda kv: json.loads(kv.value))
flagged = op.filter("suspicious", parsed, lambda tx: tx["amount"] > 10_000)
op.output("kafka-out", flagged, KafkaSink(brokers=["broker:9092"], topic="alerts"))
# Debezium connector config
connector.class: io.debezium.connector.postgresql.PostgresConnector
database.hostname: pg
database.dbname: app
plugin.name: pgoutput
table.include.list: public.orders
topic.prefix: cdc
KafkaSink<Order> sink = KafkaSink.<Order>builder()
    .setBootstrapServers("broker:9092")
    .setRecordSerializer(KafkaRecordSerializationSchema.builder()
        .setTopic("processed-orders")
        .setValueSerializationSchema(new OrderSerializer())
        .build())
    .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
    .setTransactionalIdPrefix("orders-")
    .build();

매 결정 기준

상황 Approach
Sub-second latency, JVM team Flink
Lightweight, library-only Kafka Streams
Python team, simple jobs Bytewax
SQL-only, fast iteration RisingWave / Materialize
Multi-runtime portability Apache Beam
Cloud-managed Confluent / Dataflow / Kinesis

기본값: 매 Kafka + Flink (전통 stack), 매 SQL-only 시 RisingWave.

🔗 Graph

🤖 LLM 활용

언제: 매 real-time pipeline 설계, 매 streaming SQL 의 generate. 언제 X: 매 daily batch 의 충분, 매 small data 의 cron job.

안티패턴

  • No watermark: 매 late event 의 silently drop or wrong window.
  • At-least-once + idempotent missing: 매 duplicate 의 downstream impact.
  • Unbounded state: 매 keyed state 의 TTL 의 missing — 매 OOM.
  • Reorder reliance: 매 partition 별 only 의 ordering — 매 cross-partition 의 X.
  • Synchronous external call inside operator: 매 backpressure / slow.

🧪 검증 / 중복

  • Verified (Flink docs, Kafka docs, "Streaming Systems" Akidau et al., RisingWave docs).
  • 신뢰도 A.

🕓 Changelog

날짜 변경
2026-05-08 Phase 1
2026-05-10 Manual cleanup — Flink/Kafka Streams/RisingWave + windowing