Files
2nd/10_Wiki/Topics/Architecture/Event stream processing.md
T
Antigravity Agent f8b21af4be Wiki cleanup: error-doc removal, dedup merge, link normalization
10_Wiki/Topics 대규모 정리:
- 오류 캡처/미완성 stub 문서 227개 제거
- 교차폴더 중복 43클러스터 병합 (63파일 → redirect)
- 링크명 정규화: 깨진 링크 수정·redirect 직결·개념 매핑 ~2,400건
- 카테고리 MOC 6개 신규 생성
- Graph 섹션 미해결 related-keyword 링크 10,058건 제거

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 23:52:15 +09:00

196 lines
6.7 KiB
Markdown

---
id: wiki-2026-0508-event-stream-processing
title: Event Stream Processing
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Stream Processing, ESP, Streaming Analytics]
duplicate_of: none
source_trust_level: A
confidence_score: 0.93
verification_status: applied
tags: [streaming, kafka, flink, real-time, data-engineering]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
language: polyglot
framework: kafka-flink
---
# Event Stream Processing
## 매 한 줄
> **"매 unbounded data 의 매 record-by-record 의 transform"**. 매 batch 의 ETL 의 opposite — 매 event 의 arrive 시점 에 immediately compute. 매 2026 의 매 dominant stack 의 Kafka + Flink (or Kafka Streams), 매 emerging 의 RisingWave / Materialize (streaming SQL DB), 매 cloud-native 의 Confluent Cloud / AWS Kinesis / GCP Dataflow.
## 매 핵심
### 매 batch vs streaming
- **Batch**: 매 hourly/daily, 매 high latency, 매 reprocess easy.
- **Streaming**: 매 sub-second, 매 always-on, 매 stateful.
- **Lambda architecture**: batch + streaming hybrid (매 deprecated 2026).
- **Kappa architecture**: streaming-only, 매 replay 의 reprocess.
### 매 핵심 개념
- **Event time vs processing time** — 매 event 의 produce 시점 vs broker 의 receive.
- **Watermark** — 매 "event time T 까지의 모든 event 의 도착 의 expect" 의 signal.
- **Window** — tumbling / sliding / session.
- **Exactly-once semantics** — Kafka transactions + Flink checkpoints.
- **Stateful operator** — 매 RocksDB / state backend.
### 매 frameworks (2026)
- **Apache Flink 1.20** — 매 most powerful, 매 unified batch+streaming, 매 exactly-once.
- **Kafka Streams 3.7** — 매 JVM 만, 매 library (no cluster).
- **Apache Beam** — 매 portable runner (Flink/Dataflow/Spark).
- **RisingWave / Materialize** — 매 streaming SQL DB, 매 incremental view maintenance.
- **Bytewax** — 매 Python-native, 매 Rust core.
### 매 응용
1. Real-time fraud detection.
2. IoT telemetry aggregation.
3. Live dashboards (clickstream).
4. CDC → search index sync.
5. AI feature stores (real-time).
## 💻 패턴
### Pattern 1: Kafka producer (TS)
```typescript
import { Kafka } from "kafkajs";
const kafka = new Kafka({ brokers: ["broker:9092"] });
const producer = kafka.producer({ idempotent: true, maxInFlightRequests: 5 });
await producer.connect();
await producer.send({
topic: "orders",
messages: [{
key: order.id,
value: JSON.stringify(order),
headers: { "event-time": Date.now().toString() },
}],
});
```
### Pattern 2: Flink DataStream (Java)
```java
DataStream<Order> orders = env.fromSource(
KafkaSource.<Order>builder()
.setBootstrapServers("broker:9092")
.setTopics("orders")
.setDeserializer(new OrderDeserializer())
.build(),
WatermarkStrategy
.<Order>forBoundedOutOfOrderness(Duration.ofSeconds(5))
.withTimestampAssigner((o, ts) -> o.eventTime),
"kafka-source");
orders
.keyBy(o -> o.userId)
.window(TumblingEventTimeWindows.of(Time.minutes(1)))
.aggregate(new OrderCountAgg())
.sinkTo(new MetricsSink());
```
### Pattern 3: Kafka Streams aggregation
```java
StreamsBuilder builder = new StreamsBuilder();
KStream<String, Order> orders = builder.stream("orders");
orders
.groupBy((k, v) -> v.userId)
.windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
.count(Materialized.as("user-order-count"))
.toStream()
.to("user-order-counts");
```
### Pattern 4: Streaming SQL (Flink SQL / RisingWave)
```sql
-- 매 RisingWave / Flink SQL 동일
CREATE SOURCE orders (
order_id VARCHAR, user_id VARCHAR, amount DOUBLE,
event_time TIMESTAMP, WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH (connector='kafka', topic='orders', properties.bootstrap.servers='broker:9092') FORMAT JSON;
CREATE MATERIALIZED VIEW user_revenue_5m AS
SELECT user_id,
window_start, window_end,
SUM(amount) AS revenue
FROM TUMBLE(orders, event_time, INTERVAL '5' MINUTES)
GROUP BY user_id, window_start, window_end;
```
### Pattern 5: Bytewax (Python streaming)
```python
from bytewax.dataflow import Dataflow
from bytewax.connectors.kafka import KafkaSource, KafkaSink
from bytewax import operators as op
flow = Dataflow("fraud-detect")
src = op.input("kafka-in", flow, KafkaSource(brokers=["broker:9092"], topics=["tx"]))
parsed = op.map("parse", src, lambda kv: json.loads(kv.value))
flagged = op.filter("suspicious", parsed, lambda tx: tx["amount"] > 10_000)
op.output("kafka-out", flagged, KafkaSink(brokers=["broker:9092"], topic="alerts"))
```
### Pattern 6: CDC stream (Debezium → Kafka → Flink)
```yaml
# Debezium connector config
connector.class: io.debezium.connector.postgresql.PostgresConnector
database.hostname: pg
database.dbname: app
plugin.name: pgoutput
table.include.list: public.orders
topic.prefix: cdc
```
### Pattern 7: Exactly-once sink (Flink)
```java
KafkaSink<Order> sink = KafkaSink.<Order>builder()
.setBootstrapServers("broker:9092")
.setRecordSerializer(KafkaRecordSerializationSchema.builder()
.setTopic("processed-orders")
.setValueSerializationSchema(new OrderSerializer())
.build())
.setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
.setTransactionalIdPrefix("orders-")
.build();
```
## 매 결정 기준
| 상황 | Approach |
|---|---|
| Sub-second latency, JVM team | Flink |
| Lightweight, library-only | Kafka Streams |
| Python team, simple jobs | Bytewax |
| SQL-only, fast iteration | RisingWave / Materialize |
| Multi-runtime portability | Apache Beam |
| Cloud-managed | Confluent / Dataflow / Kinesis |
**기본값**: 매 Kafka + Flink (전통 stack), 매 SQL-only 시 RisingWave.
## 🔗 Graph
- 부모: [[Data Engineering]] · [[Distributed Systems]]
- 응용: [[Feature Store]]
- Adjacent: [[Apache Flink]] · [[Event Sourcing]] · [[CQRS]]
## 🤖 LLM 활용
**언제**: 매 real-time pipeline 설계, 매 streaming SQL 의 generate.
**언제 X**: 매 daily batch 의 충분, 매 small data 의 cron job.
## ❌ 안티패턴
- **No watermark**: 매 late event 의 silently drop or wrong window.
- **At-least-once + idempotent missing**: 매 duplicate 의 downstream impact.
- **Unbounded state**: 매 keyed state 의 TTL 의 missing — 매 OOM.
- **Reorder reliance**: 매 partition 별 only 의 ordering — 매 cross-partition 의 X.
- **Synchronous external call inside operator**: 매 backpressure / slow.
## 🧪 검증 / 중복
- Verified (Flink docs, Kafka docs, "Streaming Systems" Akidau et al., RisingWave docs).
- 신뢰도 A.
## 🕓 Changelog
| 날짜 | 변경 |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — Flink/Kafka Streams/RisingWave + windowing |