--- id: wiki-2026-0508-stream-processing-architectures title: Stream Processing Architectures category: 10_Wiki/Topics status: verified canonical_id: self aliases: [Stream Processing, Streaming Systems, Real-time Data Processing] duplicate_of: none source_trust_level: A confidence_score: 0.9 verification_status: applied tags: [stream-processing, kafka, flink, architecture] raw_sources: [] last_reinforced: 2026-05-10 github_commit: pending tech_stack: language: java framework: kafka-streams-flink --- # Stream Processing Architectures ## 매 한 줄 > **"매 unbounded data 의 continuous compute"**. 매 batch 의 finite data 의 처리와 달리 매 stream 의 무한 event flow 의 sub-second latency 의 처리. 2026 의 standard stack 의 Kafka + Flink + Iceberg 의 lakehouse streaming. ## 매 핵심 ### 매 Stream vs Batch - **Batch**: bounded, high throughput, hours latency (Spark, Hadoop). - **Stream**: unbounded, lower throughput, ms-sec latency (Flink, Kafka Streams). - **Unified**: 매 single API 의 batch + stream (Flink Table API, Beam). ### 매 Processing semantics - **At-most-once**: drop on failure (low latency, lossy). - **At-least-once**: retry (duplicates possible). - **Exactly-once**: 매 idempotent + transactional (Kafka EOS, Flink checkpoints). ### 매 Time semantics - **Event time**: 매 sensor emit 시각 (correct but late). - **Processing time**: 매 system clock 시각 (fast but wrong on lag). - **Watermark**: 매 event time 의 progress marker — 매 late event 의 cutoff. ### 매 응용 1. Real-time fraud detection (sub-100ms decision). 2. Trading / market data aggregation. 3. CDC pipelines (Debezium → Kafka → Flink → warehouse). 4. IoT telemetry (sensor → MQTT → stream proc). ## 💻 패턴 ### Kafka Streams — windowed aggregation ```java KStream orders = builder.stream("orders"); orders .groupBy((k, v) -> v.getCustomerId()) .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5))) .aggregate( () -> 0.0, (k, order, total) -> total + order.getAmount(), Materialized.as("customer-5min-total")) .toStream() .to("customer-totals"); ``` ### Flink — event-time + watermark ```java DataStream stream = env .fromSource(kafkaSource, WatermarkStrategy .forBoundedOutOfOrderness(Duration.ofSeconds(10)) .withTimestampAssigner((e, ts) -> e.getEventTime()), "kafka"); stream .keyBy(Event::getUserId) .window(TumblingEventTimeWindows.of(Time.minutes(1))) .aggregate(new CountAgg()) .sinkTo(icebergSink); ``` ### Flink SQL — streaming join ```sql SELECT o.order_id, o.amount, u.tier FROM orders o JOIN users FOR SYSTEM_TIME AS OF o.proc_time AS u ON o.user_id = u.id WHERE o.amount > 100; ``` ### Stateful processing — Flink ProcessFunction ```java public class FraudDetector extends KeyedProcessFunction { private ValueState lastAmount; @Override public void open(Configuration cfg) { lastAmount = getRuntimeContext().getState( new ValueStateDescriptor<>("last", Double.class)); } @Override public void processElement(Tx tx, Context ctx, Collector out) throws Exception { Double prev = lastAmount.value(); if (prev != null && tx.amount > prev * 10) { out.collect(new Alert(tx.id, "spike")); } lastAmount.update(tx.amount); } } ``` ### Exactly-once with Kafka transactions ```java producer.initTransactions(); try { producer.beginTransaction(); producer.send(record1); producer.send(record2); producer.sendOffsetsToTransaction(offsets, consumerGroup); producer.commitTransaction(); } catch (Exception e) { producer.abortTransaction(); } ``` ### Backpressure — Flink credit-based flow control ```java // Flink auto-handles via network buffers + credit env.setParallelism(8); env.getConfig().setAutoWatermarkInterval(200); env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE); ``` ### Lakehouse streaming sink — Iceberg ```java FlinkSink.forRowData(stream) .table(icebergTable) .tableLoader(loader) .upsert(true) .equalityFieldColumns(List.of("id")) .append(); ``` ## 매 결정 기준 | 상황 | Approach | |---|---| | Simple ETL, Kafka-native | Kafka Streams | | Complex CEP, large state | Flink | | Unified batch+stream | Flink / Beam | | SQL-only team | Flink SQL / ksqlDB | | Tiny scale | Single consumer + handler | **기본값**: 매 Kafka + Flink — 매 production-grade exactly-once streaming. ## 🔗 Graph - 부모: [[Distributed Systems]] · [[Event-Driven Architecture]] - 변형: [[Apache Flink]] ## 🤖 LLM 활용 **언제**: continuous unbounded data, sub-second latency, stateful aggregation. **언제 X**: hourly/daily batch (use Spark), tiny volumes (use cron). ## ❌ 안티패턴 - **Processing-time on lagged sources**: 매 watermark/event-time 의 사용. - **Unbounded state**: 매 TTL 의 set — state 의 무한 grow 의 OOM. - **Single-partition hot key**: 매 skew 의 partition rebalance. - **Sync external call in operator**: 매 AsyncIO 의 사용. ## 🧪 검증 / 중복 - Verified (Apache Flink docs, Kafka Streams Developer Guide 2026). - 신뢰도 A. ## 🕓 Changelog | 날짜 | 변경 | |---|---| | 2026-05-08 | Phase 1 | | 2026-05-10 | Manual cleanup — Kafka Streams + Flink patterns, EOS, watermarks |