"Not batch ETL: process unbounded events with millisecond latency." Kafka (LinkedIn, 2010) was followed by Flink, Spark Structured Streaming, Pulsar, Materialize, and RisingWave, which together make up the modern stack. As of 2026, sub-second analytics is the default expectation.
Core concepts
Layers
Ingest: Kafka, Pulsar, Kinesis, Redpanda. Each provides a durable log.
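As a rough illustration of what "durable log" means at the ingest layer, the sketch below models an append-only log in plain Python. The `DurableLog` and `Consumer` classes are hypothetical names invented for this example, not the API of Kafka or any other broker, and real systems add partitioning, replication, and persistence that are omitted here.

```python
class DurableLog:
    """Toy append-only log: records are never removed by reads."""

    def __init__(self):
        self._records = []

    def append(self, record):
        # Append the record and return its offset (its position in the log).
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=100):
        # Reading is non-destructive; any offset can be read again later.
        return self._records[offset:offset + max_records]


class Consumer:
    """Each consumer tracks its own offset, independent of other consumers."""

    def __init__(self, log, offset=0):
        self.log = log
        self.offset = offset

    def poll(self, max_records=100):
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch
```

Because the consumer owns its offset, rewinding is just an assignment (`consumer.offset = 0`), which is the property that makes replay possible further down the pipeline.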
Materialize

    CREATE SOURCE orders
    FROM KAFKA BROKER 'kafka:9092' TOPIC 'orders'
    FORMAT AVRO USING SCHEMA REGISTRY 'http://sr:8081';

    CREATE MATERIALIZED VIEW revenue_5min AS
    SELECT date_trunc('minute', ts) AS minute,
           SUM(amount) AS revenue
    FROM orders
    WHERE ts > now() - INTERVAL '5 minutes'
    GROUP BY 1;

    -- Subscribe to changes
    SUBSCRIBE TO revenue_5min;
Spark Structured Streaming
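    from pyspark.sql.functions import from_json, window

    # Assumes `spark` (a SparkSession) and `schema` (a StructType with a
    # timestamp field `ts` and a `user` field) are defined elsewhere.
    df = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "kafka:9092")  # required by the Kafka source; address assumed
          .option("subscribe", "events")
          .load())

    agg = (df.selectExpr("CAST(value AS STRING) AS json")
           .select(from_json("json", schema).alias("e"))
           .select("e.*")  # flatten so the watermark column is top-level
           .withWatermark("ts", "10 minutes")
           .groupBy(window("ts", "1 minute"), "user")
           .count())

    (agg.writeStream.format("delta")
     .outputMode("append")
     .option("checkpointLocation", "/lake/agg/_checkpoint")  # assumed path; needed for recovery
     .start("/lake/agg"))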
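To make the watermark in the query above more concrete, here is a minimal pure-Python sketch of one-minute tumbling windows with ten minutes of allowed lateness. It is a simplification under stated assumptions: the watermark advances per event rather than per micro-batch as in Spark, and `run_windows`, `WINDOW`, and `LATENESS` are hypothetical names chosen for this example.

```python
from collections import defaultdict

WINDOW = 60      # tumbling window size in seconds
LATENESS = 600   # allowed lateness in seconds (the "10 minutes" above)


def run_windows(events):
    """events: iterable of (event_time_seconds, user) in arrival order.

    Returns (emitted, dropped): finalized (window_start, user, count) rows
    and the events that arrived after their window was finalized.
    """
    counts = defaultdict(int)  # (window_start, user) -> count, open windows only
    emitted, dropped = [], []
    max_ts = 0
    for ts, user in events:
        max_ts = max(max_ts, ts)
        watermark = max_ts - LATENESS
        start = ts - ts % WINDOW
        if start + WINDOW <= watermark:
            dropped.append((ts, user))  # too late: its window is already closed
        else:
            counts[(start, user)] += 1
        # Append-mode output: emit each window once, then evict its state.
        closed = sorted(k for k in counts if k[0] + WINDOW <= watermark)
        for key in closed:
            emitted.append((key[0], key[1], counts.pop(key)))
    return emitted, dropped


events = [(1000, "a"), (1005, "a"), (990, "b"), (1700, "a"), (1010, "a")]
emitted, dropped = run_windows(events)
# emitted == [(960, "a", 2), (960, "b", 1)]; dropped == [(1010, "a")]
```

The out-of-order event at 990 is still counted because the watermark has not passed its window, while the event at 1010 arrives after the watermark reached 1100 and is discarded. Evicting closed windows is also what keeps state bounded.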
When to use an LLM: SQL DDL and query generation, schema evolution analysis, and summarizing anomaly investigations.
When not to: the latency-critical hot path. LLM inference is too slow there; use a trained ML model instead.
❌ Anti-patterns
Processing time everywhere: out-of-order events produce wrong results. Use event time with watermarks.
Unbounded state: keyed state grows forever. TTLs or windows are required.
Tiny files: one record per file causes an explosion of S3 objects. Batch writes and run compaction.
Sync external calls in the pipeline: backpressure blows up. Use async calls with bulkheads.
No replay strategy: bad code poisons downstream systems. Be able to reset offsets, and make sinks idempotent.
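The replay point can be sketched in a few lines of Python. This is an illustrative model, assuming the sink supports upserts keyed by the source position; `IdempotentSink` and `run_job` are hypothetical names, and a real pipeline would use the key-based upsert or merge facility of its actual sink.

```python
class IdempotentSink:
    """Toy sink that upserts by (partition, offset) instead of appending."""

    def __init__(self):
        self.rows = {}
        self.writes = 0

    def upsert(self, partition, offset, value):
        # Writing the same key again overwrites the old row, so a replay
        # corrects bad output rather than duplicating it.
        self.rows[(partition, offset)] = value
        self.writes += 1


def run_job(log, sink, transform, start_offset=0, partition=0):
    """Process the log from start_offset onward and write results to the sink."""
    for offset in range(start_offset, len(log)):
        sink.upsert(partition, offset, transform(log[offset]))


log = [{"amount": 10}, {"amount": 25}, {"amount": 5}]
sink = IdempotentSink()

run_job(log, sink, lambda r: r["amount"] * 100)  # buggy deploy: wrong unit conversion
run_job(log, sink, lambda r: r["amount"])        # fixed deploy, replayed from offset 0
```

After the replay the sink holds three corrected rows even though six writes were issued. With an append-only sink the same replay would leave three poisoned rows alongside the three corrected ones.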