Files

T

koriweb d8a80f6272 chore(wiki): dangling 링크 canonical 정규화 (768파일/1200건)

이름만 다른(표기 변형) [[위키링크]]를 대상 문서의 canonical 제목으로 치환해
끊겼던 1,200개 링크를 연결. 제목/파일명 정규화 일치만 적용하고 별칭 매칭은
과병합 위험으로 제외(애매성 가드). 원본은 _link_reconcile_backup/ 에 백업.
도구: Datacollect/scripts/link_reconcile_apply.mjs

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-08 12:24:15 +09:00

6.0 KiB

Raw Blame History

id, title, category, status, canonical_id, aliases, duplicate_of, source_trust_level, confidence_score, verification_status, tags, raw_sources, last_reinforced, github_commit, tech_stack

title

Real-time Data Streaming

매 한 줄

"매 batch ETL 의 X — 매 unbounded events 매 milliseconds latency 매 process". Kafka (LinkedIn 2010) → Flink / Spark Structured Streaming / Pulsar / Materialize / RisingWave 매 modern stack. 매 2026 매 sub-second analytics 매 default.

매 핵심

매 layers

Ingest: Kafka, Pulsar, Kinesis, Redpanda — 매 durable log.
Process: Flink, Spark Streaming, Kafka Streams, Bytewax, Arroyo.
Serve: Materialize, RisingWave, Pinot, Druid, ClickHouse.

매 windowing

Tumbling: fixed, non-overlapping (1-min buckets).
Sliding: overlapping (5-min, slide 1-min).
Session: gap-based (user activity until 30s inactivity).
Hopping: same as sliding (different name).

매 time semantics

Event time: 매 actual occurrence — 매 correctness 위해 default.
Processing time: 매 broker arrival — 매 latency 측정.
Ingestion time: 매 broker append.
Watermarks: 매 lateness threshold — Flink/Beam 핵심.

매 응용

Fraud detection — 매 payment stream + ML inference.
Real-time dashboards — Materialize + Grafana.
CDC pipelines — Debezium → Kafka → warehouse.
IoT telemetry — MQTT → Kafka → anomaly detection.
Personalization — clickstream → feature store → model.

💻 패턴

Kafka Streams (Python via Faust / Bytewax)

import bytewax.operators as op
from bytewax.dataflow import Dataflow
from bytewax.connectors.kafka import KafkaSource, KafkaSink

flow = Dataflow("fraud")
src = op.input("in", flow, KafkaSource(["localhost:9092"], ["payments"]))
parsed = op.map("parse", src, lambda kv: json.loads(kv.value))
flagged = op.filter("flag", parsed, lambda p: p["amount"] > 10000)
op.output("out", flagged, KafkaSink(["localhost:9092"], "alerts"))

Flink SQL (windowed aggregation)

CREATE TABLE clicks (
  user_id STRING, ts TIMESTAMP(3),
  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
) WITH ('connector' = 'kafka', ...);

SELECT
  user_id,
  TUMBLE_START(ts, INTERVAL '1' MINUTE) AS w_start,
  COUNT(*) AS clicks
FROM clicks
GROUP BY user_id, TUMBLE(ts, INTERVAL '1' MINUTE);

Materialize (streaming SQL view)

CREATE SOURCE orders FROM KAFKA BROKER 'kafka:9092' TOPIC 'orders'
  FORMAT AVRO USING SCHEMA REGISTRY 'http://sr:8081';

CREATE MATERIALIZED VIEW revenue_5min AS
SELECT
  date_trunc('minute', ts) AS minute,
  SUM(amount) AS revenue
FROM orders
WHERE ts > now() - INTERVAL '5 minutes'
GROUP BY 1;

-- Subscribe to changes
SUBSCRIBE TO revenue_5min;

Spark Structured Streaming

df = (spark.readStream.format("kafka")
        .option("subscribe", "events")
        .load())
agg = (df.selectExpr("CAST(value AS STRING) as json")
         .select(from_json("json", schema).alias("e"))
         .withWatermark("e.ts", "10 minutes")
         .groupBy(window("e.ts", "1 minute"), "e.user")
         .count())
agg.writeStream.format("delta").outputMode("append").start("/lake/agg")

Pulsar Functions (lightweight processing)

from pulsar import Function

class EnrichOrder(Function):
    def process(self, msg, ctx):
        order = json.loads(msg)
        order["region"] = lookup_region(order["zip"])
        ctx.publish("orders.enriched", json.dumps(order))

Exactly-once with Kafka transactions

producer = KafkaProducer(transactional_id="tx-1", enable_idempotence=True)
producer.init_transactions()
try:
    producer.begin_transaction()
    producer.send("out", value=processed)
    producer.send_offsets_to_transaction(offsets, group_id)
    producer.commit_transaction()
except:
    producer.abort_transaction()

CDC with Debezium

# connector config
connector.class: io.debezium.connector.postgresql.PostgresConnector
database.hostname: pg
table.include.list: public.orders
plugin.name: pgoutput
# emits CDC events to Kafka topic "pg.public.orders"

매 결정 기준

상황	Stack
매 simple transform	Kafka Streams / Bytewax
매 complex windowing, joins	Flink
매 SQL-first analytics	Materialize / RisingWave
매 batch+stream unified	Spark Structured / Beam
매 lightweight, serverless	Pulsar Functions / AWS Lambda
매 OLAP serving	Pinot / Druid / ClickHouse

기본값: 매 2026 매 SQL-on-streams (Materialize/RisingWave) 매 default — DX 압도적.

🔗 Graph

부모: Data Engineering · Event-Driven Architecture
변형: Stream-Processing-Architectures · CEP
응용: CDC
Adjacent: Kafka · Materialize

🤖 LLM 활용

언제: 매 SQL DDL/query generation, 매 schema evolution analysis, 매 anomaly investigation summarization. 언제 X: 매 latency-critical hot path — LLM inference 매 too slow. 매 trained ML model 사용.

❌ 안티패턴

Processing time everywhere: 매 out-of-order events 매 wrong results — event time + watermarks 사용.
Unbounded state: 매 keyed state 매 grows forever — TTL / windows 필수.
Tiny files: 매 1 record / file → S3 explosion. 매 batching + compaction.
Sync external calls in pipeline: 매 backpressure 폭발. 매 async + bulkhead.
No replay strategy: 매 bad code → poisoned downstream. 매 reset offset + idempotent sinks.

🧪 검증 / 중복

Verified (Akidau — "Streaming 101/102"; Kafka docs; Flink docs 1.18+).
신뢰도 A.

🕓 Changelog

날짜	변경
2026-05-08	Phase 1
2026-05-10	Manual cleanup — full streaming entry with Materialize/RisingWave

6.0 KiB Raw Blame History