Antigravity Agent f8b21af4be Wiki cleanup: error-doc removal, dedup merge, link normalization
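Large-scale cleanup of 10_Wiki/Topics:
- Removed 227 error-capture and unfinished stub documents
- Merged 43 cross-folder duplicate clusters (63 files → redirect)
- Normalized link names: fixed broken links, pointed links directly at redirect targets, and mapped concepts (~2,400 changes)
- Created 6 new category MOCs
- Removed 10,058 unresolved related-keyword links from Graph sections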

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 23:52:15 +09:00


id: wiki-2026-0508-big-data
title: Big Data
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: Big Data, Large-Scale Data Processing
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: data, distributed-systems, analytics, lakehouse
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language: python; framework: spark, duckdb, iceberg, polars

Big Data

One-line summary

"The data fits neither on a single machine nor in a single pass, so it requires distributed compute and columnar storage." The lineage starts with Google's GFS (2003) and MapReduce (2004) papers and runs through Hadoop → Spark → the lakehouse (Iceberg, Delta, Hudi). By 2026, the "Big Data is dead" movement built on single-node DuckDB and Polars has gone mainstream.

Core concepts

The 5 Vs

  • Volume: TB-PB scale.
  • Velocity: streaming · near-real-time.
  • Variety: structured + semi + unstructured.
  • Veracity: data quality · trust.
  • Value: ROI of analytics.

The 2026 stack

  • Storage: object store (S3/GCS) + open table format (Iceberg · Delta · Hudi).
  • Compute: Spark · Trino · DuckDB · Polars · Snowflake · BigQuery.
  • Orchestration: Airflow · Dagster · Prefect.
  • Stream: Kafka · Flink · Kinesis.
  • Catalog: Unity · Polaris · Nessie · Glue.

Applications

  1. BI dashboards.
  2. ML training pipeline (feature store).
  3. Operational analytics (real-time fraud, ad bidding).

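💻 Patterns

Writing an Iceberg table from Spark

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Register a Hadoop-type Iceberg catalog named "local".
# The Iceberg Spark runtime JAR must be on the classpath.
spark = (
    SparkSession.builder
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "s3a://lake/warehouse")
    .getOrCreate()
)

# Read raw JSON events and write them as an Iceberg table partitioned by date.
df = spark.read.json("s3a://raw/events/*.json")
df.writeTo("local.events.daily").partitionedBy(col("date")).createOrReplace()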

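DuckDB: single-node "big data" (100 GB on a laptop)

import duckdb

con = duckdb.connect()  # in-memory database

# s3:// paths need the httpfs extension plus S3 credentials.
# Recent DuckDB versions can autoload it; loading explicitly is harmless.
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

# Top 100 users by purchase revenue, scanned directly from Parquet.
con.sql("""
    SELECT user_id, COUNT(*) AS events, SUM(amount) AS revenue
    FROM read_parquet('s3://lake/events/2026/*.parquet')
    WHERE event_type = 'purchase'
    GROUP BY user_id
    ORDER BY revenue DESC
    LIMIT 100
""").show()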

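Polars: out-of-core lazy query

import polars as pl

# Build a lazy query plan; nothing is read until collect() is called.
lf = (
    pl.scan_parquet("s3://lake/events/*.parquet")
    .filter(pl.col("event_type") == "purchase")
    .group_by("user_id")
    .agg(pl.len().alias("events"), pl.col("amount").sum().alias("revenue"))
    .sort("revenue", descending=True)
    .limit(100)
)
# The streaming engine processes data in batches, so the input need not fit in memory.
# Older Polars releases spell this collect(streaming=True).
print(lf.collect(engine="streaming"))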
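
Flink: tumbling-window aggregation from Kafka to Iceberg

// Read events from Kafka (remaining fromSource arguments are elided in the original).
DataStream<Event> events = env.fromSource(kafkaSource, ...);

// Per user, aggregate revenue over 5-minute event-time tumbling windows
// and write each window result to the Iceberg sink.
events.keyBy(Event::userId)
      .window(TumblingEventTimeWindows.of(Time.minutes(5)))
      .aggregate(new RevenueAggregator())
      .sinkTo(icebergSink);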
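
To make the windowing step concrete, here is a minimal pure-Python sketch of what a keyed 5-minute tumbling window computes. It is not Flink code: it ignores watermarks, late events, and state backends, and the `(user_id, event_ts, amount)` tuple layout is a hypothetical stand-in for the `Event` type.

```python
from collections import defaultdict

WINDOW_SECONDS = 5 * 60  # 5-minute tumbling windows, as in the Flink job


def window_start(event_ts: int, size: int = WINDOW_SECONDS) -> int:
    """Return the start (epoch seconds) of the tumbling window containing event_ts."""
    return event_ts - (event_ts % size)


def revenue_per_window(events):
    """Sum amount per (user_id, window_start).

    `events` is an iterable of (user_id, event_ts, amount) tuples
    (a hypothetical layout used only for this sketch).
    """
    totals = defaultdict(float)
    for user_id, event_ts, amount in events:
        totals[(user_id, window_start(event_ts))] += amount
    return dict(totals)


events = [
    ("u1", 0, 10.0),
    ("u1", 299, 5.0),  # same window as ts=0
    ("u1", 300, 7.0),  # first second of the next window
    ("u2", 120, 3.0),
]
result = revenue_per_window(events)
```

Each event belongs to exactly one window because tumbling windows do not overlap; that is the property that keeps per-key state small.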

Iceberg time-travel + schema evolution

-- Query a past snapshot (time travel)
SELECT * FROM events FOR VERSION AS OF 8723649283746;
SELECT * FROM events FOR TIMESTAMP AS OF '2026-05-09 00:00:00';

-- Add a column (metadata-only change, no data rewrite)
ALTER TABLE events ADD COLUMN device_id STRING;

-- Partition evolution (requires the Iceberg Spark SQL extensions)
ALTER TABLE events ADD PARTITION FIELD bucket(16, user_id);
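
As a rough mental model of the timestamp-based time travel above, the sketch below resolves a timestamp to the snapshot that was current at that moment. It is a toy model, not Iceberg's implementation: the `Snapshot` class, its field names, and the sample history are all hypothetical.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    committed_at: int  # epoch seconds; hypothetical field name


def snapshot_as_of(history, ts):
    """Return the latest snapshot committed at or before ts, or None.

    `history` must be sorted by commit time, oldest first.
    """
    times = [s.committed_at for s in history]
    idx = bisect_right(times, ts)
    return history[idx - 1] if idx > 0 else None


history = [
    Snapshot(snapshot_id=101, committed_at=1_000),
    Snapshot(snapshot_id=102, committed_at=2_000),
    Snapshot(snapshot_id=103, committed_at=3_000),
]
```

A query "as of" a time before the first commit has no snapshot to read, which the sketch models by returning `None`.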

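Spark: dynamic partition pruning + AQE

# AQE is enabled by default since Spark 3.2; set explicitly here for clarity.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
# Dynamic partition pruning (also enabled by default).
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

# AQE re-optimizes the plan at runtime using shuffle statistics.
df = spark.sql("""
    SELECT u.name, e.event_type, COUNT(*) AS events
    FROM events e JOIN users u ON e.user_id = u.id
    WHERE e.date >= '2026-05-01'
    GROUP BY u.name, e.event_type
""")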

Decision criteria

| Situation | Stack |
|---|---|
| < 100 GB, single node | DuckDB · Polars |
| 100 GB to 10 TB, batch | Spark + Iceberg |
| > 10 TB / day | Spark/Trino + Iceberg + Snowflake |
| Streaming, < 1 s latency | Flink + Kafka |
| Ad-hoc SQL | Trino · DuckDB |
| ML training | Spark + Petastorm or Ray Data |

Default: a hybrid of Iceberg on S3 with Spark and DuckDB, the standard setup for a modern lakehouse.
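
The decision criteria can be encoded as a small lookup function. This is only a sketch: the `workload` labels are hypothetical, checking workload type before data size is an assumption, and the "> 10 TB / day" row is treated as a plain size threshold, which simplifies the original criteria.

```python
def choose_stack(workload: str, size_gb: float = 0.0) -> str:
    """Map a workload type and data size to a suggested stack.

    `workload` is one of "batch", "streaming", "ad_hoc_sql", "ml_training"
    (hypothetical labels for this sketch).
    """
    if workload == "streaming":
        return "Flink + Kafka"
    if workload == "ad_hoc_sql":
        return "Trino or DuckDB"
    if workload == "ml_training":
        return "Spark + Petastorm or Ray Data"
    if workload != "batch":
        raise ValueError(f"unknown workload: {workload!r}")

    # Batch workloads are split by data size (10 TB = 10,000 GB).
    if size_gb < 100:
        return "DuckDB or Polars"
    if size_gb <= 10_000:
        return "Spark + Iceberg"
    return "Spark/Trino + Iceberg + Snowflake"
```

For example, `choose_stack("batch", 50)` suggests a single-node engine, while `choose_stack("batch", 500)` suggests Spark with Iceberg.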

🔗 Graph

🤖 LLM usage

When to use: generating SQL, advising on partitioning strategy, explaining schema-evolution diffs, drafting Iceberg table-maintenance queries. When not to: production tuning (shuffle partitions, executor sizing). That work must be driven by metrics; an LLM hint is only a starting point.

Anti-patterns

  • Premature distribution: running Spark on less than 50 GB, where DuckDB can be up to 100x faster.
  • Small file problem: Spark writing millions of ~1 KB Parquet files; compaction is required.
  • Hive-style partition explosion: partitioning on a high-cardinality column (e.g. user_id); use the Iceberg bucket transform instead.
  • Over-reliance on schema-on-read: governance erodes; use an open table format.
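
The partition-explosion point can be illustrated with a short sketch that compares partition counts for identity partitioning against a 16-way bucket transform. The Iceberg spec defines its bucket transform with a 32-bit Murmur3 hash; `zlib.crc32` is swapped in here only to keep the example stdlib-only, so the bucket numbers will not match Iceberg's. The `user-N` ids are made-up sample data.

```python
import zlib


def bucket(value: str, num_buckets: int = 16) -> int:
    """Assign a value to one of num_buckets buckets.

    Stand-in hash: Iceberg uses 32-bit Murmur3; crc32 is used here
    only for illustration.
    """
    return zlib.crc32(value.encode("utf-8")) % num_buckets


# Hypothetical high-cardinality column: 100,000 distinct user ids.
user_ids = [f"user-{i}" for i in range(100_000)]

# Identity (Hive-style) partitioning: one partition per distinct value.
identity_partitions = len(set(user_ids))

# Bucket transform: the partition count is capped at num_buckets.
bucket_partitions = len({bucket(u) for u in user_ids})
```

The partition count under bucketing stays bounded no matter how many users exist, while each user's rows still land in a single, predictable bucket.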

🧪 Verification / duplicates

  • Verified (Apache Iceberg/Spark/Flink docs, MotherDuck "Big Data is Dead" 2023, Databricks Lakehouse paper).
  • Trust level: A.

🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: 5V + 2026 lakehouse stack + DuckDB/Polars/Flink patterns |