"When data does not fit on a single machine, or cannot be processed in a single pass, it requires distributed compute and columnar storage." The field traces back to Google's MapReduce paper (published 2004, following the 2003 GFS paper) and evolved from Hadoop to Spark to the Lakehouse (Iceberg, Delta, Hudi). As of 2026, the single-node DuckDB/Polars "Big Data is dead" movement has gone mainstream.
Key points
The 5 Vs
Volume: TB-PB scale.
Velocity: streaming · near-real-time.
Variety: structured + semi-structured + unstructured.
Veracity: data quality · trust.
Value: ROI of analytics.
The 2026 stack
Storage: object store (S3/GCS) + open table format (Iceberg · Delta · Hudi).
import duckdb

# In-memory connection; reading s3:// paths needs the httpfs extension and S3 credentials.
con = duckdb.connect()
con.sql("""
    SELECT user_id, COUNT(*) AS events, SUM(amount) AS revenue
    FROM read_parquet('s3://lake/events/2026/*.parquet')
    WHERE event_type = 'purchase'
    GROUP BY user_id
    ORDER BY revenue DESC
    LIMIT 100
""").show()
-- Snapshot (time travel) queries
SELECT * FROM events FOR VERSION AS OF 8723649283746;
SELECT * FROM events FOR TIMESTAMP AS OF '2026-05-09 00:00:00';

-- Add a column (no data rewrite)
ALTER TABLE events ADD COLUMN device_id STRING;

-- Partition evolution
ALTER TABLE events ADD PARTITION FIELD bucket(16, user_id);
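To illustrate why `bucket(16, user_id)` keeps the partition count bounded, here is a minimal Python sketch. It is not Iceberg's implementation: the Iceberg spec hashes values with 32-bit Murmur3, while this sketch substitutes `zlib.crc32` from the standard library as a stand-in hash, so the bucket numbers will differ from what Iceberg assigns. The helper name `bucket_of` is hypothetical.

```python
import zlib

NUM_BUCKETS = 16  # matches bucket(16, user_id) in the DDL above


def bucket_of(user_id: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a user_id to a bucket number in [0, num_buckets).

    Assumption: zlib.crc32 stands in for Iceberg's Murmur3 hash, so these
    bucket numbers will NOT match the ones Iceberg computes.
    """
    raw = user_id.to_bytes(8, "little", signed=True)
    return zlib.crc32(raw) % num_buckets


# 100,000 distinct users still map to at most 16 partition values,
# whereas partitioning by raw user_id would create 100,000 of them.
buckets = {bucket_of(uid) for uid in range(100_000)}
print(len(buckets))
```

The point of the transform is that the number of partition values is fixed by the bucket count, not by the cardinality of the column.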
Spark: dynamic partition pruning + AQE
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Enable Adaptive Query Execution (AQE), which re-optimizes the plan at runtime
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

df = spark.sql("""
    SELECT u.name, e.event_type, COUNT(*)
    FROM events e JOIN users u ON e.user_id = u.id
    WHERE e.date >= '2026-05-01'
    GROUP BY u.name, e.event_type
""")
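As a rough illustration of what the `skewJoin` setting reacts to, the sketch below flags skewed shuffle partitions with a simple rule: a partition counts as skewed when it is larger than both a multiple of the median partition size and an absolute byte threshold. The factor (5) and threshold (256 MB) are assumed to mirror Spark 3.x defaults and should be checked against your Spark version. `find_skewed` is a hypothetical helper, not a Spark API.

```python
from statistics import median

MB = 1024 * 1024


def find_skewed(sizes, factor=5.0, threshold=256 * MB):
    """Return indices of partitions that look skewed.

    Assumption: factor and threshold mirror Spark 3.x defaults for
    skewedPartitionFactor and skewedPartitionThresholdInBytes.
    """
    if not sizes:
        return []
    med = median(sizes)
    return [i for i, s in enumerate(sizes) if s > factor * med and s > threshold]


# Nine 64 MB partitions and one 2 GB straggler.
sizes = [64 * MB] * 9 + [2048 * MB]
print(find_skewed(sizes))  # → [9]
```

A 300 MB partition among 64 MB peers would not be flagged here: it exceeds the byte threshold but not five times the median.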
Decision criteria
| Situation | Stack |
| --- | --- |
| < 100 GB, single node | DuckDB · Polars |
| 100 GB to 10 TB, batch | Spark + Iceberg |
| > 10 TB/day | Spark/Trino + Iceberg + Snowflake |
| Streaming, < 1 s latency | Flink + Kafka |
| Ad-hoc SQL | Trino · DuckDB |
| ML training | Spark + Petastorm, or Ray Data |
Default: a hybrid of Iceberg on S3 with Spark/DuckDB, the standard modern lakehouse setup.
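The decision criteria above can be sketched as a small lookup function. This is only one possible encoding: the function name, parameter names, the order in which rules are checked, and the fallback for datasets above 10 TB with no stated daily ingest are all assumptions, since the criteria do not specify them.

```python
from typing import Optional

TB_IN_GB = 1000  # decimal units; an assumption, the criteria do not say


def choose_stack(
    size_gb: float,
    *,
    daily_ingest_gb: float = 0.0,
    streaming_latency_s: Optional[float] = None,
    workload: str = "batch",
) -> str:
    """Pick a stack from the decision criteria (hypothetical helper)."""
    # Assumed precedence: latency and workload type win over data size.
    if streaming_latency_s is not None and streaming_latency_s < 1:
        return "Flink + Kafka"
    if workload == "adhoc_sql":
        return "Trino or DuckDB"
    if workload == "ml_training":
        return "Spark + Petastorm, or Ray Data"
    if daily_ingest_gb > 10 * TB_IN_GB:
        return "Spark/Trino + Iceberg + Snowflake"
    if size_gb < 100:
        return "DuckDB or Polars"
    if size_gb <= 10 * TB_IN_GB:
        return "Spark + Iceberg"
    # Not covered by the criteria; assumed to fall into the largest tier.
    return "Spark/Trino + Iceberg + Snowflake"


print(choose_stack(50))
print(choose_stack(500))
print(choose_stack(500, streaming_latency_s=0.5))
```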
When to use an LLM: generating SQL, advising on partitioning strategy, explaining schema evolution diffs, drafting Iceberg table maintenance queries.
When not to: production tuning (shuffle partitions, executor sizing). This must be metric-driven; treat an LLM hint as a starting point only.
❌ Anti-patterns
Premature distribution: running Spark on datasets under 50 GB, where single-node DuckDB can be up to 100x faster.
Small file problem: Spark writing millions of 1 KB Parquet files; compaction is required.
Hive-style partition explosion: partitioning on a high-cardinality column (e.g. user_id); use Iceberg's bucket transform instead.
Over-reliance on schema-on-read: governance erodes; use an open table format.
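The small file problem is typically fixed by rewriting many small files into fewer files near a target size. Below is a minimal sketch of the planning step, assuming a greedy pass over file sizes and a 128 MB target. Both are illustrative choices rather than the algorithm or defaults of Iceberg or Spark, and `plan_compaction` is a hypothetical helper.

```python
MB = 1024 * 1024


def plan_compaction(file_sizes, target=128 * MB):
    """Greedily group files so each group totals at most `target` bytes.

    A single file already larger than `target` ends up alone in its group.
    Assumption: the 128 MB target is illustrative, not an engine default.
    """
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        if current and current_size + size > target:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


small_files = [1024] * 1_000_000  # one million 1 KB files
groups = plan_compaction(small_files)
print(len(small_files), "files ->", len(groups), "output files")
```

Each group would then be rewritten as one output file, which cuts per-file metadata and open/close overhead at read time.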
🧪 Verification / duplicates
Verified (Apache Iceberg/Spark/Flink docs, MotherDuck "Big Data is Dead" 2023, Databricks Lakehouse paper).