---
id: wiki-2026-0508-big-data
title: Big Data
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Big Data, Large-Scale Data Processing]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [data, distributed-systems, analytics, lakehouse]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: spark,duckdb,iceberg,polars
---

# Big Data

## In one line

> **"It does not fit on a single machine and it does not fit in a single pass, so it requires distributed compute and columnar storage."** The field originated with Google's GFS (2003) and MapReduce (2004) papers and evolved from Hadoop to Spark to the lakehouse (Iceberg, Delta, Hudi). By 2026 the single-node "Big Data is dead" movement built on DuckDB and Polars has gone mainstream.

## Core

### The 5 Vs

- **Volume**: TB to PB scale.
- **Velocity**: streaming and near-real-time.
- **Variety**: structured, semi-structured, and unstructured data.
- **Veracity**: data quality and trust.
- **Value**: ROI of analytics.

### The 2026 stack

- **Storage**: object store (S3/GCS) + open table format (Iceberg · Delta · Hudi).
- **Compute**: Spark · Trino · DuckDB · Polars · Snowflake · BigQuery.
- **Orchestration**: Airflow · Dagster · Prefect.
- **Stream**: Kafka · Flink · Kinesis.
- **Catalog**: Unity · Polaris · Nessie · Glue.

### Applications

1. BI dashboards.
2. ML training pipelines (feature store).
3. Operational analytics (real-time fraud detection, ad bidding).
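
### Sketch: the MapReduce model

The one-liner traces the field back to MapReduce. As a rough illustration of that programming model (not Hadoop's or Google's actual implementation), the three phases can be sketched in plain Python. The function names, the `partitions` data, and the revenue-per-user task are all hypothetical, chosen only for illustration; a real cluster runs the map and reduce steps on different machines and performs the shuffle over the network.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map: emit (key, value) pairs for one input record."""
    user_id, amount = record
    yield user_id, amount


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse one key's values into a single result."""
    return key, sum(values)


def map_reduce(partitions):
    """Run map over every record of every partition, then shuffle, then reduce."""
    mapped = chain.from_iterable(
        map_phase(record) for part in partitions for record in part
    )
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Two hypothetical input splits, as if stored on two different machines.
partitions = [
    [("u1", 10), ("u2", 5)],
    [("u1", 7), ("u3", 1)],
]

revenue_by_user = map_reduce(partitions)
print(revenue_by_user)  # {'u1': 17, 'u2': 5, 'u3': 1}
```

Because each record is mapped independently and each key is reduced independently, both phases parallelize across machines; only the shuffle needs data to move between them.
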

## 💻 Patterns

### Writing an Iceberg table from Spark

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Requires the iceberg-spark-runtime JAR on the classpath.
spark = (
    SparkSession.builder
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "s3a://lake/warehouse")
    .getOrCreate()
)

df = spark.read.json("s3a://raw/events/*.json")

# Create (or replace) the table, partitioned by the `date` column.
df.writeTo("local.events.daily").partitionedBy(col("date")).createOrReplace()
```

### DuckDB: single-node "big data" (100 GB on a laptop)

```python
import duckdb

con = duckdb.connect()

# httpfs provides s3:// access; S3 credentials must be configured separately.
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

con.sql("""
    SELECT user_id,
           COUNT(*)    AS events,
           SUM(amount) AS revenue
    FROM read_parquet('s3://lake/events/2026/*.parquet')
    WHERE event_type = 'purchase'
    GROUP BY user_id
    ORDER BY revenue DESC
    LIMIT 100
""").show()
```

### Polars: out-of-core lazy

```python
import polars as pl

# scan_parquet builds a lazy plan; nothing is read until collect().
top_users = (
    pl.scan_parquet("s3://lake/events/*.parquet")
    .filter(pl.col("event_type") == "purchase")
    .group_by("user_id")
    .agg(
        pl.len().alias("events"),
        pl.col("amount").sum().alias("revenue"),
    )
    .sort("revenue", descending=True)
    .limit(100)
)

# Run on the streaming engine so the data need not fit in memory.
# Older Polars releases spell this collect(streaming=True).
print(top_users.collect(engine="streaming"))
```

### Flink: streaming aggregation

```java
// Fragment: env, kafkaSource, RevenueAggregator, and icebergSink are defined elsewhere.
DataStream<Event> events = env.fromSource(kafkaSource, ...);

events.keyBy(Event::userId)
    // 5-minute tumbling windows on event time.
    // Newer Flink releases take java.time.Duration here instead of Time.
    .window(TumblingEventTimeWindows.of(Time.minutes(5)))
    .aggregate(new RevenueAggregator())
    .sinkTo(icebergSink);
```

### Iceberg time-travel + schema evolution

```sql
-- Spark SQL; ADD PARTITION FIELD requires the Iceberg SQL extensions.

-- Query a snapshot by ID or by timestamp
SELECT * FROM events FOR VERSION AS OF 8723649283746;
SELECT * FROM events FOR TIMESTAMP AS OF '2026-05-09 00:00:00';

-- Add a column (metadata-only, no data rewrite)
ALTER TABLE events ADD COLUMN device_id STRING;

-- Partition evolution (applies to newly written data)
ALTER TABLE events ADD PARTITION FIELD bucket(16, user_id);
```

### Spark: dynamic partition pruning + AQE

```python
# AQE re-optimizes the query plan at runtime (on by default since Spark 3.2).
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Dynamic partition pruning (on by default since Spark 3.0).
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

df = spark.sql("""
    SELECT u.name, e.event_type, COUNT(*) AS events
    FROM events e
    JOIN users u ON e.user_id = u.id
    WHERE e.date >= '2026-05-01'
    GROUP BY u.name, e.event_type
""")
```

## Decision criteria

| Situation | Stack |
|---|---|
| < 100 GB · single node | DuckDB · Polars |
| 100 GB - 10 TB batch | Spark + Iceberg |
| > 10 TB / day | Spark/Trino + Iceberg + Snowflake |
| Streaming < 1 s latency | Flink + Kafka |
| Ad-hoc SQL | Trino · DuckDB |
| ML training | Spark + Petastorm or Ray Data |

**Default**: a hybrid of Iceberg on S3 with Spark/DuckDB, the standard modern lakehouse setup.

## 🔗 Graph

- Parent: [[Distributed Systems]] · [[Data Engineering]]
- Variants: [[Lakehouse]] · [[Data Warehouse]]
- Applications: [[Apache Ignite]]
- Adjacent: [[Append-only log]] · [[Stream-Processing-Architectures|Stream Processing]]

## 🤖 LLM usage

**When to use**: generating SQL, advising on partitioning strategy, explaining schema evolution diffs, drafting Iceberg table maintenance queries.

**When not to**: production tuning (shuffle partitions, executor sizing). This must be driven by metrics; treat an LLM hint as a starting point only.

## ❌ Anti-patterns

- **Premature distribution**: running Spark on data under 50 GB. Single-node DuckDB can be up to 100x faster, depending on the workload.
- **Small file problem**: Spark writing millions of ~1 KB Parquet files. Compaction is required.
- **Hive-style partition explosion**: partitioning on a high-cardinality column (e.g. `user_id`). Use Iceberg's bucket transform instead.
- **Over-reliance on schema-on-read**: erodes governance. Use an open table format.

## 🧪 Verification / duplicates

- Verified (Apache Iceberg/Spark/Flink docs, MotherDuck "Big Data is Dead" 2023, Databricks Lakehouse paper).
- Trust level A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: 5 Vs + 2026 lakehouse stack + DuckDB/Polars/Flink patterns |