2nd/10_Wiki/Topics/Architecture/Big-Data.md

---
id: wiki-2026-0508-big-data
title: Big Data
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Big Data, Large-Scale Data Processing]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [data, distributed-systems, analytics, lakehouse]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: spark,duckdb,iceberg,polars
---

# Big Data

## One-liner
> **"Data that fits neither on a single machine nor in a single pass requires distributed compute and columnar storage."** The idea originates with Google's 2004 MapReduce paper and evolved through Hadoop, then Spark, then the lakehouse (Iceberg, Delta, Hudi). By 2026, the single-node "Big Data is dead" movement around DuckDB and Polars has gone mainstream.

## Core concepts

### The 5 Vs
- **Volume**: TB-PB scale.
- **Velocity**: streaming · near-real-time.
- **Variety**: structured + semi + unstructured.
- **Veracity**: data quality · trust.
- **Value**: ROI of analytics.

### The 2026 stack
- **Storage**: object store (S3/GCS) + open table format (Iceberg · Delta · Hudi).
- **Compute**: Spark · Trino · DuckDB · Polars · Snowflake · BigQuery.
- **Orchestration**: Airflow · Dagster · Prefect.
- **Stream**: Kafka · Flink · Kinesis.
- **Catalog**: Unity · Polaris · Nessie · Glue.

### Applications
1. BI dashboards.
2. ML training pipeline (feature store).
3. Operational analytics (real-time fraud, ad bidding).

## 💻 Patterns

### Writing an Iceberg table from Spark
```python
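from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Register a Hadoop-type Iceberg catalog named "local" backed by S3.
# The Iceberg Spark runtime jar must be on the classpath.
spark = (
    SparkSession.builder
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "s3a://lake/warehouse")
    .getOrCreate()
)

# Read raw JSON events, then create or replace the table, partitioned by `date`.
df = spark.read.json("s3a://raw/events/*.json")
df.writeTo("local.events.daily").partitionedBy(col("date")).createOrReplace()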
```

### DuckDB: single-node "big data" (100 GB on a laptop)
```python
import duckdb

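# In-memory database; reading s3:// paths relies on the httpfs extension
# and on S3 credentials being configured.
con = duckdb.connect()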
con.sql("""
    SELECT user_id, COUNT(*) AS events, SUM(amount) AS revenue
    FROM read_parquet('s3://lake/events/2026/*.parquet')
    WHERE event_type = 'purchase'
    GROUP BY user_id
    ORDER BY revenue DESC
    LIMIT 100
""").show()
```

### Polars: out-of-core lazy
```python
import polars as pl

df = (
    pl.scan_parquet("s3://lake/events/*.parquet")
    .filter(pl.col("event_type") == "purchase")
    .group_by("user_id")
    .agg(pl.len().alias("events"), pl.col("amount").sum().alias("revenue"))
    .sort("revenue", descending=True)
    .limit(100)
)
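# Run on the streaming engine so the data need not fit in memory.
# Older Polars releases spelled this collect(streaming=True).
print(df.collect(engine="streaming"))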
```

### Flink: streaming aggregation
```java
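// Read events from Kafka (remaining fromSource arguments omitted).
DataStream<Event> events = env.fromSource(kafkaSource, ...);

// Aggregate per user over 5-minute tumbling event-time windows, then write to Iceberg.
events.keyBy(Event::userId)
      .window(TumblingEventTimeWindows.of(Time.minutes(5)))
      .aggregate(new RevenueAggregator())
      .sinkTo(icebergSink);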
```
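
To make the window assignment concrete, here is a minimal Python sketch of what a 5-minute tumbling window computes. It is not Flink code: it ignores watermarks, late events, and state handling, and the `(user_id, event_ts, amount)` tuple layout is a hypothetical stand-in for the `Event` class above.

```python
from collections import defaultdict

WINDOW_SECONDS = 5 * 60  # same width as the 5-minute window in the Flink job


def window_start(event_ts: int, size: int = WINDOW_SECONDS) -> int:
    """Return the start (epoch seconds) of the tumbling window containing event_ts."""
    return event_ts - (event_ts % size)


def tumbling_revenue(events):
    """Sum amounts per (user_id, window_start).

    `events` is an iterable of (user_id, event_ts, amount) tuples
    (hypothetical layout, chosen only for this illustration).
    """
    totals = defaultdict(float)
    for user_id, event_ts, amount in events:
        totals[(user_id, window_start(event_ts))] += amount
    return dict(totals)
```

Because tumbling windows do not overlap, each event lands in exactly one window, which is why a plain dictionary keyed by `(user_id, window_start)` is enough for the sketch.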

### Iceberg time-travel + schema evolution
```sql
-- query a past snapshot by ID or by timestamp
SELECT * FROM events FOR VERSION AS OF 8723649283746;
SELECT * FROM events FOR TIMESTAMP AS OF '2026-05-09 00:00:00';

-- column add (no rewrite)
ALTER TABLE events ADD COLUMN device_id STRING;

-- partition evolution (requires the Iceberg Spark SQL extensions)
ALTER TABLE events ADD PARTITION FIELD bucket(16, user_id);
```
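
The `bucket(16, user_id)` transform above bounds the number of partitions regardless of how many distinct users exist. The sketch below illustrates that idea only. Iceberg's spec hashes with 32-bit Murmur3; CRC32 is used here as a stand-in (an assumption for the sake of a stdlib-only example), so the bucket numbers will not match what Iceberg produces.

```python
import zlib

NUM_BUCKETS = 16  # same bucket count as bucket(16, user_id) above


def bucket(value: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a value to a bucket in [0, num_buckets).

    CRC32 is a stand-in for Iceberg's Murmur3 hash; only the
    "hash, then modulo N" shape is the same.
    """
    return zlib.crc32(value.encode("utf-8")) % num_buckets


# Hypothetical user IDs, generated only for this illustration.
user_ids = [f"user-{i}" for i in range(10_000)]

identity_partitions = len(set(user_ids))                 # one partition per user
bucket_partitions = len({bucket(u) for u in user_ids})   # at most NUM_BUCKETS
```

Partitioning directly on `user_id` would create one partition per user, while the bucketed layout never exceeds 16.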

### Spark: dynamic partition pruning + AQE
```python
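# Adaptive Query Execution: re-optimize the plan at runtime using shuffle statistics.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
# Dynamic partition pruning (on by default in Spark 3.x; set here for clarity).
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

# AQE re-optimizes this join and aggregation at runtime.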
df = spark.sql("""
    SELECT u.name, e.event_type, COUNT(*)
    FROM events e JOIN users u ON e.user_id = u.id
    WHERE e.date >= '2026-05-01'
    GROUP BY u.name, e.event_type
""")
```
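
As a rough illustration of what the `skewJoin` setting does, the sketch below flags a shuffle partition as skewed when it is both larger than a factor times the median partition size and larger than an absolute threshold, then estimates how many pieces it would be split into. The function name is hypothetical, and the default values (factor 5, 256 MB threshold, 64 MB target) are assumptions modelled on Spark's documented defaults; Spark's real planner is considerably more involved.

```python
import math
from statistics import median

MB = 1024 * 1024


def plan_skew_splits(partition_sizes, factor=5.0, threshold=256 * MB, target=64 * MB):
    """Return {partition_index: number_of_splits} for partitions judged skewed.

    A partition counts as skewed only if it exceeds BOTH factor * median
    and the absolute threshold (all sizes in bytes).
    """
    med = median(partition_sizes)
    plan = {}
    for idx, size in enumerate(partition_sizes):
        if size > factor * med and size > threshold:
            plan[idx] = math.ceil(size / target)
    return plan
```

For example, with partition sizes of 64, 64, 64, 1024, and 80 MB, only the 1024 MB partition is flagged, and it would be split into 16 pieces of about 64 MB each.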

## Decision criteria
| Scenario | Stack |
|---|---|
| < 100 GB · single node | DuckDB · Polars |
| 100GB - 10TB batch | Spark + Iceberg |
| > 10 TB / day | Spark/Trino + Iceberg + Snowflake |
| Streaming < 1s latency | Flink + Kafka |
| Ad-hoc SQL | Trino · DuckDB |
| ML training | Spark + Petastorm or Ray Data |

**Default**: a hybrid of Iceberg on S3 with Spark and DuckDB, the standard modern lakehouse setup.
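
The decision table can be read as a small lookup function. The sketch below is one possible encoding, not a rule from any tool: the function name, the `workload` labels, the precedence of workload type over data size, and the treatment of the "> 10 TB / day" row as simply "more than 10 TB" are all simplifying assumptions.

```python
GB_PER_TB = 1000  # decimal terabytes, an assumption for this sketch


def choose_stack(data_gb: float, workload: str = "batch") -> str:
    """Pick a stack from the decision table above.

    Workload-specific rows take precedence over the size-based rows;
    that ordering is an assumption, since the table does not rank them.
    """
    by_workload = {
        "streaming": "Flink + Kafka",
        "ad_hoc_sql": "Trino / DuckDB",
        "ml_training": "Spark + Petastorm or Ray Data",
    }
    if workload in by_workload:
        return by_workload[workload]
    if workload != "batch":
        raise ValueError(f"unknown workload: {workload!r}")
    if data_gb < 100:
        return "DuckDB / Polars"
    if data_gb <= 10 * GB_PER_TB:
        return "Spark + Iceberg"
    return "Spark/Trino + Iceberg + Snowflake"
```

For instance, `choose_stack(40)` returns `"DuckDB / Polars"`, while `choose_stack(2_000)` returns `"Spark + Iceberg"`.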

## 🔗 Graph
- Parent: [[Distributed Systems]] · [[Data Engineering]]
- Variants: [[Lakehouse]] · [[Data Warehouse]]
- Applications: [[Apache Ignite]]
- Adjacent: [[Append-only log]] · [[Stream-Processing-Architectures|Stream Processing]]

## 🤖 LLM usage
**When to use**: generating SQL, advising on partitioning strategy, explaining schema evolution diffs, drafting Iceberg table maintenance queries.
**When not to use**: production tuning (shuffle partitions, executor sizing). That work must be metric-driven; treat LLM hints only as a starting point.

## ❌ Anti-patterns
- **Premature distribution**: running Spark on less than 50 GB of data; DuckDB can be up to 100x faster at that scale.
- **Small file problem**: Spark jobs that leave millions of ~1 KB Parquet files; compaction is required.
- **Hive-style partition explosion**: partitioning on a high-cardinality column (e.g. user_id); use an Iceberg bucket transform instead.
- **Over-reliance on schema-on-read**: governance erodes; use an open table format.
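
For the small file problem, compaction amounts to grouping many small files into rewrite batches near a target size. The sketch below shows that planning step as a simple greedy grouping. The function name and the 128 MB target are hypothetical; in practice Iceberg's `rewrite_data_files` Spark procedure handles this, with its own strategy and defaults.

```python
MB = 1024 * 1024


def plan_compaction(file_sizes, target_bytes=128 * MB):
    """Group file sizes (bytes) into batches that each total at most target_bytes.

    A single file larger than target_bytes ends up alone in its own batch.
    """
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        # Close the current batch if adding this file would overshoot the target.
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups
```

Each returned batch corresponds to one output file, so the length of the result is the file count after compaction.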

## 🧪 Verification / duplicates
- Verified (Apache Iceberg/Spark/Flink docs, MotherDuck "Big Data is Dead" 2023, Databricks Lakehouse paper).
- Trust level: A.

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — 5V + 2026 lakehouse stack + DuckDB/Polars/Flink patterns |