Antigravity Agent f8b21af4be Wiki cleanup: error-doc removal, dedup merge, link normalization
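Large-scale cleanup of 10_Wiki/Topics:
- Removed 227 error-capture and unfinished stub documents
- Merged 43 cross-folder duplicate clusters (63 files → redirect)
- Normalized link names: fixed broken links, pointed links directly at redirect targets, and mapped concepts (~2,400 changes)
- Created 6 new category MOCs
- Removed 10,058 unresolved related-keyword links from Graph sections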

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 23:52:15 +09:00


---
id: wiki-2026-0508-big-data
title: Big Data
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Big Data, Large-Scale Data Processing]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [data, distributed-systems, analytics, lakehouse]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: spark,duckdb,iceberg,polars
---
# Big Data
## One-line summary
> **"The data does not fit on a single machine or in a single pass, so it requires distributed compute and columnar storage."** The field originated with Google's MapReduce paper (2004) and evolved from Hadoop to Spark to the lakehouse (Iceberg, Delta, Hudi). By 2026 the single-node DuckDB/Polars "Big Data is dead" movement has become mainstream.
## Core concepts
### The 5 Vs
- **Volume**: TB-PB scale.
- **Velocity**: streaming · near-real-time.
- **Variety**: structured + semi + unstructured.
- **Veracity**: data quality · trust.
- **Value**: ROI of analytics.
### The 2026 stack
- **Storage**: object store (S3/GCS) + open table format (Iceberg · Delta · Hudi).
- **Compute**: Spark · Trino · DuckDB · Polars · Snowflake · BigQuery.
- **Orchestration**: Airflow · Dagster · Prefect.
- **Stream**: Kafka · Flink · Kinesis.
- **Catalog**: Unity · Polaris · Nessie · Glue.
### Applications
1. BI dashboards.
2. ML training pipeline (feature store).
3. Operational analytics (real-time fraud, ad bidding).
## 💻 Patterns
### Writing an Iceberg table from Spark
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Register a Hadoop-type Iceberg catalog named "local", backed by S3.
spark = (
    SparkSession.builder
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "s3a://lake/warehouse")
    .getOrCreate()
)

# Read raw JSON events and write them as an Iceberg table partitioned by date.
df = spark.read.json("s3a://raw/events/*.json")
df.writeTo("local.events.daily").partitionedBy(col("date")).createOrReplace()
```
### DuckDB: single-node "big data" (100 GB on a laptop)
```python
import duckdb
con = duckdb.connect()
con.sql("""
SELECT user_id, COUNT(*) AS events, SUM(amount) AS revenue
FROM read_parquet('s3://lake/events/2026/*.parquet')
WHERE event_type = 'purchase'
GROUP BY user_id
ORDER BY revenue DESC
LIMIT 100
""").show()
```
### Polars: out-of-core lazy
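```python
import polars as pl

# Build a lazy query; nothing is read until collect() is called.
lazy = (
    pl.scan_parquet("s3://lake/events/*.parquet")
    .filter(pl.col("event_type") == "purchase")
    .group_by("user_id")
    .agg(pl.len().alias("events"), pl.col("amount").sum().alias("revenue"))
    .sort("revenue", descending=True)
    .limit(100)
)
# The streaming engine processes data in batches instead of loading it all at once
# (older Polars releases spell this collect(streaming=True)).
print(lazy.collect(engine="streaming"))
```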
### Flink: streaming aggregation
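```java
// kafkaSource, RevenueAggregator, and icebergSink are defined elsewhere.
DataStream<Event> events = env.fromSource(kafkaSource, ...);
events.keyBy(Event::userId)
      // 5-minute tumbling event-time windows per user
      .window(TumblingEventTimeWindows.of(Time.minutes(5)))
      .aggregate(new RevenueAggregator())
      .sinkTo(icebergSink);
```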
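The window assignment behind the tumbling aggregation above can be sketched in plain Python. This is a minimal illustration, not Flink's implementation: the `Event` tuple, its fields, and `tumbling_revenue` are hypothetical names, and watermarks and late data are ignored.

```python
from collections import defaultdict
from typing import NamedTuple

WINDOW_MS = 5 * 60 * 1000  # 5-minute tumbling window, matching the Flink job


class Event(NamedTuple):  # hypothetical event shape
    user_id: str
    ts_ms: int  # event time in epoch milliseconds
    amount: float


def window_start(ts_ms: int, size_ms: int = WINDOW_MS) -> int:
    """Align a timestamp to the start of the tumbling window that contains it."""
    return ts_ms - (ts_ms % size_ms)


def tumbling_revenue(events):
    """Sum amount per (user_id, window start)."""
    totals = defaultdict(float)
    for e in events:
        totals[(e.user_id, window_start(e.ts_ms))] += e.amount
    return dict(totals)


events = [
    Event("u1", 0, 10.0),
    Event("u1", 299_999, 5.0),  # last millisecond of the first window
    Event("u1", 300_000, 7.0),  # first millisecond of the second window
    Event("u2", 120_000, 3.0),
]
result = tumbling_revenue(events)
```

Because windows do not overlap, each event lands in exactly one `(user_id, window start)` key, which is what lets the aggregation run independently per key.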
### Iceberg time-travel + schema evolution
```sql
-- query a snapshot by ID or by timestamp
SELECT * FROM events FOR VERSION AS OF 8723649283746;
SELECT * FROM events FOR TIMESTAMP AS OF '2026-05-09 00:00:00';
-- column add (no rewrite)
ALTER TABLE events ADD COLUMN device_id STRING;
-- partition evolution
ALTER TABLE events ADD PARTITION FIELD bucket(16, user_id);
```
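To see why `bucket(16, user_id)` keeps the partition count bounded, here is a rough sketch of the idea. It is an assumption-laden stand-in: Iceberg's bucket transform is specified with a 32-bit Murmur3 hash, whereas this sketch swaps in `zlib.crc32` from the standard library, so the bucket numbers will not match Iceberg's. The `bucket` helper and the generated user IDs are hypothetical.

```python
import zlib


def bucket(num_buckets: int, value: str) -> int:
    """Map a value to one of num_buckets partitions.

    Assumption: zlib.crc32 stands in for the Murmur3 hash Iceberg uses.
    """
    return zlib.crc32(value.encode("utf-8")) % num_buckets


user_ids = [f"user-{i}" for i in range(10_000)]

# Partitioning directly on user_id creates one partition per distinct user.
identity_partitions = len(set(user_ids))

# Bucketing caps the partition count at num_buckets, however many users exist.
bucket_partitions = len({bucket(16, u) for u in user_ids})
```

The same value always hashes to the same bucket, so equality filters on `user_id` can still prune to a single bucket.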
### Spark: dynamic partition pruning + AQE
```python
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
# AQE re-optimizes the query plan at runtime
df = spark.sql("""
SELECT u.name, e.event_type, COUNT(*)
FROM events e JOIN users u ON e.user_id = u.id
WHERE e.date >= '2026-05-01'
GROUP BY u.name, e.event_type
""")
```
## Decision criteria
| Scenario | Stack |
|---|---|
| < 100 GB · single node | DuckDB · Polars |
| 100 GB - 10 TB batch | Spark + Iceberg |
| > 10 TB / day | Spark/Trino + Iceberg + Snowflake |
| Streaming < 1s latency | Flink + Kafka |
| Ad-hoc SQL | Trino · DuckDB |
| ML training | Spark + Petastorm or Ray Data |
**Default**: a hybrid of Iceberg on S3 with Spark/DuckDB, the standard for a modern lakehouse.
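The table above can be read as a small lookup. The following is only a sketch under stated assumptions: `recommend_stack`, its `workload` labels, and the boundary handling (exactly 100 GB goes to Spark, exactly 10 TB stays in the batch tier) are hypothetical choices, and the table's "> 10 TB / day" row is simplified to a single size threshold.

```python
def recommend_stack(size_gb: float, workload: str = "batch") -> str:
    """Return the stack suggested by the decision table (hypothetical helper)."""
    # Workload-specific rows take precedence over the size tiers.
    if workload == "streaming":
        return "Flink + Kafka"
    if workload == "adhoc_sql":
        return "Trino / DuckDB"
    if workload == "ml_training":
        return "Spark + Petastorm or Ray Data"
    if workload != "batch":
        raise ValueError(f"unknown workload: {workload!r}")

    # Size tiers for batch work; boundary handling is an assumption.
    if size_gb < 100:
        return "DuckDB / Polars"
    if size_gb <= 10_000:
        return "Spark + Iceberg"
    return "Spark/Trino + Iceberg + Snowflake"


small = recommend_stack(40)
medium = recommend_stack(2_000)
large = recommend_stack(50_000)
stream = recommend_stack(500, workload="streaming")
```

Treat the thresholds as starting points; measured query times on real data should override them.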
## 🔗 Graph
- Parent: [[Distributed Systems]] · [[Data Engineering]]
- Variants: [[Lakehouse]] · [[Data Warehouse]]
- Applications: [[Apache Ignite]]
- Adjacent: [[Append-only log]] · [[Stream-Processing-Architectures|Stream Processing]]
## 🤖 Using LLMs
**When to use**: generating SQL, advising on partitioning strategy, explaining schema-evolution diffs, and drafting Iceberg table-maintenance queries.
**When not to use**: production tuning (shuffle partitions, executor sizing), which must be metric-driven; treat LLM hints only as a starting point.
## ❌ Anti-patterns
- **Premature distribution**: running Spark on less than 50 GB of data, where single-node DuckDB can be up to 100x faster.
- **Small file problem**: Spark writing millions of 1 KB Parquet files; this requires compaction.
- **Hive-style partition explosion**: partitioning on a high-cardinality column (e.g. user_id); use an Iceberg bucket transform instead.
- **Over-reliance on schema-on-read**: this erodes governance; use an open table format.
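The small file problem in the list above comes down to simple arithmetic, sketched here. `compaction_plan` is a hypothetical helper and the 512 MiB target is an assumed value chosen for illustration; real compaction jobs also account for partition boundaries and delete files.

```python
import math


def compaction_plan(file_sizes_bytes, target_bytes=512 * 1024 * 1024):
    """Estimate file counts before and after compacting to an assumed target size."""
    total = sum(file_sizes_bytes)
    # At least one output file if there is any input at all.
    files_after = max(1, math.ceil(total / target_bytes)) if file_sizes_bytes else 0
    return {
        "files_before": len(file_sizes_bytes),
        "files_after": files_after,
        "total_bytes": total,
    }


small_files = [1024] * 1_000_000  # one million 1 KB files, as in the anti-pattern
plan = compaction_plan(small_files)
```

A million tiny files hold only about 1 GB of data here, so the per-file open and metadata overhead far outweighs the time spent reading the data itself.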
## 🧪 Verification / duplicates
- Verified against the Apache Iceberg/Spark/Flink docs, MotherDuck "Big Data is Dead" (2023), and the Databricks Lakehouse paper.
- Trust level: A.
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — 5V + 2026 lakehouse stack + DuckDB/Polars/Flink patterns |