---
id: wiki-2026-0508-big-data
title: Big Data
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Big Data, Large-Scale Data Processing]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [data, distributed-systems, analytics, lakehouse]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: spark,duckdb,iceberg,polars
---

# Big Data

## In one line

> **"Data that does not fit on a single machine or in a single pass, and therefore requires distributed compute and columnar storage."** The idea traces back to Google's MapReduce paper (2004). The stack then evolved from Hadoop to Spark to the lakehouse (Iceberg + Delta + Hudi), and as of 2026 the single-node DuckDB/Polars "Big Data is dead" movement has gone mainstream.

## Core

### The 5 Vs

- **Volume**: TB to PB scale.
- **Velocity**: streaming · near-real-time.
- **Variety**: structured + semi-structured + unstructured.
- **Veracity**: data quality · trust.
- **Value**: ROI of analytics.

### The 2026 stack

- **Storage**: object store (S3/GCS) + open table format (Iceberg · Delta · Hudi).
- **Compute**: Spark · Trino · DuckDB · Polars · Snowflake · BigQuery.
- **Orchestration**: Airflow · Dagster · Prefect.
- **Stream**: Kafka · Flink · Kinesis.
- **Catalog**: Unity · Polaris · Nessie · Glue.

### Applications

1. BI dashboards.
2. ML training pipeline (feature store).
3. Operational analytics (real-time fraud, ad bidding).

## 💻 Patterns

### Writing an Iceberg table from Spark

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Register a Hadoop-type Iceberg catalog named "local" with its warehouse on S3.
# Assumes the Iceberg Spark runtime jar is on the classpath.
spark = (
    SparkSession.builder
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "s3a://lake/warehouse")
    .getOrCreate()
)

df = spark.read.json("s3a://raw/events/*.json")
# partitionedBy is documented to take Column expressions; "date" must exist in the events.
df.writeTo("local.events.daily").partitionedBy(col("date")).createOrReplace()
```

### DuckDB: single-node "big data" (100 GB on a laptop)

```python
import duckdb

con = duckdb.connect()  # in-memory database
# Reading s3:// paths relies on the httpfs extension and configured S3 credentials.
con.sql("""
    SELECT user_id, COUNT(*) AS events, SUM(amount) AS revenue
    FROM read_parquet('s3://lake/events/2026/*.parquet')
    WHERE event_type = 'purchase'
    GROUP BY user_id
    ORDER BY revenue DESC
    LIMIT 100
""").show()
```

### Polars: out-of-core lazy evaluation

```python
import polars as pl

# Build a lazy query plan; nothing is read until collect() is called.
lf = (
    pl.scan_parquet("s3://lake/events/*.parquet")
    .filter(pl.col("event_type") == "purchase")
    .group_by("user_id")
    .agg(pl.len().alias("events"), pl.col("amount").sum().alias("revenue"))
    .sort("revenue", descending=True)
    .limit(100)
)
# Run on the streaming engine so the input need not fit in memory.
# Newer Polars releases spell this collect(engine="streaming").
print(lf.collect(streaming=True))
```

### Flink: streaming aggregation

```java
DataStream<Event> events = env.fromSource(kafkaSource, ...);

// Per-user aggregation over 5-minute tumbling event-time windows, written to Iceberg.
events.keyBy(Event::userId)
      .window(TumblingEventTimeWindows.of(Time.minutes(5)))
      .aggregate(new RevenueAggregator())
      .sinkTo(icebergSink);
```

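To make the windowing step concrete, here is a minimal pure-Python sketch of what a keyed 5-minute tumbling event-time window does to a batch of events. It is not Flink code: the `Event` shape, its `amount` field, and `tumbling_revenue` are hypothetical names for illustration, it assumes `RevenueAggregator` sums an amount, and it ignores watermarks and late data.

```python
from collections import defaultdict
from typing import NamedTuple

WINDOW_MS = 5 * 60 * 1000  # 5-minute tumbling windows, as in the Flink snippet


class Event(NamedTuple):  # hypothetical event shape
    user_id: str
    ts_ms: int       # event time in epoch milliseconds
    amount: float


def window_start(ts_ms: int, size_ms: int = WINDOW_MS) -> int:
    """Align a timestamp to the start of the tumbling window that contains it."""
    return ts_ms - (ts_ms % size_ms)


def tumbling_revenue(events):
    """Sum amount per (user_id, window_start); each event lands in exactly one window."""
    totals = defaultdict(float)
    for e in events:
        totals[(e.user_id, window_start(e.ts_ms))] += e.amount
    return dict(totals)


events = [
    Event("u1", 0, 10.0),
    Event("u1", 299_999, 5.0),   # last millisecond of the first window
    Event("u1", 300_000, 7.0),   # first millisecond of the second window
    Event("u2", 60_000, 1.5),
]
result = tumbling_revenue(events)
```

Because tumbling windows do not overlap, the window key is just the timestamp rounded down to a multiple of the window size; a real streaming engine additionally has to decide when a window is complete, which is what watermarks are for.
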
### Iceberg time-travel + schema evolution

```sql
-- Spark SQL. Query a past snapshot by id or by timestamp.
SELECT * FROM events FOR VERSION AS OF 8723649283746;
SELECT * FROM events FOR TIMESTAMP AS OF '2026-05-09 00:00:00';

-- Add a column (metadata-only, no data rewrite)
ALTER TABLE events ADD COLUMN device_id STRING;

-- Partition evolution (requires the Iceberg SQL extensions; applies to newly written data)
ALTER TABLE events ADD PARTITION FIELD bucket(16, user_id);
```

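Conceptually, a timestamp-based time-travel query resolves to the snapshot that was current at that instant, that is, the most recent snapshot committed at or before the requested time. A minimal sketch of that lookup over a hypothetical in-memory snapshot log follows; the snapshot IDs and commit times are made up, and `snapshot_as_of` is an illustrative helper, not an Iceberg API.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical snapshot log: (commit time, snapshot id), sorted by commit time.
SNAPSHOT_LOG = [
    (datetime(2026, 5, 7, 12, 0), 101),
    (datetime(2026, 5, 8, 18, 30), 102),
    (datetime(2026, 5, 9, 6, 15), 103),
]


def snapshot_as_of(log, ts):
    """Return the id of the latest snapshot committed at or before ts."""
    commit_times = [t for t, _ in log]
    idx = bisect_right(commit_times, ts)  # number of snapshots with commit time <= ts
    if idx == 0:
        raise ValueError(f"no snapshot exists at or before {ts}")
    return log[idx - 1][1]


chosen = snapshot_as_of(SNAPSHOT_LOG, datetime(2026, 5, 9, 0, 0))
```

With this log, a query as of midnight on 2026-05-09 reads the snapshot committed the previous evening rather than the one committed later that morning.
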
### Spark: dynamic partition pruning + AQE

```python
# Assumes an existing SparkSession named `spark`.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
# Dynamic partition pruning is on by default since Spark 3.0; set here for explicitness.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

# AQE re-optimizes the physical plan at runtime using shuffle statistics.
df = spark.sql("""
    SELECT u.name, e.event_type, COUNT(*) AS events
    FROM events e JOIN users u ON e.user_id = u.id
    WHERE e.date >= '2026-05-01'
    GROUP BY u.name, e.event_type
""")
```

## Decision criteria

| Situation | Stack |
|---|---|
| < 100 GB · single node | DuckDB · Polars |
| 100 GB - 10 TB batch | Spark + Iceberg |
| > 10 TB / day | Spark/Trino + Iceberg + Snowflake |
| Streaming < 1s latency | Flink + Kafka |
| Ad-hoc SQL | Trino · DuckDB |
| ML training | Spark + Petastorm or Ray Data |

**Default**: a hybrid of Iceberg on S3 with Spark/DuckDB, the standard modern lakehouse setup.

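The table can be read as a simple lookup. The sketch below encodes it as a hypothetical `pick_stack` helper. The thresholds and stack names come from the table; the function name, the `workload` labels, the rule ordering (workload type first, then size), and treating the "> 10 TB / day" row as "anything above 10 TB" are assumptions made for illustration.

```python
def pick_stack(size_gb: float, workload: str = "batch") -> str:
    """Suggest a stack from the decision table.

    workload is one of 'batch', 'streaming', 'adhoc_sql', 'ml_training'
    (hypothetical labels for the table's rows).
    """
    by_workload = {
        "streaming": "Flink + Kafka",
        "adhoc_sql": "Trino · DuckDB",
        "ml_training": "Spark + Petastorm or Ray Data",
    }
    if workload in by_workload:
        return by_workload[workload]
    if workload != "batch":
        raise ValueError(f"unknown workload: {workload!r}")
    if size_gb < 100:
        return "DuckDB · Polars"
    if size_gb <= 10_000:  # 100 GB to 10 TB
        return "Spark + Iceberg"
    return "Spark/Trino + Iceberg + Snowflake"


suggestion = pick_stack(40)
```

A helper like this is only a first filter; team skills, existing infrastructure, and cost usually decide between the options inside a row.
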
## 🔗 Graph

- Parent: [[Distributed Systems]] · [[Data Engineering]]
- Variants: [[Lakehouse]] · [[Data Warehouse]]
- Applications: [[Apache Ignite]]
- Adjacent: [[Append-only log]] · [[Stream-Processing-Architectures|Stream Processing]]

## 🤖 LLM usage

**When**: generating SQL, advising on partitioning strategy, explaining schema-evolution diffs, drafting Iceberg table-maintenance queries.

**When not**: production tuning (shuffle partitions · executor sizing). That work must be metric-driven; treat LLM hints as a starting point only.

## ❌ Anti-patterns

- **Premature distribution**: running Spark on < 50 GB of data. Single-node DuckDB avoids cluster startup and shuffle overhead and can be up to 100x faster at this scale.
- **Small file problem**: Spark jobs that emit millions of ~1 KB Parquet files; compaction is required.
- **Hive-style partition explosion**: partitioning on a high-cardinality column (e.g. user_id); use Iceberg's bucket transform instead.
- **Over-reliance on schema-on-read**: erodes governance; use an open table format.

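The partition-explosion fix can be illustrated without a cluster. Partitioning directly on `user_id` yields one partition per distinct user, while hashing into N buckets caps the partition count at N. The sketch below uses `zlib.crc32` as a stand-in hash; Iceberg's bucket transform is specified with a 32-bit Murmur3 hash, so the bucket numbers computed here will not match Iceberg's. The `bucket` helper and the synthetic user IDs are hypothetical.

```python
import zlib

NUM_BUCKETS = 16  # same bucket count as bucket(16, user_id) in the partition-evolution example


def bucket(value: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a value to a bucket in [0, num_buckets). Stand-in hash, not Iceberg's Murmur3."""
    return zlib.crc32(value.encode("utf-8")) % num_buckets


# Synthetic high-cardinality column (assumption for illustration).
user_ids = [f"user-{i}" for i in range(100_000)]

identity_partitions = set(user_ids)                 # Hive-style: one partition per user
bucket_partitions = {bucket(u) for u in user_ids}   # bucketed: at most NUM_BUCKETS

print(len(identity_partitions), len(bucket_partitions))
```

The bucket count is a tuning knob: more buckets mean smaller files per partition but more of them, so it interacts directly with the small-file problem listed above.
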
## 🧪 Verification / duplicates

- Verified (Apache Iceberg/Spark/Flink docs, MotherDuck "Big Data is Dead" 2023, Databricks Lakehouse paper).
- Trust level: A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: 5V + 2026 lakehouse stack + DuckDB/Polars/Flink patterns |