[G1-Sync] Manual knowledge update
@@ -0,0 +1,276 @@
---
id: data-eng-lakehouse
title: Lakehouse — Iceberg / Delta / Parquet
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [data-engineering, lakehouse, iceberg, parquet, vibe-coding]
tech_stack: { language: "SQL / Python", applicable_to: ["Data Engineering"] }
applied_in: []
aliases: [Apache Iceberg, Delta Lake, Hudi, Parquet, lakehouse, ACID on object storage]
---

# Lakehouse (Iceberg / Delta / Hudi)

> Object storage (S3) plus a table format gives warehouse-style transactions at data-lake cost. The main table formats are **Apache Iceberg (open standard), Delta Lake (Databricks), and Hudi**. Spark, Trino, DuckDB, and DataFusion act as query engines.
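
## 📖 Core Concepts
- Parquet: columnar binary file format with built-in compression.
- Table format: a metadata layer over the data files that provides schema, snapshots, and ACID commits.
- Time travel: query an older snapshot of the table.
- Merge-on-Read vs Copy-on-Write: Merge-on-Read writes small delta/delete files and merges them at read time; Copy-on-Write rewrites the affected data files at write time.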
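
To make the snapshot and time-travel ideas above concrete, here is a minimal in-memory sketch. It is not how any real table format is implemented; `ToyTable`, `Snapshot`, and the method names are hypothetical, and the only assumption carried over is that every commit produces an immutable snapshot listing the data files visible at that point.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    timestamp_ms: int
    data_files: tuple  # immutable set of file paths visible in this snapshot


@dataclass
class ToyTable:
    snapshots: list = field(default_factory=list)

    def commit_append(self, new_files, timestamp_ms):
        """Create a new snapshot = previous file list + newly written files."""
        current = self.snapshots[-1].data_files if self.snapshots else ()
        snap = Snapshot(len(self.snapshots) + 1, timestamp_ms, current + tuple(new_files))
        self.snapshots.append(snap)
        return snap.snapshot_id

    def files_as_of_version(self, snapshot_id):
        """Time travel by snapshot ID (like VERSION AS OF)."""
        for snap in self.snapshots:
            if snap.snapshot_id == snapshot_id:
                return snap.data_files
        raise KeyError(snapshot_id)

    def files_as_of_timestamp(self, timestamp_ms):
        """Time travel by time (like TIMESTAMP AS OF): latest snapshot at or before it."""
        eligible = [s for s in self.snapshots if s.timestamp_ms <= timestamp_ms]
        if not eligible:
            raise ValueError("no snapshot at or before that time")
        return eligible[-1].data_files


table = ToyTable()
v1 = table.commit_append(["a.parquet"], timestamp_ms=1000)
v2 = table.commit_append(["b.parquet"], timestamp_ms=2000)
print(table.files_as_of_version(v1))
print(table.files_as_of_timestamp(2500))
```

Because old snapshots are never mutated, reading an older version is just a lookup of an older file list; no data has to be copied.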

## 💻 Code Patterns

### Parquet (the base file format)
```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3], 'name': ['a', 'b', 'c']})
# s3:// paths need the s3fs package installed alongside pandas
df.to_parquet('s3://bucket/data.parquet', engine='pyarrow', compression='zstd')

# Read the whole file
df = pd.read_parquet('s3://bucket/data.parquet')
# Read only the columns you need
ids = pd.read_parquet('s3://bucket/data.parquet', columns=['id'])
```

→ Compression is applied automatically, and readers can fetch individual columns.

### Apache Iceberg (Spark)
```python
from pyspark.sql import SparkSession

# Assumes the iceberg-spark-runtime JAR is on the Spark classpath.
spark = SparkSession.builder \
    .config('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') \
    .config('spark.sql.catalog.cat', 'org.apache.iceberg.spark.SparkCatalog') \
    .config('spark.sql.catalog.cat.type', 'hadoop') \
    .config('spark.sql.catalog.cat.warehouse', 's3://bucket/warehouse') \
    .getOrCreate()

# Create the table, partitioned by day through a hidden partition transform
spark.sql('''
    CREATE TABLE cat.db.orders (
        id BIGINT,
        user_id STRING,
        amount DECIMAL(10, 2),
        created_at TIMESTAMP
    ) USING iceberg
    PARTITIONED BY (days(created_at))
''')

# Insert (typed literal so the value matches the TIMESTAMP column)
spark.sql("INSERT INTO cat.db.orders VALUES (1, 'u1', 99.50, TIMESTAMP '2026-05-09 00:00:00')")

# Time travel: by snapshot ID, or by timestamp
spark.sql("SELECT * FROM cat.db.orders VERSION AS OF 12345")
spark.sql("SELECT * FROM cat.db.orders TIMESTAMP AS OF '2026-05-01'")
```

### Iceberg with Trino / Athena / DuckDB
```sql
-- Trino
CREATE TABLE iceberg.db.orders (...)
WITH (format = 'PARQUET', partitioning = ARRAY['day(created_at)']);

-- DuckDB (modern, lightweight)
INSTALL iceberg;
LOAD iceberg;
SELECT * FROM iceberg_scan('s3://bucket/orders');
```

### Schema evolution
```sql
ALTER TABLE cat.db.orders ADD COLUMN status STRING;
ALTER TABLE cat.db.orders RENAME COLUMN amount TO total;
ALTER TABLE cat.db.orders DROP COLUMN status;
```

→ Existing data files stay readable. Iceberg tracks columns by ID, so these are safe, metadata-only changes.
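
The reason old files survive a rename or drop can be sketched in a few lines. This is a toy model, not Iceberg code: the dict-based `schema`, the `read` helper, and the other function names are hypothetical. The one property it borrows from Iceberg is that data files refer to columns by field ID rather than by name.

```python
# Toy model: data files store values keyed by field ID, never by column name.
schema = {1: "id", 2: "user_id", 3: "amount"}  # field_id -> current column name
old_file = [{1: 1, 2: "u1", 3: 99.5}]          # written before any schema change


def read(rows, schema):
    """Project stored rows through the current schema; absent fields read as None."""
    return [{name: row.get(fid) for fid, name in schema.items()} for row in rows]


def add_column(schema, name):
    # Next unused ID in this toy. Real formats track a last-assigned ID so that
    # IDs of dropped columns are never handed out again.
    return {**schema, max(schema) + 1: name}


def rename_column(schema, old, new):
    # Only the name changes; the field ID (and therefore the stored data) does not.
    return {fid: (new if name == old else name) for fid, name in schema.items()}


def drop_column(schema, name):
    return {fid: n for fid, n in schema.items() if n != name}


evolved = rename_column(add_column(schema, "status"), "amount", "total")
print(read(old_file, evolved))
print(read(old_file, drop_column(evolved, "status")))
```

The old file is never rewritten: the renamed column still resolves through field ID 3, and the added column simply reads as null for rows written before it existed.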

### Partition evolution
```sql
ALTER TABLE cat.db.orders REPLACE PARTITION FIELD days(created_at) WITH hours(created_at);
```

→ Existing data keeps its old layout. Only newly written data uses the new partition spec.
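
A rough sketch of how old and new partition layouts can coexist: each data file remembers which partition spec it was written under, and query planning applies that file's own transform. The `SPECS` table, the file dicts, and `plan` are hypothetical names for illustration; the `days` and `hours` helpers follow the definition of Iceberg's day and hour transforms as whole days or hours since the Unix epoch.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def days(ts):
    """Whole days since the Unix epoch."""
    return (ts - EPOCH).days


def hours(ts):
    """Whole hours since the Unix epoch."""
    return int((ts - EPOCH).total_seconds() // 3600)


SPECS = {0: days, 1: hours}  # spec_id -> transform; spec 1 replaced spec 0

# Each data file records the spec it was written with and its partition value.
files = [
    {"path": "old.parquet", "spec_id": 0,
     "partition": days(datetime(2026, 5, 9, 10, tzinfo=timezone.utc))},
    {"path": "new.parquet", "spec_id": 1,
     "partition": hours(datetime(2026, 5, 10, 10, tzinfo=timezone.utc))},
]


def plan(files, ts):
    """Return the files whose partition could contain a row with timestamp ts."""
    return [f["path"] for f in files if SPECS[f["spec_id"]](ts) == f["partition"]]


print(plan(files, datetime(2026, 5, 9, 23, tzinfo=timezone.utc)))
print(plan(files, datetime(2026, 5, 10, 10, 30, tzinfo=timezone.utc)))
```

Nothing is rewritten when the spec changes: the day-partitioned file is still pruned by day, and the hour-partitioned file by hour.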

### Compaction (small files → large files)
```sql
CALL cat.system.rewrite_data_files('db.orders');
```

→ Addresses the small-file problem.
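
Conceptually, compaction groups small files into bins near a target size and rewrites each bin as one file. The sketch below shows only that grouping step with a simple greedy strategy; `plan_compaction` and the 512 MB default are illustrative assumptions, not the planner that `rewrite_data_files` actually uses.

```python
def plan_compaction(file_sizes_mb, target_mb=512):
    """Greedily group small files into bins of at most target_mb each."""
    bins, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb, reverse=True):
        if size >= target_mb:
            continue  # already large enough; leave it untouched
        if current and current_size + size > target_mb:
            bins.append(current)  # bin is full, start a new one
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    # Rewriting a bin that holds a single file gains nothing, so drop those.
    return [b for b in bins if len(b) > 1]


groups = plan_compaction([600, 300, 200, 100, 50, 10])
print(groups)
```

In this example six files become three after the rewrite: the 600 MB file is left alone, and the other five collapse into two output files.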

### MERGE INTO (UPSERT)
```sql
MERGE INTO cat.db.orders t
USING new_orders s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```
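
The semantics of the `MERGE` statement above can be mirrored on plain Python rows. This is a sketch of the logical result only, assuming `id` is the join key and both sides share the same columns; `merge_upsert` is a hypothetical helper, and the duplicate-key check reflects the rule that a target row may be matched by at most one source row.

```python
def merge_upsert(target, source, key="id"):
    """Matched keys take the source row (UPDATE SET *); new keys are appended (INSERT *)."""
    seen = set()
    for row in source:
        if row[key] in seen:
            # Two source rows for one target row would make the update ambiguous.
            raise ValueError(f"duplicate source key: {row[key]!r}")
        seen.add(row[key])

    merged = {row[key]: dict(row) for row in target}
    for row in source:
        merged[row[key]] = dict(row)
    return list(merged.values())


orders = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]
new_orders = [{"id": 2, "amount": 25}, {"id": 3, "amount": 30}]
result = merge_upsert(orders, new_orders)
print(result)
```

Deduplicating the source on the join key before merging avoids the ambiguous-match error in practice.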

### Snapshot management
```sql
-- Expire old snapshots (saves storage)
CALL cat.system.expire_snapshots('db.orders', TIMESTAMP '2026-04-01');

-- Remove orphan files that table metadata no longer references
CALL cat.system.remove_orphan_files('db.orders');
```
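
Expiring a snapshot only frees the files that no surviving snapshot still references. The sketch below models that rule; the tuple layout, the `expire_snapshots` function, and the `retain_last` parameter are illustrative stand-ins rather than the procedure's real implementation.

```python
def expire_snapshots(snapshots, older_than, retain_last=1):
    """snapshots: list of (timestamp, set_of_files), oldest first.

    Returns (kept_snapshots, deletable_files).
    """
    keep_from = max(0, len(snapshots) - retain_last)
    kept, expired = [], []
    for i, snap in enumerate(snapshots):
        ts, _files = snap
        # Keep anything recent enough, and always keep the newest retain_last.
        (kept if ts >= older_than or i >= keep_from else expired).append(snap)

    live = set().union(*(files for _, files in kept))
    candidates = set().union(*(files for _, files in expired))
    return kept, candidates - live  # never delete a file a kept snapshot still needs


# Snapshot 3 is the result of a compaction that rewrote a + b into c.
history = [(1, {"a"}), (2, {"a", "b"}), (3, {"c"})]
kept, deletable = expire_snapshots(history, older_than=3)
print(kept, deletable)
```

With `older_than=2` instead, only the first snapshot expires and nothing can be deleted, because file `a` is still referenced by snapshot 2.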

### Delta Lake (Databricks-friendly)
```python
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

builder = SparkSession.builder.config(
    "spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtensions"
).config(
    "spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog"
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

spark.sql('CREATE TABLE db.orders (...) USING DELTA')
spark.sql('SELECT * FROM db.orders VERSION AS OF 5')
```

```python
# Python API
from delta.tables import DeltaTable

# new_data is a DataFrame of incoming rows, defined elsewhere
dt = DeltaTable.forPath(spark, '/path/to/orders')
dt.alias('t').merge(
    new_data.alias('s'),
    't.id = s.id'
).whenMatchedUpdateAll() \
 .whenNotMatchedInsertAll() \
 .execute()

# Time travel
df = spark.read.format('delta').option('versionAsOf', 5).load('/path')
```

### Iceberg vs Delta vs Hudi
```
Iceberg:
+ Most open (Apache, vendor-neutral)
+ Strong schema and partition evolution
+ Large ecosystem (Snowflake, BigQuery, AWS, Trino)

Delta Lake:
+ Databricks native
+ New features land quickly
- Only partly open source (not all of DI)

Hudi:
+ Streaming-friendly
+ Strong Merge-on-Read support
- Smaller community (vs Iceberg)
```

→ **As of 2026, Iceberg is trending toward becoming the standard.**

### Streaming → Lakehouse
```python
from pyspark.sql.functions import from_json

# Spark Structured Streaming
# `schema` is the StructType of the JSON payload, defined elsewhere
stream = spark.readStream.format('kafka').option(...).load()
parsed = stream.selectExpr('CAST(value AS STRING) as json').select(from_json('json', schema).alias('d'))
flat = parsed.select('d.*')

flat.writeStream \
    .format('iceberg') \
    .outputMode('append') \
    .option('path', 'cat.db.events') \
    .option('checkpointLocation', 's3://checkpoints/events') \
    .trigger(processingTime='1 minute') \
    .start()
```

→ Streams real-time events into Iceberg.

### CDC ingestion (Debezium → Iceberg)
```
DB → Debezium → Kafka → Spark / Flink → Iceberg
```

### File layout
```
s3://bucket/warehouse/db/orders/
├── data/
│   ├── created_at_day=2026-05-09/file-uuid.parquet
│   └── ...
└── metadata/
    ├── snap-xxx.avro        (manifest list, one per snapshot)
    ├── manifest-yyy.avro    (manifest: lists data files)
    └── v1.metadata.json     (table metadata: schema, partition spec, snapshots)
```
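
A reader resolves a table by walking the metadata tree: the table metadata file names the current snapshot, the snapshot's manifest list names the manifests, and each manifest names data files. The sketch below mimics that walk with in-memory dicts; the file names and the `data_files` helper are made up for illustration, and real metadata carries far more (statistics, partition values, delete files).

```python
# Toy in-memory stand-ins for the files under metadata/.
metadata_json = {
    "current-snapshot-id": 2,
    "snapshots": {1: "snap-1.avro", 2: "snap-2.avro"},  # snapshot id -> manifest list
}
manifest_lists = {
    "snap-1.avro": ["m0.avro"],
    "snap-2.avro": ["m0.avro", "m1.avro"],  # manifests are reused across snapshots
}
manifests = {
    "m0.avro": ["data/a.parquet"],
    "m1.avro": ["data/b.parquet", "data/c.parquet"],
}


def data_files(snapshot_id=None):
    """Resolve a snapshot (default: current) down to its list of data files."""
    sid = metadata_json["current-snapshot-id"] if snapshot_id is None else snapshot_id
    manifest_list = metadata_json["snapshots"][sid]
    return [f for m in manifest_lists[manifest_list] for f in manifests[m]]


print(data_files())   # current snapshot
print(data_files(1))  # older snapshot
```

Since a new snapshot reuses unchanged manifests, a commit typically writes only a few small metadata files rather than re-listing the whole table.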

### Catalog (REST / Hive / Glue / Nessie)
```
Hive Metastore: legacy
AWS Glue:       AWS native
REST catalog:   Iceberg standard
Nessie:         git-like branching
Polaris:        open source
Tabular:        managed
```

```python
# Nessie: branch / merge
# Assumes `cat` is configured as a Nessie catalog with the Nessie Spark SQL extensions loaded
spark.sql("CREATE BRANCH dev IN cat FROM main")
spark.sql("USE REFERENCE dev IN cat")
# Work on the dev branch without affecting production (main)
```

### Cost
```
S3 storage:  $23/TB/month (Standard)
Glacier:     $4/TB/month (cold)

vs warehouse:
Snowflake:   $40+/TB/month (compute billed separately)
BigQuery:    $20/TB/month + $6.25/TB queried
```

→ Lakehouse = large cost savings on storage.
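
To see what those rates mean at scale, the arithmetic can be sketched as below. The prices are taken directly from the table above and will drift over time; the function names and the 100 TB / 50 TB workload are hypothetical, and compute cost for the lakehouse engines is deliberately left out.

```python
# $/TB/month, copied from the cost table in this note.
PRICE_PER_TB_MONTH = {
    "s3_standard": 23.0,
    "glacier": 4.0,
    "snowflake": 40.0,
    "bigquery": 20.0,
}
BIGQUERY_QUERY_PER_TB = 6.25  # $/TB queried


def monthly_storage_cost(system, stored_tb):
    """Storage-only monthly cost in dollars."""
    return PRICE_PER_TB_MONTH[system] * stored_tb


def bigquery_monthly_cost(stored_tb, scanned_tb):
    """BigQuery storage plus on-demand query cost."""
    return monthly_storage_cost("bigquery", stored_tb) + BIGQUERY_QUERY_PER_TB * scanned_tb


stored_tb = 100
for system in PRICE_PER_TB_MONTH:
    print(system, monthly_storage_cost(system, stored_tb))
print("bigquery incl. 50 TB scanned:", bigquery_monthly_cost(stored_tb, 50))
```

At 100 TB the storage gap between S3 Standard and the Snowflake figure is $1,700 per month; whether that survives once compute is added depends on the workload.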

### Compute engines
```
Spark:      standard batch engine
Flink:      streaming
Trino:      interactive query
DuckDB:     single-node, fast
DataFusion: Rust, embeddable
Snowflake / BigQuery: query through an external catalog
```

## 🤔 Decision Criteria
| Situation | Recommendation |
|---|---|
| New lake | Iceberg |
| Databricks | Delta Lake |
| Streaming heavy | Hudi, or Iceberg + Flink |
| Small / single node | DuckDB + Parquet |
| Analytical compute | Trino / Spark |
| Managed | Snowflake / BigQuery / Databricks |

## ❌ Anti-patterns
- **CSV / JSON in production**: expensive to parse and weakly typed. Use Parquet.
- **Many small files**: slow queries. Run compaction.
- **Over-fine partitioning**: produces too many files.
- **Never expiring snapshots**: storage grows without bound.
- **Inserting without regard to the schema**: breaks readers. Enforce the schema.
- **Writing directly to S3 without coordination**: race conditions. Use transactional commits.
- **No catalog, raw file paths only**: schema changes cannot be tracked.
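
The race in the direct-write anti-pattern is what a catalog's atomic commit prevents. Below is a minimal sketch of optimistic concurrency, where a commit succeeds only if the table is still at the version the writer started from. `ToyCatalog` and `append_with_retry` are hypothetical; a real catalog does the compare-and-swap in its backing store, not with an in-process lock.

```python
import threading


class ToyCatalog:
    """Holds the pointer to the current table state and swaps it atomically."""

    def __init__(self):
        self._lock = threading.Lock()
        self.version = 0
        self.files = ()

    def commit(self, expected_version, new_files):
        """Compare-and-swap: apply the change only if nobody committed in between."""
        with self._lock:
            if self.version != expected_version:
                return False  # stale base; the caller must re-read and retry
            self.version += 1
            self.files = self.files + tuple(new_files)
            return True


def append_with_retry(catalog, new_files, max_attempts=100):
    """Read the current version, try to commit against it, retry on conflict."""
    for _ in range(max_attempts):
        base = catalog.version
        if catalog.commit(base, new_files):
            return base + 1
    raise RuntimeError("commit failed after retries")


catalog = ToyCatalog()
first = catalog.commit(0, ["x.parquet"])  # succeeds: table was at version 0
stale = catalog.commit(0, ["y.parquet"])  # rejected: table is now at version 1
retried = append_with_retry(catalog, ["y.parquet"])
print(first, stale, retried, catalog.files)
```

A writer that loses the race loses nothing but time: its data files are already written, and only the small metadata commit is retried.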

## 🤖 LLM Usage Hints
- Iceberg + S3 + Trino/Spark is the modern OSS stack.
- Use a catalog (Glue / Nessie / Polaris).
- Run compaction and snapshot expiration on a regular schedule.

## 🔗 Related Documents
- [[Data_Eng_dbt]]
- [[Data_Eng_Airflow_Dagster]]
- [[DB_ClickHouse_OLAP]]