id: data-eng-lakehouse
title: Lakehouse - Iceberg / Delta / Parquet
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: data-engineering, lakehouse, iceberg, parquet, vibe-coding
tech_stack: language = SQL / Python; applicable_to = Data Engineering
applied_in:
aliases: Apache Iceberg, Delta Lake, Hudi, Parquet, lakehouse, ACID on object storage

Lakehouse (Iceberg / Delta / Hudi)

Object storage (S3) + table format = warehouse-style transactions at data-lake cost. Apache Iceberg is the open standard; Delta Lake (Databricks) and Hudi are the main alternatives. Spark, Trino, DuckDB, and DataFusion act as the query engines.

📖 Core concepts

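  • Parquet: columnar binary file format with built-in compression.
  • Table format: a metadata layer that tracks schema, snapshots, and ACID commits.
  • Time travel: querying an older snapshot.
  • Merge-on-Read vs Copy-on-Write: two ways to apply row-level updates and deletes (write small delta/delete files and merge at read time, or rewrite the affected data files at write time).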
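
The trade-off between the two update strategies can be illustrated with a toy in-memory model. This is only a sketch under simplifying assumptions: a "data file" is a Python tuple of row ids, and the helper names `cow_delete`, `mor_delete`, and `mor_read` are hypothetical, not part of any table-format API.

```python
# Toy model: a "data file" is an immutable tuple of row ids.

def cow_delete(files, row_id):
    """Copy-on-Write: rewrite every file that contains the row."""
    new_files = []
    rewritten = 0
    for f in files:
        if row_id in f:
            new_files.append(tuple(r for r in f if r != row_id))
            rewritten += 1
        else:
            new_files.append(f)  # untouched files are reused as-is
    return new_files, rewritten


def mor_delete(delete_log, row_id):
    """Merge-on-Read: the write only appends a small delete record."""
    return delete_log + [row_id]


def mor_read(files, delete_log):
    """Readers pay the cost: deletes are applied while scanning."""
    deleted = set(delete_log)
    return [r for f in files for r in f if r not in deleted]


files = [(1, 2, 3), (4, 5, 6)]

# Copy-on-Write: one file is rewritten, reads stay simple.
cow_files, rewritten = cow_delete(files, 5)
cow_rows = [r for f in cow_files for r in f]

# Merge-on-Read: no file is rewritten, reads merge the delete log.
delete_log = mor_delete([], 5)
mor_rows = mor_read(files, delete_log)
```

Both paths return the same rows; they differ in where the work happens, which is why Copy-on-Write tends to suit read-heavy tables and Merge-on-Read tends to suit frequent small updates.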

💻 Code patterns

Parquet (the base file format)

import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3], 'name': ['a', 'b', 'c']})
df.to_parquet('s3://bucket/data.parquet', engine='pyarrow', compression='zstd')

# Read
df = pd.read_parquet('s3://bucket/data.parquet')

→ Compression is handled by the writer, and reads can be limited to individual columns.

Apache Iceberg (Spark)

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') \
    .config('spark.sql.catalog.cat', 'org.apache.iceberg.spark.SparkCatalog') \
    .config('spark.sql.catalog.cat.type', 'hadoop') \
    .config('spark.sql.catalog.cat.warehouse', 's3://bucket/warehouse') \
    .getOrCreate()

# Create the table
spark.sql('''
CREATE TABLE cat.db.orders (
    id BIGINT,
    user_id STRING,
    amount DECIMAL(10, 2),
    created_at TIMESTAMP
) USING iceberg
PARTITIONED BY (days(created_at))
''')

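# Insert (an explicit TIMESTAMP literal avoids relying on implicit string casts)
spark.sql("INSERT INTO cat.db.orders VALUES (1, 'u1', 99.50, TIMESTAMP '2026-05-09 00:00:00')")

# Time travel: by snapshot ID (12345 is a placeholder), then by timestamp
spark.sql("SELECT * FROM cat.db.orders VERSION AS OF 12345")
spark.sql("SELECT * FROM cat.db.orders TIMESTAMP AS OF '2026-05-01'")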
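
How the two time-travel queries pick a snapshot can be sketched with a small in-memory model. This is an illustration only, assuming each commit produces an immutable snapshot that lists its data files; `ToyTable`, `Snapshot`, and the file names are hypothetical and do not mirror Iceberg's real classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    committed_at: str   # ISO-8601 string; same format compares lexicographically
    data_files: tuple   # every file visible in this snapshot


class ToyTable:
    def __init__(self):
        self.snapshots = []  # oldest first

    def commit(self, snapshot_id, committed_at, added_files):
        """Append-only commit: the new snapshot sees old files plus new ones."""
        prev = self.snapshots[-1].data_files if self.snapshots else ()
        self.snapshots.append(
            Snapshot(snapshot_id, committed_at, prev + tuple(added_files))
        )

    def version_as_of(self, snapshot_id):
        """Analogue of VERSION AS OF: exact snapshot id lookup."""
        for snap in self.snapshots:
            if snap.snapshot_id == snapshot_id:
                return snap
        raise KeyError(snapshot_id)

    def timestamp_as_of(self, ts):
        """Analogue of TIMESTAMP AS OF: latest snapshot committed at or before ts."""
        candidates = [s for s in self.snapshots if s.committed_at <= ts]
        if not candidates:
            raise LookupError("no snapshot at or before " + ts)
        return candidates[-1]


table = ToyTable()
table.commit(100, "2026-04-20T00:00:00", ["a.parquet"])
table.commit(200, "2026-05-05T00:00:00", ["b.parquet"])

by_time = table.timestamp_as_of("2026-05-01T00:00:00")
by_id = table.version_as_of(200)
```

Here `by_time` resolves to snapshot 100 and sees only `a.parquet`, while `by_id` sees both files. Old data stays queryable because snapshots are never modified in place.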

Iceberg with Trino / Athena / DuckDB

-- Trino
CREATE TABLE iceberg.db.orders (...)
WITH (format = 'PARQUET', partitioning = ARRAY['day(created_at)']);

-- DuckDB (modern, lightweight)
INSTALL iceberg;
LOAD iceberg;
SELECT * FROM iceberg_scan('s3://bucket/orders');

Schema evolution

ALTER TABLE cat.db.orders ADD COLUMN status STRING;
ALTER TABLE cat.db.orders RENAME COLUMN amount TO total;
ALTER TABLE cat.db.orders DROP COLUMN status;

→ Existing data files stay readable. Iceberg tracks columns by ID, so these changes are metadata-only and safe.

Partition evolution

ALTER TABLE cat.db.orders REPLACE PARTITION FIELD days(created_at) WITH hours(created_at);

→ Existing data keeps its old layout; only newly written data uses the new partition spec.
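
What the `days` and `hours` transforms compute, and why old and new files can coexist, can be sketched as follows. This assumes the transforms are counted from the Unix epoch in UTC; the `PARTITION_SPECS` mapping and `partition_value` helper are hypothetical names used only for illustration.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def days(ts):
    """Whole days since the Unix epoch."""
    return (ts - EPOCH).days


def hours(ts):
    """Whole hours since the Unix epoch."""
    return int((ts - EPOCH).total_seconds() // 3600)


# spec id -> transform; spec 0 is the original layout, spec 1 the evolved one
PARTITION_SPECS = {0: days, 1: hours}


def partition_value(spec_id, ts):
    """Each data file remembers the spec it was written with."""
    return PARTITION_SPECS[spec_id](ts)


ts = datetime(2026, 5, 9, 13, 30, tzinfo=timezone.utc)
old_partition = partition_value(0, ts)   # files written before the ALTER
new_partition = partition_value(1, ts)   # files written after the ALTER
```

Because every file carries its own spec id, a query planner can prune old files by day and new files by hour without rewriting anything.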

Compaction (small files → larger files)

CALL cat.system.rewrite_data_files('db.orders');

→ Addresses the small-file problem.
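
The idea behind compaction can be sketched as a greedy grouping of small files into larger outputs. This is a simplified illustration, not the planner that `rewrite_data_files` actually uses; `plan_compaction` and the 512 MB target are assumptions chosen for the example.

```python
def plan_compaction(file_sizes_mb, target_mb=512):
    """Greedily group files so each output group stays at or under target_mb."""
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb):
        # Close the current group when the next file would overflow it.
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


small_files = [8, 16, 32, 64, 100, 200, 300]
plan = plan_compaction(small_files, target_mb=512)
```

Seven input files collapse into two output groups, so a scan opens two files instead of seven while reading the same bytes.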

MERGE INTO (UPSERT)

MERGE INTO cat.db.orders t
USING new_orders s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
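
The semantics of the statement above can be mirrored in plain Python. This is a sketch of the matching logic only, assuming target and source share the same columns; `merge_into` and the sample rows are hypothetical.

```python
def merge_into(target, source, key="id"):
    """Upsert: update rows whose key matches, insert the rest."""
    merged = {row[key]: dict(row) for row in target}
    updated, inserted = 0, 0
    for row in source:
        if row[key] in merged:
            merged[row[key]] = dict(row)   # WHEN MATCHED THEN UPDATE SET *
            updated += 1
        else:
            merged[row[key]] = dict(row)   # WHEN NOT MATCHED THEN INSERT *
            inserted += 1
    rows = sorted(merged.values(), key=lambda r: r[key])
    return rows, updated, inserted


orders = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 20.0}]
new_orders = [{"id": 2, "amount": 25.0}, {"id": 3, "amount": 30.0}]

result, updated, inserted = merge_into(orders, new_orders)
```

Row 2 is overwritten, row 3 is added, and row 1 is left untouched. In a table format the whole operation lands as one atomic commit.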

Snapshot management

-- Expire old snapshots (reclaims storage)
CALL cat.system.expire_snapshots('db.orders', TIMESTAMP '2026-04-01');

-- Remove orphan files that no table metadata references any more
CALL cat.system.remove_orphan_files('db.orders');
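
Which files become deletable when snapshots expire can be sketched with sets. This is a simplified model, assuming each snapshot is a `(committed_at, files)` pair listed oldest first; the `expire_snapshots` function below is a hypothetical stand-in, not the real procedure, and its `retain_last` argument only imitates the idea of always keeping the most recent snapshots.

```python
def expire_snapshots(snapshots, older_than, retain_last=1):
    """Return (kept snapshots, files safe to delete)."""
    # The most recent snapshots are protected regardless of age.
    protected = snapshots[-retain_last:] if retain_last > 0 else []
    kept = [s for s in snapshots if s[0] >= older_than or s in protected]
    expired = [s for s in snapshots if s not in kept]

    live = set().union(*(files for _, files in kept))
    dead = set().union(*(files for _, files in expired))
    # A file is deletable only if no kept snapshot still references it.
    return kept, sorted(dead - live)


snapshots = [
    ("2026-03-01", {"a.parquet", "b.parquet"}),
    ("2026-03-20", {"b.parquet", "c.parquet"}),  # a.parquet was rewritten away
    ("2026-04-10", {"c.parquet", "d.parquet"}),
]

kept, deletable = expire_snapshots(snapshots, older_than="2026-04-01")
```

Only files that no surviving snapshot references are deleted, so `c.parquet` survives even though an expired snapshot also used it. Orphan-file removal is a separate step: it targets files that no metadata references at all, such as leftovers from failed writes.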

Delta Lake (Databricks-friendly)

from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

builder = SparkSession.builder.config(
    "spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtensions"
).config(
    "spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog"
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

spark.sql('CREATE TABLE db.orders (...) USING DELTA')
spark.sql('SELECT * FROM db.orders VERSION AS OF 5')

# Python API (new_data is assumed to be a DataFrame of incoming rows)
from delta.tables import DeltaTable

dt = DeltaTable.forPath(spark, '/path/to/orders')
dt.alias('t').merge(
    new_data.alias('s'),
    't.id = s.id'
).whenMatchedUpdateAll() \
 .whenNotMatchedInsertAll() \
 .execute()

# Time travel
df = spark.read.format('delta').option('versionAsOf', 5).load('/path')

Iceberg vs Delta vs Hudi

Iceberg:
+ Most open (Apache, vendor-neutral)
+ Strong schema and partition evolution
+ Large ecosystem (Snowflake, BigQuery, AWS, Trino)

Delta Lake:
+ Databricks native
+ New features ship quickly
- Only partly open source (not every Databricks feature is in the OSS release)

Hudi:
+ Streaming-friendly
+ Strong Merge-on-Read support
- Smaller community (vs Iceberg)

As of 2026, Iceberg is trending toward being the standard.

Streaming → Lakehouse

# Spark Structured Streaming
from pyspark.sql.functions import from_json

# `schema` is assumed to be a StructType describing the JSON payload
stream = spark.readStream.format('kafka').option(...).load()
parsed = stream.selectExpr('CAST(value AS STRING) as json').select(from_json('json', schema).alias('d'))
flat = parsed.select('d.*')

flat.writeStream \
    .format('iceberg') \
    .outputMode('append') \
    .option('checkpointLocation', 's3://checkpoints/events') \
    .trigger(processingTime='1 minute') \
    .toTable('cat.db.events')  # table identifier; toTable() starts the query (Spark 3.1+)

→ Real-time → Iceberg.

CDC ingestion (Debezium → Iceberg)

DB → Debezium → Kafka → Spark / Flink → Iceberg

File layout

s3://bucket/warehouse/db/orders/
├── data/
│   ├── created_at_day=2026-05-09/file-uuid.parquet
│   └── ...
└── metadata/
    ├── v1.metadata.json  (table metadata: schema, partition spec, snapshot list)
    ├── snap-xxx.avro     (manifest list, one per snapshot)
    └── yyy-m0.avro       (manifest: lists data files and their stats)

Catalog (REST / Hive / Glue / Nessie)

Hive Metastore: legacy
AWS Glue:       AWS native
REST catalog:   Iceberg standard
Nessie:         git-like branching
Polaris:        open
Tabular:        managed

# Nessie: branch / merge
# (assumes `cat` is configured as a Nessie catalog with the Nessie Spark SQL extensions)
spark.sql("CREATE BRANCH dev IN cat FROM main")
spark.sql("USE REFERENCE dev IN cat")
# Work on the dev branch without affecting production

Cost

S3 storage: $23/TB/month (Standard)
Glacier:    $4/TB/month (cold)

vs warehouse:
Snowflake: $40+/TB/month (compute billed separately)
BigQuery:  $20/TB/month + $6.25/TB query

→ Lakehouse = potentially large cost savings (figures are approximate list prices).
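
The figures above can be turned into a quick back-of-the-envelope comparison. This sketch simply multiplies the approximate per-TB prices quoted in this section; the 100 TB stored and 200 TB scanned are made-up workload numbers, and compute costs for S3-based engines and Snowflake are deliberately left out.

```python
# Approximate list prices from this section, in USD per TB per month.
STORAGE_PER_TB_MONTH = {
    "s3_standard": 23.0,
    "glacier": 4.0,
    "snowflake": 40.0,
    "bigquery": 20.0,
}
BIGQUERY_QUERY_PER_TB = 6.25  # USD per TB scanned


def monthly_storage_cost(stored_tb, tier):
    """Storage-only cost for one month."""
    return stored_tb * STORAGE_PER_TB_MONTH[tier]


def bigquery_monthly_cost(stored_tb, scanned_tb):
    """Storage plus on-demand query charges for one month."""
    return monthly_storage_cost(stored_tb, "bigquery") + scanned_tb * BIGQUERY_QUERY_PER_TB


stored_tb = 100     # hypothetical table size
scanned_tb = 200    # hypothetical monthly scan volume

s3_cost = monthly_storage_cost(stored_tb, "s3_standard")
snowflake_storage = monthly_storage_cost(stored_tb, "snowflake")
bigquery_total = bigquery_monthly_cost(stored_tb, scanned_tb)
```

With these inputs the storage lines come to $2,300 (S3), $4,000 (Snowflake), and $3,250 for BigQuery including scans, so the comparison depends heavily on how much compute and scanning the workload needs.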

Compute engines

Spark:       standard batch
Flink:       streaming
Trino:       interactive query
DuckDB:      single-node, fast
DataFusion:  Rust, embeddable
Snowflake / BigQuery: query via an external catalog

🤔 Decision criteria

Situation             Recommendation
New lake              Iceberg
Databricks            Delta Lake
Streaming heavy       Hudi, or Iceberg + Flink
Small / single node   DuckDB + Parquet
Analytic compute      Trino / Spark
Managed               Snowflake / BigQuery / Databricks

Anti-patterns

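  • CSV / JSON in production: expensive to parse and weakly typed. Use Parquet.
  • Many small files: queries slow down. Run compaction.
  • Over-fine partitioning: produces too many files.
  • Never expiring snapshots: storage grows without bound.
  • INSERT without regard to schema: breaks readers. Enforce the schema.
  • Direct, uncoordinated S3 writes: race conditions. Go through transactional commits.
  • No catalog, raw file paths only: schema cannot be tracked.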
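
Why transactional commits avoid the race in uncoordinated writes can be sketched with an optimistic compare-and-swap on a version pointer. This is a toy model, assuming the catalog can atomically swap the current metadata version; `ToyCatalog` and `write_with_retry` are hypothetical names, and the lock merely stands in for whatever atomic primitive a real catalog provides.

```python
import threading


class ToyCatalog:
    """Holds the pointer to the current metadata version of one table."""

    def __init__(self):
        self._lock = threading.Lock()
        self.version = 0
        self.files = ()

    def commit(self, expected_version, added_files):
        """Succeed only if nobody else committed since the writer read the table."""
        with self._lock:
            if self.version != expected_version:
                return False  # stale base: the writer must refresh and retry
            self.version += 1
            self.files = self.files + tuple(added_files)
            return True


def write_with_retry(catalog, added_files, max_attempts=5):
    """Re-read the current version and retry on conflict."""
    for attempt in range(1, max_attempts + 1):
        base = catalog.version
        if catalog.commit(base, added_files):
            return attempt
    raise RuntimeError("too many conflicting commits")


catalog = ToyCatalog()

# Two writers read the same base version.
base_a = catalog.version
base_b = catalog.version

ok_a = catalog.commit(base_a, ["a.parquet"])   # first writer wins
ok_b = catalog.commit(base_b, ["b.parquet"])   # second writer is rejected

# The rejected writer refreshes its base and retries.
attempts = write_with_retry(catalog, ["b.parquet"])
```

The second writer's stale commit is rejected instead of silently overwriting the first, and its retry lands cleanly on top, which is the guarantee that raw object-store writes cannot give.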

🤖 LLM usage hints

  • Iceberg + S3 + Trino/Spark is the modern OSS stack.
  • Pick a catalog (Glue / Nessie / Polaris).
  • Run compaction and snapshot expiry on a regular schedule.

🔗 Related documents