---
id: data-eng-lakehouse
title: Lakehouse - Iceberg / Delta / Parquet
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [data-engineering, lakehouse, iceberg, parquet, vibe-coding]
tech_stack: { language: "SQL / Python", applicable_to: ["Data Engineering"] }
applied_in: []
aliases: [Apache Iceberg, Delta Lake, Hudi, Parquet, lakehouse, ACID on object storage]
---

# Lakehouse (Iceberg / Delta / Hudi)

> Object storage (S3) + table format = warehouse-style transactions at data-lake cost. **Apache Iceberg = open standard, Delta Lake (Databricks), Hudi**. Queried by Spark / Trino / DuckDB / DataFusion.

## 📖 Core concepts

- Parquet: columnar binary format with built-in compression.
- Table format: metadata layer (schema, snapshots, ACID).
- Time travel: query older snapshots.
- Merge-on-Read vs Copy-on-Write.

## 💻 Code patterns

### Parquet (base file format)

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3], 'name': ['a', 'b', 'c']})

# Write (s3:// paths need the s3fs package installed)
df.to_parquet('s3://bucket/data.parquet', engine='pyarrow', compression='zstd')

# Read
df = pd.read_parquet('s3://bucket/data.parquet')
```

→ Compression is applied automatically, and individual columns can be read without scanning the rest.
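
To make the column-level read point concrete, here is a minimal stdlib-only sketch. It is a toy model, not Parquet's real on-disk format: `write_columnar` and `read_columns` are hypothetical helpers, and JSON + zlib stand in for Parquet's encodings and codecs. The only idea it borrows is that each column is stored and compressed as its own chunk.

```python
import json
import zlib


def write_columnar(rows):
    """Store every column as its own compressed chunk (toy stand-in for column chunks)."""
    chunks = {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        chunks[name] = zlib.compress(json.dumps(values).encode("utf-8"))
    return chunks


def read_columns(chunks, names):
    """Decompress only the requested columns; the other chunks are never touched."""
    return {name: json.loads(zlib.decompress(chunks[name])) for name in names}


rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 3, "name": "c"}]
chunks = write_columnar(rows)
ids_only = read_columns(chunks, ["id"])
print(ids_only)
```

With a real Parquet file, the same projection is requested through the `columns` argument, for example `pd.read_parquet(path, columns=['id'])`.
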
### Apache Iceberg (Spark)

```python
from pyspark.sql import SparkSession

# The 'hadoop' catalog type keeps this demo small. Object stores lack atomic rename,
# so prefer a REST / Glue / Nessie catalog when several writers commit concurrently.
spark = SparkSession.builder \
    .config('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') \
    .config('spark.sql.catalog.cat', 'org.apache.iceberg.spark.SparkCatalog') \
    .config('spark.sql.catalog.cat.type', 'hadoop') \
    .config('spark.sql.catalog.cat.warehouse', 's3://bucket/warehouse') \
    .getOrCreate()

# Create table
spark.sql('''
CREATE TABLE cat.db.orders (
    id BIGINT,
    user_id STRING,
    amount DECIMAL(10, 2),
    created_at TIMESTAMP
) USING iceberg
PARTITIONED BY (days(created_at))
''')

# Insert
spark.sql("INSERT INTO cat.db.orders VALUES (1, 'u1', 99.50, TIMESTAMP '2026-05-09 00:00:00')")

# Time travel (12345 is a placeholder snapshot ID)
spark.sql("SELECT * FROM cat.db.orders VERSION AS OF 12345")
spark.sql("SELECT * FROM cat.db.orders TIMESTAMP AS OF '2026-05-01'")
```

### Iceberg with Trino / Athena / DuckDB

```sql
-- Trino
CREATE TABLE iceberg.db.orders (...)
WITH (format = 'PARQUET', partitioning = ARRAY['day(created_at)']);

-- DuckDB (modern, lightweight)
INSTALL iceberg;
LOAD iceberg;
SELECT * FROM iceberg_scan('s3://bucket/orders');
```

### Schema evolution

```sql
ALTER TABLE cat.db.orders ADD COLUMN status STRING;
ALTER TABLE cat.db.orders RENAME COLUMN amount TO total;
ALTER TABLE cat.db.orders DROP COLUMN status;
```

→ Existing files stay readable. These are metadata-only changes, so they are safe.

### Partition evolution

```sql
ALTER TABLE cat.db.orders REPLACE PARTITION FIELD days(created_at) WITH hours(created_at);
```

→ Existing data keeps its old layout. Only newly written data uses the new partitioning.

### Compaction (small files → large files)

```sql
CALL cat.system.rewrite_data_files('db.orders');
```

→ Solves the small-file problem.
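
The idea behind compaction can be sketched as a grouping problem: pack many small files into groups close to a target size, then rewrite each group as one file. The snippet below is a toy greedy planner in plain Python. `plan_compaction` and the 512 MB target are illustrative assumptions, not the strategy `rewrite_data_files` actually implements.

```python
def plan_compaction(file_sizes_mb, target_mb=512):
    """Greedily group files (largest first) so each group stays at or under target_mb."""
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb, reverse=True):
        # Close the current group when the next file would push it past the target.
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


small_files = [8, 16, 4, 120, 64, 300, 32, 200]  # hypothetical file sizes in MB
plan = plan_compaction(small_files, target_mb=512)
print(f"{len(small_files)} files -> {len(plan)} rewritten files: {plan}")
```

Each inner list is one rewrite task. A query that previously opened eight files would open two after the rewrite, which is where the read-side gain comes from.
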
### MERGE INTO (UPSERT)

```sql
MERGE INTO cat.db.orders t
USING new_orders s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```

### Snapshot management

```sql
-- Expire old snapshots (saves storage)
CALL cat.system.expire_snapshots('db.orders', TIMESTAMP '2026-04-01');

-- Clean up files no longer referenced by table metadata
CALL cat.system.remove_orphan_files('db.orders');
```

### Delta Lake (Databricks-friendly)

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = SparkSession.builder.config(
    "spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtensions"
).config(
    "spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog"
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

spark.sql('CREATE TABLE db.orders (...) USING DELTA')
spark.sql('SELECT * FROM db.orders VERSION AS OF 5')
```

```python
# Python API (`spark` is the session built above; `new_data` is a DataFrame of incoming rows)
from delta.tables import DeltaTable

dt = DeltaTable.forPath(spark, '/path/to/orders')
dt.alias('t').merge(
    new_data.alias('s'),
    't.id = s.id'
).whenMatchedUpdateAll() \
 .whenNotMatchedInsertAll() \
 .execute()

# Time travel
df = spark.read.format('delta').option('versionAsOf', 5).load('/path')
```

### Iceberg vs Delta vs Hudi

```
Iceberg:
  + Most open (Apache, vendor-neutral)
  + Strong schema / partition evolution
  + Large ecosystem (Snowflake, BigQuery, AWS, Trino)

Delta Lake:
  + Databricks native
  + Gets modern features quickly
  - Only partly open source (not all of DI)

Hudi:
  + Streaming-friendly
  + Strong Merge-on-Read
  - Smaller community (vs Iceberg)
```

→ **As of 2026, Iceberg is trending toward being the standard.**
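
The comparison above mentions Merge-on-Read, which the core concepts contrast with Copy-on-Write. The trade-off can be sketched with a toy model. Plain Python lists stand in for data files, all function names are hypothetical, and only updates are modeled (no inserts or deletes), so this illustrates the idea rather than how any of the three formats lays out files.

```python
def cow_update(base_file, key, new_row):
    """Copy-on-Write: produce a full rewritten copy of the file with the change applied."""
    return [new_row if row["id"] == key else row for row in base_file]


def mor_update(delta_log, key, new_row):
    """Merge-on-Read: append a small delta record; the base file is left untouched."""
    return delta_log + [(key, new_row)]


def mor_read(base_file, delta_log):
    """Readers pay the merge cost: the latest delta per key overrides the base row."""
    latest = {key: row for key, row in delta_log}
    return [latest.get(row["id"], row) for row in base_file]


base = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]
change = {"id": 2, "amount": 25}

cow_result = cow_update(base, 2, change)  # expensive write, cheap read
log = mor_update([], 2, change)           # cheap write
mor_result = mor_read(base, log)          # merge happens at read time
print(cow_result == mor_result)
```

Both paths return the same rows. They differ in when the work happens: Copy-on-Write pays at write time, Merge-on-Read pays at read time until a compaction folds the deltas back into the base files, which is why the latter suits frequent streaming updates.
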
### Streaming → Lakehouse

```python
from pyspark.sql.functions import from_json

# Spark Structured Streaming (`schema` is the StructType describing the JSON payload)
stream = spark.readStream.format('kafka').option(...).load()
parsed = stream.selectExpr('CAST(value AS STRING) as json').select(from_json('json', schema).alias('d'))
flat = parsed.select('d.*')

flat.writeStream \
    .format('iceberg') \
    .outputMode('append') \
    .option('path', 'cat.db.events') \
    .option('checkpointLocation', 's3://checkpoints/events') \
    .trigger(processingTime='1 minute') \
    .start()
```

→ Real-time stream → Iceberg.

### CDC ingestion (Debezium → Iceberg)

```
DB → Debezium → Kafka → Spark / Flink → Iceberg
```

### File layout

```
s3://bucket/warehouse/db/orders/
├── data/
│   ├── created_at_day=2026-05-09/file-uuid.parquet
│   └── ...
└── metadata/
    ├── v1.metadata.json   (table metadata: schema, partition spec, snapshot list)
    ├── snap-xxx.avro      (manifest list, one per snapshot)
    └── yyy-m0.avro        (manifest: tracks data files)
```

### Catalog (REST / Hive / Glue / Nessie)

```
Hive Metastore: legacy
AWS Glue:       AWS native
REST catalog:   Iceberg standard
Nessie:         git-like branching
Polaris:        open
Tabular:        managed
```

```python
# Nessie: branch / merge
spark.sql("CREATE BRANCH dev IN cat FROM main")
spark.sql("USE REFERENCE dev IN cat")
# Work on the dev branch; production (main) is unaffected
```

### Cost

```
S3 storage: $23/TB/month (Standard)
Glacier:    $4/TB/month (cold)

vs warehouse:
Snowflake:  $40+/TB/month (compute billed separately)
BigQuery:   $20/TB/month + $6.25/TB query
```

→ Lakehouse = large cost savings.

### Compute engines

```
Spark:                standard batch
Flink:                streaming
Trino:                interactive query
DuckDB:               single-node, fast
DataFusion:           Rust, embeddable
Snowflake / BigQuery: query through an external catalog
```

## 🤔 Decision criteria

| Situation | Recommendation |
|---|---|
| New lake | Iceberg |
| Databricks | Delta Lake |
| Streaming heavy | Hudi, or Iceberg + Flink |
| Small / single node | DuckDB + Parquet |
| Analytic compute | Trino / Spark |
| Managed | Snowflake / BigQuery / Databricks |

## ❌ Anti-patterns

- **CSV / JSON in prod**: expensive to parse, weak schema. Use Parquet.
- **Too many small files**: queries slow down. Run compaction.
- **Over-fine partitioning**: produces too many files.
- **Never expiring snapshots**: storage grows without bound.
- **INSERT that ignores the schema**: breaks readers. Enforce the schema.
- **Direct S3 writes without coordination**: race conditions. Go through transactional commits.
- **No catalog, raw file paths only**: schema changes cannot be tracked.

## 🤖 LLM usage hints

- Iceberg + S3 + Trino/Spark is the modern OSS stack.
- Use a catalog (Glue / Nessie / Polaris).
- Schedule compaction and snapshot expiry regularly.

## 🔗 Related documents

- [[Data_Eng_dbt]]
- [[Data_Eng_Airflow_Dagster]]
- [[DB_ClickHouse_OLAP]]