---
id: cs-btree-lsm-storage
title: "B-Tree vs LSM-Tree: Storage Engine"
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [cs, storage, btree, lsm, vibe-coding]
tech_stack: { language: "Concept", applicable_to: ["Database"] }
applied_in: []
aliases: [B-Tree, LSM-Tree, RocksDB, Postgres, MyISAM, write amplification, read amplification]
---

# B-Tree vs LSM-Tree

> The two main families of database storage engine. **B-Tree (Postgres / MySQL InnoDB) = fast reads, in-place updates**. **LSM-Tree (RocksDB / Cassandra / ScyllaDB) = fast writes, append-only**. The trade-off is among read amplification, write amplification, and space amplification.

## 📖 Key concepts

- B-Tree: balanced tree whose pages are updated in place.
- LSM: write → memtable → SSTable (immutable) → compaction.
- Read amplification: one logical read has to check N files or pages.
- Write amplification: one logical write results in N physical disk writes.
- Space amplification: on-disk size relative to logical data size (stale copies and tombstones add to it; compression reduces it).

## 💻 Code patterns

### How a B-Tree works

```
Read:   root → branch → leaf. O(log N) page accesses.
Write:  modify the page in place (in practice: WAL first, page flush later).
Delete: mark the entry dead inside the page; vacuum reclaims the space later.

Pros: O(log N) reads, fast range scans, mature.
Cons: page splits are expensive; a small random write rewrites a whole page.
```

### How an LSM-Tree works

```
Write:
  1. Add to the memtable (in RAM, sorted)
  2. Memtable full → flush to an SSTable (sorted, immutable)
  3. Compaction: merge multiple SSTables

Read:
  1. Check the memtable
  2. Check the SSTables at each level (Bloom filters let most be skipped)
  3. Return the newest version

Delete: write a tombstone; compaction cleans it up later.
```

### Compaction strategy

```
Leveled (RocksDB):
  - Level N+1 is ~10x the size of level N
  - Low read amp, high write amp

Tiered (Cassandra):
  - Merge SSTables of similar size within the same tier
  - Low write amp, high read amp

Hybrid: ScyllaDB.
```

### B-Tree page layout

```
[ Page header | Key1 → Pointer1 | Key2 → Pointer2 | ... ]

Page size:  typically 8KB (Postgres) / 16KB (MySQL InnoDB).
Fillfactor: e.g. 80% leaves free space in each page for UPDATEs (enables HOT updates in Postgres).
            Postgres defaults: 100 for tables, 90 for B-Tree indexes.
```

### LSM SSTable layout

```
[ Header | Index | Bloom filter | Sorted key-value pairs | Footer ]   (conceptual; the exact order varies by format)

Index        = sparse (every Nth key).
Bloom filter = fast check for whether a key is definitely absent from this SSTable (false positives possible, false negatives not).
```

### Write amplification in practice

```
Insert 1 byte → N bytes written to disk. Rough, workload-dependent figures:

B-Tree:        typically 2-10x (page write + WAL).
LSM (leveled): 10-30x (compaction).
LSM (tiered):  5-15x.
```

### Read amplification

```
Get key X →
  B-Tree: O(log N) pages (usually served from cache).
  LSM:    memtable + several levels. Bloom filters help skip SSTables.
```

### Space amplification

```
1GB of data →
  B-Tree: 1GB + indexes. ~1.5x.
  LSM:    1GB + tombstones + stale versions, offset by compression. ~1.1-2x (depends on how far compaction has caught up).
```

### Suitable use cases

```
B-Tree:
  - OLTP (random read + update + delete)
  - Consistent read latency
  - Frequent range queries
  - Postgres / MySQL / SQLite

LSM:
  - Write-heavy (time series, logs)
  - Fast ingestion
  - Range scans are also OK
  - Cassandra / RocksDB / LevelDB / ScyllaDB
```

### Hybrid

```
Postgres (heap + WAL): B-Tree indexes, but the append-only WAL gives it a log-structured aspect.
ZFS / Btrfs: copy-on-write file systems, LSM-like in that they never overwrite in place.
```

### Tuning: Postgres B-Tree

```sql
-- Table page fill factor (for UPDATE-heavy tables); applies to newly written pages
ALTER TABLE x SET (fillfactor = 80);

-- Index fill factor
CREATE INDEX ON x (col) WITH (fillfactor = 90);

-- Vacuum more often (prevents bloat)
ALTER TABLE x SET (autovacuum_vacuum_scale_factor = 0.05);
```

### Tuning: RocksDB LSM

```
write_buffer_size:                   memtable size
max_write_buffer_number:             max number of memtables held in memory
level0_file_num_compaction_trigger:  number of L0 files that triggers compaction
target_file_size_base:               SSTable size
compression_per_level:               compression for each level
Bloom filter bits per key:           speeds up point reads (set through filter_policy, e.g. NewBloomFilterPolicy(10))
```

### Library usage: Node

```ts
// `level` is backed by LevelDB in Node; RocksDB needs a separate binding.
// Top-level await requires an ES module.
import { Level } from 'level';

const db = new Level('./db', { valueEncoding: 'json' });

await db.put('key', { value: 42 });
const stored = await db.get('key');
console.log(stored);

// Range scan: keys between 'a' and 'z', returned in key order
for await (const [k, v] of db.iterator({ gte: 'a', lte: 'z' })) {
  console.log(k, v);
}
```

### Sorted vs unsorted

```
B-Tree: inherently sorted (by key).
LSM:    sorted (by key), so range scans work.
Hash:   unsorted (no range scans, point lookups only), e.g. Memcached, hash indexes.
```

### Cache hierarchy

```
RAM (page cache / memtable) → SSD (data) → older SSD / HDD (cold).

Postgres shared_buffers: ~25% of RAM is the usual recommendation.
RocksDB block_cache:     size it according to the workload.
```

### Algorithm visualization

```
B-Tree insertion:
  1. Find the leaf
  2. If full → split, push the median key up
  3. Repeat recursively up the tree

LSM compaction:
  1. L0 file count > threshold → merge into L1
  2. L1 size > target → merge oldest into L2
  ...
```

### Modern variants

```
Fractal Tree: B-Tree + per-node log buffers (TokuDB).
Bw-Tree: lock-free B-Tree variant (Hekaton, Microsoft).
Adaptive Radix Tree (ART): in-memory databases.
LSM with Bloom filters per level.
```

## 🤔 Decision criteria

| Workload | Engine |
|---|---|
| OLTP (banking, orders) | B-Tree (Postgres / InnoDB) |
| Time-series / logs | LSM (Cassandra). TimescaleDB runs on Postgres, so it is B-Tree based. |
| Write-heavy + range | LSM (RocksDB) |
| Mostly read | B-Tree |
| Embedded | LevelDB (LSM) / SQLite (B-Tree) |
| Distributed write | LSM (Cassandra / ScyllaDB) |

## ❌ Anti-patterns

- **Heavy random inserts into a B-Tree** (e.g. random UUID keys): page splits explode. Use time-ordered keys such as UUID v7.
- **Frequent overwrites of small values on LSM**: high write amplification. Consider a different storage engine.
- **Turning compaction off on LSM**: read amplification explodes.
- **Turning vacuum off on a Postgres B-Tree**: bloat.
- **Turning Bloom filters off on LSM**: every point read has to check every candidate SSTable.
- **Ignoring cache size**: frequent disk hits.
- **Assuming B-Tree behavior while running an LSM database**: the trade-offs are not the ones you planned for.

## 🤖 Hints for LLM use

- Postgres / MySQL = B-Tree (in most cases).
- Cassandra / RocksDB = LSM (write-heavy).
- Knowing which engine you are on is what makes tuning accurate.

## 🔗 Related documents

- [[DB_Index_Strategy]]
- [[DB_Vacuum_Autovacuum]]
- [[DB_Time_Series_Patterns]]
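
## 🧪 Appendix: LSM write/read path sketch

The LSM steps described in the code patterns above (write to the memtable, flush to an immutable sorted SSTable, read newest-first, delete via tombstone, merge during compaction) can be made concrete with a toy in-memory model. This is only a minimal sketch under stated assumptions: `TinyLsm`, `flushThreshold`, and `tableCount` are hypothetical names, "SSTables" are plain sorted arrays rather than files, and there is no WAL, Bloom filter, or leveling. It is not how RocksDB or Cassandra implement any of this.

```typescript
// Toy LSM model. All names are hypothetical; nothing here is a real engine's API.
type Entry = [key: string, value: string | null]; // null marks a tombstone

const byKey = (a: Entry, b: Entry): number =>
  a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0;

// SSTables are sorted by key, so point lookups can use binary search.
function binarySearch(table: readonly Entry[], key: string): Entry | undefined {
  let lo = 0;
  let hi = table.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const entry = table[mid];
    if (entry[0] === key) return entry;
    if (entry[0] < key) lo = mid + 1;
    else hi = mid - 1;
  }
  return undefined;
}

class TinyLsm {
  private memtable = new Map<string, string | null>();
  private sstables: Entry[][] = []; // newest first; each table is never mutated after flush
  private readonly flushThreshold: number;

  constructor(flushThreshold = 4) {
    this.flushThreshold = flushThreshold;
  }

  put(key: string, value: string): void {
    this.write(key, value);
  }

  delete(key: string): void {
    this.write(key, null); // a delete is just a write of a tombstone
  }

  get(key: string): string | undefined {
    // 1. The memtable holds the newest data.
    if (this.memtable.has(key)) return this.memtable.get(key) ?? undefined;
    // 2. Then SSTables, newest to oldest; the first hit wins (tombstone → not found).
    for (const table of this.sstables) {
      const hit = binarySearch(table, key);
      if (hit !== undefined) return hit[1] ?? undefined;
    }
    return undefined;
  }

  // Full compaction: merge every SSTable into one, keeping only the newest version.
  // Dropping tombstones is safe here only because ALL tables are merged at once.
  compact(): void {
    const merged = new Map<string, string | null>();
    for (const table of this.sstables.slice().reverse()) { // oldest → newest
      for (const [k, v] of table) merged.set(k, v);
    }
    const live = Array.from(merged.entries()).filter(([, v]) => v !== null).sort(byKey);
    this.sstables = live.length > 0 ? [live] : [];
  }

  tableCount(): number {
    return this.sstables.length;
  }

  private write(key: string, value: string | null): void {
    this.memtable.set(key, value);
    if (this.memtable.size >= this.flushThreshold) this.flush();
  }

  private flush(): void {
    this.sstables.unshift(Array.from(this.memtable.entries()).sort(byKey));
    this.memtable.clear();
  }
}

// Usage: a threshold of 2 forces a flush after every second write.
const db = new TinyLsm(2);
db.put("a", "1");
db.put("b", "2"); // flush → first SSTable
db.put("a", "3");
db.delete("b");   // flush → second SSTable, which shadows the first
console.log(db.get("a"), db.get("b"), db.tableCount());
```

Before `compact()` is called, the stale `a=1` and `b=2` entries still occupy space in the older table (space amplification) and a miss in the newest table falls through to older ones (read amplification). Calling `db.compact()` rewrites the surviving data into a single table, which is the write amplification side of the same trade-off.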