| id | title | category | status | source_trust_level | verification_status | created_at | updated_at | tags | tech_stack | applied_in | aliases |
|---|---|---|---|---|---|---|---|---|---|---|---|
| cs-btree-lsm-storage | B-Tree vs LSM-Tree: Storage Engines | Coding | draft | B | conceptual | 2026-05-09 | 2026-05-09 | | | | |
B-Tree vs LSM-Tree
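Databases are typically built on one of two storage engine families. B-Trees (Postgres, MySQL InnoDB) update pages in place and favor fast reads. LSM-Trees (RocksDB, Cassandra, ScyllaDB) append to immutable files and favor fast writes. The trade-off is expressed as read amplification, write amplification, and space amplification.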
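📖 Key concepts
- B-Tree: a balanced tree of pages, updated in place.
- LSM: write → memtable → SSTable (immutable) → compaction.
- Read amplification: one logical read has to check N files or pages.
- Write amplification: one logical write causes N physical disk writes.
- Space amplification: bytes on disk relative to the live data (stale copies add to it, compression reduces it).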
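💻 Code patterns
How a B-Tree works
Read: root → branch → leaf, O(log N) page accesses.
Write: modify the page in place (in practice, append to the WAL first and flush the dirty page later).
Delete: mark the entry dead inside the page, then let vacuum reclaim the space (Postgres behavior).
Pros: O(log N) reads, fast range scans, mature implementations.
Cons: page splits are expensive, and a small random write rewrites a whole page.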
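The root → branch → leaf descent can be sketched in a few lines. This is a minimal in-memory illustration, not how any particular database lays out pages: the node shape (`keys` plus `children` for branches, `keys` plus `values` for leaves) and the function name `btreeSearch` are assumptions made for the example.

```javascript
// Hypothetical node shapes (assumptions for this sketch):
//   branch: { keys: [k0, k1, ...], children: [c0, c1, ...] }  // children.length === keys.length + 1
//   leaf:   { keys: [...], values: [...] }
// Child i holds keys in the range [keys[i-1], keys[i]).
function btreeSearch(root, key) {
  let node = root;
  let pagesVisited = 0;
  while (true) {
    pagesVisited += 1; // each node stands in for one page read
    if (node.children === undefined) {
      const i = node.keys.indexOf(key);
      return { value: i === -1 ? undefined : node.values[i], pagesVisited };
    }
    // Pick the child whose key range covers `key`.
    let i = 0;
    while (i < node.keys.length && key >= node.keys[i]) i += 1;
    node = node.children[i];
  }
}

const leaf = (keys) => ({ keys, values: keys.map((k) => `row-${k}`) });
const root = {
  keys: [30, 60],
  children: [leaf([10, 20]), leaf([30, 40, 50]), leaf([60, 70])],
};

console.log(btreeSearch(root, 40)); // { value: 'row-40', pagesVisited: 2 }
```

The number of pages visited equals the tree height, which is why point reads stay at O(log N) regardless of where the key sits.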
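How an LSM-Tree works
Write:
1. Add the entry to the memtable (in RAM, sorted).
2. When the memtable fills up, flush it to an SSTable (sorted, immutable).
3. Compaction: merge several SSTables into one.
Read:
1. Check the memtable.
2. Check the SSTables at each level (Bloom filters let the read skip files that cannot contain the key).
3. Return the most recent version.
Delete: write a tombstone. Compaction removes it later.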
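The write, read, and delete paths above can be sketched as a toy in-memory store. This is only an illustration under simplifying assumptions: the class name `MiniLsm`, the `memtableLimit` parameter, and the `TOMBSTONE` marker are made up for the example, SSTables are plain arrays rather than files, and there is no WAL, Bloom filter, or compaction.

```javascript
const TOMBSTONE = Symbol('tombstone'); // hypothetical delete marker

class MiniLsm {
  constructor(memtableLimit = 2) {
    this.memtableLimit = memtableLimit;
    this.memtable = new Map(); // newest writes, mutable
    this.sstables = [];        // immutable sorted runs, newest first
  }

  put(key, value) {
    this.memtable.set(key, value);
    if (this.memtable.size >= this.memtableLimit) this.flush();
  }

  // A delete is just a write of a tombstone.
  delete(key) {
    this.put(key, TOMBSTONE);
  }

  // Turn the memtable into a sorted, frozen run and start a fresh memtable.
  flush() {
    if (this.memtable.size === 0) return;
    const run = [...this.memtable.entries()].sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
    this.sstables.unshift(Object.freeze(run));
    this.memtable = new Map();
  }

  // Memtable first, then SSTables from newest to oldest; first hit wins.
  get(key) {
    if (this.memtable.has(key)) return unwrap(this.memtable.get(key));
    for (const run of this.sstables) {
      const hit = run.find(([k]) => k === key);
      if (hit) return unwrap(hit[1]);
    }
    return undefined;
  }
}

function unwrap(value) {
  return value === TOMBSTONE ? undefined : value;
}

const db = new MiniLsm(2);
db.put('a', 1);
db.put('b', 2);  // memtable full, flushed as the first SSTable
db.put('a', 10);
db.delete('b');  // memtable full again, flushed as the second SSTable
console.log(db.get('a'), db.get('b'), db.sstables.length); // 10 undefined 2
```

Note that the old values `a=1` and `b=2` are still stored in the older run. They only disappear when compaction merges the runs, which is where the space and write amplification of an LSM come from.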
Compaction strategy
Leveled (RocksDB):
- Level N+1 is ~10x the size of level N
- Low read amp, high write amp
Tiered (Cassandra):
- Merge the small SSTables of the same tier into a larger one
- Low write amp, high read amp
Hybrid: ScyllaDB.
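Whatever the strategy, the core of a compaction is a merge in which the newest version of each key wins. The sketch below assumes a hypothetical entry shape `{ key, value, deleted }` and a function name `compact` chosen for the example. It collects entries in a map instead of doing the streaming k-way merge a real engine would use, but the output is the same.

```javascript
// `tables` is ordered newest first; each table is a sorted array of
// { key, value, deleted } entries (hypothetical shape for this sketch).
function compact(tables, { dropTombstones = false } = {}) {
  const newest = new Map();
  for (const table of tables) {
    for (const entry of table) {
      // The first table that mentions a key is the newest one, so keep that entry.
      if (!newest.has(entry.key)) newest.set(entry.key, entry);
    }
  }
  return [...newest.values()]
    .filter((entry) => !(dropTombstones && entry.deleted))
    .sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
}

const newer = [{ key: 'a', value: 10 }, { key: 'b', deleted: true }];
const older = [{ key: 'a', value: 1 }, { key: 'b', value: 2 }, { key: 'c', value: 3 }];

console.log(compact([newer, older], { dropTombstones: true }));
// [ { key: 'a', value: 10 }, { key: 'c', value: 3 } ]
```

Dropping tombstones is only safe when no older SSTable outside the merge can still hold the deleted key, which is why the sketch makes it an explicit option.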
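B-Tree page structure
[ Page header | Key1 → Pointer1 | Key2 → Pointer2 | ... ]
Page size: 8KB by default in Postgres, 16KB in MySQL InnoDB.
Fillfactor: a value such as 80% leaves free space in each page for UPDATEs, which makes HOT updates possible in Postgres (the Postgres table default is 100).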
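The effect of fillfactor on page capacity is simple arithmetic. The sketch below is a rough estimate only: it ignores the page header, line pointers, and per-row headers, and the 128-byte row size and the name `rowsPerPage` are hypothetical.

```javascript
// Rough estimate of how many rows are packed into one page at insert time.
// Ignores page and row header overhead (a simplifying assumption).
function rowsPerPage(pageBytes, fillfactorPercent, rowBytes) {
  const usableBytes = Math.floor((pageBytes * fillfactorPercent) / 100);
  return Math.floor(usableBytes / rowBytes);
}

console.log(rowsPerPage(8192, 100, 128)); // 64
console.log(rowsPerPage(8192, 80, 128));  // 51
```

With fillfactor 80 the page holds fewer rows up front, and the remaining space is what lets an updated row version stay on the same page.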
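LSM SSTable layout
[ Header | Index | Bloom filter | Sorted key-value pairs | Footer ]
Index = sparse (every Nth key).
Bloom filter = a fast check for whether a key is definitely absent from this SSTable (false positives are possible, false negatives are not).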
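A Bloom filter is small enough to sketch in full. The version below is illustrative rather than what any engine ships: the class name, the default sizes, and the seeded FNV-1a-style hash are all assumptions chosen to keep the example short.

```javascript
class BloomFilter {
  constructor(bits = 1024, hashes = 3) {
    this.bits = bits;
    this.hashes = hashes;
    this.bitmap = new Uint8Array(Math.ceil(bits / 8));
  }

  // FNV-1a-style string hash, varied by a seed so one function yields several hashes.
  hash(key, seed) {
    let h = (2166136261 ^ seed) >>> 0;
    for (let i = 0; i < key.length; i += 1) {
      h ^= key.charCodeAt(i);
      h = Math.imul(h, 16777619) >>> 0;
    }
    return h % this.bits;
  }

  add(key) {
    for (let seed = 0; seed < this.hashes; seed += 1) {
      const bit = this.hash(key, seed);
      this.bitmap[bit >> 3] |= 1 << (bit & 7);
    }
  }

  // false → the key is definitely absent; true → the key may be present.
  mightContain(key) {
    for (let seed = 0; seed < this.hashes; seed += 1) {
      const bit = this.hash(key, seed);
      if ((this.bitmap[bit >> 3] & (1 << (bit & 7))) === 0) return false;
    }
    return true;
  }
}

const filter = new BloomFilter();
filter.add('user:1');
filter.add('user:2');
console.log(filter.mightContain('user:1')); // true
console.log(filter.mightContain('user:999'));
```

A read consults the filter before touching the SSTable: a `false` answer skips the file entirely, while a `true` answer still requires the index lookup because it may be a false positive.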
Measuring write amplification
Inserting 1 byte causes N bytes of disk writes. The figures below are rough and depend on the workload.
B-Tree: typically 2-10x (page write + WAL).
LSM (leveled): 10-30x (compaction).
LSM (tiered): 5-15x.
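Measured write amplification is just the ratio of bytes the storage device received to bytes the application asked to write. A minimal sketch, assuming you already have both counters from your engine or OS statistics; the function name and the sample numbers are hypothetical.

```javascript
// Write amplification = bytes written to the device / bytes written by the application.
function writeAmplification(userBytes, deviceBytes) {
  if (userBytes <= 0) throw new RangeError('userBytes must be positive');
  return deviceBytes / userBytes;
}

// Hypothetical counters: 1 GiB of application writes, 12 GiB written to disk.
const GiB = 1024 ** 3;
console.log(writeAmplification(1 * GiB, 12 * GiB)); // 12
```

Sampling both counters over a long window matters, because compaction and checkpoint writes arrive well after the application write that caused them.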
Read amplification
Get key X →
B-Tree: O(log N) pages (the cache usually absorbs most of them).
LSM: the memtable plus several levels. Bloom filters help skip SSTables.
Space amplification
1GB of data →
B-Tree: 1GB + indexes, roughly 1.5x.
LSM: 1GB after compression, plus tombstones and old versions, roughly 1.1-2x depending on how far compaction has progressed.
Suitable use cases
B-Tree:
- OLTP (random reads + updates + deletes)
- Consistent read latency
- Frequent range queries
- Postgres / MySQL / SQLite
LSM:
- Write-heavy workloads (time series, logs)
- Fast ingestion
- Range scans are also fine
- Cassandra / RocksDB / LevelDB / ScyllaDB
Hybrid
Postgres (heap + WAL): B-Tree indexes, but with log-structured aspects.
ZFS / Btrfs: copy-on-write file systems, which resemble an LSM in some respects.
Tuning: Postgres B-Tree
-- Table fillfactor (for UPDATE-heavy tables)
ALTER TABLE x SET (fillfactor = 80);
-- Index fillfactor
CREATE INDEX ON x (col) WITH (fillfactor = 90);
-- Vacuum more often (prevents bloat)
ALTER TABLE x SET (autovacuum_vacuum_scale_factor = 0.05);
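Tuning: RocksDB LSM
write_buffer_size: memtable size
max_write_buffer_number: maximum number of memtables held in memory
level0_file_num_compaction_trigger: number of L0 files that triggers compaction into L1
target_file_size_base: target SSTable file size
compression_per_level: compression for each level
Bloom filter bits per key (configured through the block-based table's filter_policy): speeds up reads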
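Library usage: Node
// `level` is backed by LevelDB in Node; LevelDB and RocksDB are both LSM stores.
// Run as an ES module so that top-level await is available.
import { Level } from 'level';
const db = new Level('./db', { valueEncoding: 'json' });
await db.put('key', { value: 42 });
const v = await db.get('key');
// Range scan: keys come back in sorted order
for await (const [key, value] of db.iterator({ gte: 'a', lte: 'z' })) {
  console.log(key, value);
}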
Sorted vs unsorted
B-Tree: inherently sorted by key.
LSM: sorted by key, so range scans work.
Hash: unsorted (point lookups only, no range scans), e.g. Memcached or a hash index.
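The difference shows up as soon as you ask for a key range. In the sketch below a sorted array stands in for a B-Tree leaf level or an SSTable, and a `Map` stands in for a hash index; the helper names `lowerBound` and `rangeScan` are made up for the example.

```javascript
// Binary search: index of the first key >= target.
function lowerBound(sortedKeys, target) {
  let lo = 0;
  let hi = sortedKeys.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (sortedKeys[mid] < target) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

// Sorted structure: seek once, then read neighbors until the upper bound.
function rangeScan(sortedKeys, gte, lte) {
  const out = [];
  for (let i = lowerBound(sortedKeys, gte); i < sortedKeys.length && sortedKeys[i] <= lte; i += 1) {
    out.push(sortedKeys[i]);
  }
  return out;
}

const sorted = ['apple', 'banana', 'cherry', 'date', 'fig'];
console.log(rangeScan(sorted, 'b', 'd')); // [ 'banana', 'cherry' ]

// Hash structure: point lookups are cheap, but a range query must visit every key.
const hashIndex = new Map(sorted.map((k) => [k, true]));
console.log(hashIndex.has('cherry')); // true
console.log([...hashIndex.keys()].filter((k) => k >= 'b' && k <= 'd').length); // 2
```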
Cache hierarchy
RAM (page cache / memtable) → SSD (data) → older SSD / HDD (cold).
Postgres shared_buffers: 25% of RAM is the recommended starting point.
RocksDB block_cache: depends on the workload.
Algorithm walkthrough
B-Tree insertion:
1. Find leaf
2. If full → split, push median up
3. Recursive up
LSM compaction:
1. L0 file count > threshold → merge into L1
2. L1 size > target → merge oldest into L2
...
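Step 2 of the B-Tree insertion can be made concrete with a leaf split. The sketch below assumes numeric keys and follows the B+-tree convention, where the separator is copied up to the parent and also stays in the right leaf (a classic B-Tree would move the median up instead). The function name and return shape are hypothetical.

```javascript
// Insert `key` into a sorted leaf. If the leaf overflows, split it in half
// and report the separator key the parent needs to store.
function insertIntoLeaf(keys, key, maxKeys) {
  const next = [...keys, key].sort((a, b) => a - b);
  if (next.length <= maxKeys) return { leaf: next };
  const mid = Math.floor(next.length / 2);
  return {
    left: next.slice(0, mid),
    right: next.slice(mid),
    separator: next[mid], // copied up; recursion continues if the parent is full too
  };
}

console.log(insertIntoLeaf([10, 20, 30], 25, 4));
// { leaf: [ 10, 20, 25, 30 ] }
console.log(insertIntoLeaf([10, 20, 30, 40], 25, 4));
// { left: [ 10, 20 ], right: [ 25, 30, 40 ], separator: 25 }
```

Each split writes two leaf pages plus the parent, which is the cost that random-key inserts keep paying.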
Modern variants
Fractal Tree: a B-Tree with write buffers attached to internal nodes (TokuDB).
Bw-Tree: a lock-free B-Tree variant (Hekaton, Microsoft).
Adaptive Radix Tree (ART): used in in-memory databases.
LSM with Bloom filters per level.
🤔 Decision criteria
| Workload | Engine |
|---|---|
| OLTP (banking, orders) | B-Tree (Postgres / InnoDB) |
| Time-series / logs | LSM (Cassandra) |
| Write-heavy + range | LSM (RocksDB) |
| Mostly read | B-Tree |
| Embedded | LevelDB (LSM) / SQLite (B-Tree) |
| Distributed write | LSM (Cassandra / ScyllaDB) |
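❌ Anti-patterns
- Heavy random-key inserts into a B-Tree: page splits explode. Use time-ordered keys such as UUID v7.
- Frequent overwrites of short values in an LSM: write amp is high. Consider a different storage engine.
- Compaction turned off in an LSM: read amp explodes.
- Vacuum turned off in a B-Tree engine: bloat.
- Bloom filters turned off in an LSM: every read may have to check every SSTable.
- Ignoring cache size: reads hit the disk too often.
- Using an LSM database under B-Tree assumptions: the trade-offs go unnoticed.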
🤖 LLM usage hints
- Postgres / MySQL = B-Tree (in most cases).
- Cassandra / RocksDB = LSM (write-heavy).
- Knowing which engine sits underneath makes tuning more accurate.