id: cs-btree-lsm-storage
title: B-Tree vs LSM-Tree: Storage Engines
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: cs, storage, btree, lsm, vibe-coding
tech_stack: language = Concept, applicable_to = Database
applied_in: (none)
aliases: B-Tree, LSM-Tree, RocksDB, Postgres, MyISAM, write amplification, read amplification

B-Tree vs LSM-Tree

The two main families of database storage engine. B-Tree (Postgres / MySQL InnoDB): fast reads, in-place updates. LSM-Tree (RocksDB / Cassandra / ScyllaDB): fast writes, append-only. The trade-off is among read amplification, write amplification, and space amplification.

📖 Key concepts

  • B-Tree: balanced tree, updated in place.
  • LSM: write → memtable → SSTable (immutable) → compaction.
  • Read amplification: one logical read inspects N files or pages.
  • Write amplification: one logical write causes N writes to disk.
  • Space amplification: bytes on disk relative to live data (extra copies, stale versions, compression).
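
The three amplification factors are plain ratios, which a small sketch makes concrete. All counter names here (`userBytesWritten`, `storageProbes`, and so on) are hypothetical, not fields exposed by any particular engine.

```javascript
// Amplification factors as ratios of physical work to logical work.
// Every input is a hypothetical counter collected over the same time window.
function amplification(counters) {
  return {
    // bytes the engine wrote to disk per byte the application wrote
    write: counters.diskBytesWritten / counters.userBytesWritten,
    // files or pages inspected per logical read
    read: counters.storageProbes / counters.logicalReads,
    // bytes occupied on disk per byte of live data
    space: counters.diskBytesUsed / counters.liveDataBytes,
  };
}

const sample = amplification({
  userBytesWritten: 1000,
  diskBytesWritten: 12000,
  logicalReads: 10,
  storageProbes: 40,
  liveDataBytes: 1000,
  diskBytesUsed: 1500,
});
console.log(sample); // { write: 12, read: 4, space: 1.5 }
```

Lowering one factor usually raises another, which is the trade-off the rest of this note walks through.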

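💻 Code patterns

How a B-Tree works

Read:    Root → branch → leaf. O(log N) page seeks.
Write:   Modify the page in place (in practice: append to the WAL, then flush the dirty page later).
Delete:  Mark the entry dead inside the page; vacuum reclaims the space later (Postgres).

Pros: O(log N) reads, fast range scans, mature implementations.
Cons: Page splits are expensive, and a small random write rewrites a whole page.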
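
The root → branch → leaf read path can be sketched as follows. This is a minimal illustration, not any engine's implementation: it assumes a B+-Tree style layout (values only in leaves, separator keys in internal nodes), and the node shape `{ keys, children }` / `{ keys, values }` is hypothetical.

```javascript
// Descend from the root to a leaf, counting pages touched along the way.
// Internal node: { keys, children }. Leaf: { keys, values }.
function btreeSearch(root, key) {
  let node = root;
  let pagesVisited = 0;
  while (true) {
    pagesVisited++;
    if (!node.children) {
      // Leaf page: look for the exact key.
      const i = node.keys.indexOf(key);
      return { value: i === -1 ? undefined : node.values[i], pagesVisited };
    }
    // Internal page: child i holds keys < keys[i]; the last child holds the rest.
    let i = 0;
    while (i < node.keys.length && key >= node.keys[i]) i++;
    node = node.children[i];
  }
}

const makeLeaf = (keys) => ({ keys, values: keys.map((k) => `row-${k}`) });
const tree = {
  keys: [30, 60],
  children: [makeLeaf([10, 20]), makeLeaf([30, 40, 50]), makeLeaf([60, 70])],
};

console.log(btreeSearch(tree, 40)); // { value: 'row-40', pagesVisited: 2 }
console.log(btreeSearch(tree, 45)); // { value: undefined, pagesVisited: 2 }
```

The number of pages visited equals the tree height, which is why lookup cost grows only logarithmically with the number of keys.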

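How an LSM-Tree works

Write:
1. Add the entry to the memtable (in RAM, sorted)
2. When the memtable fills up, flush it to an SSTable (sorted, immutable)
3. Compaction: merge multiple SSTables into one

Read:
1. Check the memtable
2. Check the SSTables at each level, newest first (Bloom filters let the read skip SSTables that cannot contain the key)
3. Return the most recent version

Delete: write a tombstone. Compaction later removes the tombstone and the data it shadows.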
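
The write, read, and delete paths above fit in a few lines. This is a toy sketch under stated assumptions: `TinyLSM` and `TOMBSTONE` are hypothetical names, the memtable is a `Map` that is only sorted at flush time (real engines keep it sorted, for example in a skip list), SSTables live in memory, and lookups inside an SSTable are linear rather than index-driven.

```javascript
const TOMBSTONE = Symbol('tombstone');
const unwrap = (value) => (value === TOMBSTONE ? undefined : value);

class TinyLSM {
  constructor(memtableLimit = 2) {
    this.memtableLimit = memtableLimit;
    this.memtable = new Map();
    this.sstables = []; // newest first; each is a sorted array of [key, value]
  }

  put(key, value) {
    this.memtable.set(key, value);
    if (this.memtable.size >= this.memtableLimit) this.flush();
  }

  // A delete is just a write of a tombstone marker.
  delete(key) {
    this.put(key, TOMBSTONE);
  }

  // Freeze the memtable into a sorted, immutable SSTable.
  flush() {
    if (this.memtable.size === 0) return;
    const sorted = [...this.memtable].sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
    this.sstables.unshift(Object.freeze(sorted));
    this.memtable = new Map();
  }

  // Newest data wins: memtable first, then SSTables from newest to oldest.
  get(key) {
    if (this.memtable.has(key)) return unwrap(this.memtable.get(key));
    for (const table of this.sstables) {
      const hit = table.find(([k]) => k === key);
      if (hit) return unwrap(hit[1]);
    }
    return undefined;
  }
}

const store = new TinyLSM(2);
store.put('a', 1);
store.put('b', 2);   // memtable full → flushed to an SSTable
store.put('a', 3);   // newer version of 'a' shadows the flushed one
store.delete('b');   // tombstone; memtable full again → second SSTable
console.log(store.get('a'), store.get('b'), store.sstables.length); // 3 undefined 2
```

Note that the old values of `a` and `b` still occupy the older SSTable after the overwrite and the delete; only compaction reclaims them.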

Compaction strategy

Leveled (RocksDB):
- Level N+1 is ~10x the size of level N
- Low read amplification, high write amplification

Tiered (Cassandra):
- Merge similarly sized small SSTables within the same tier
- Low write amplification, high read amplification

Hybrid: ScyllaDB.
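
A rough sketch of what a compaction step does, plus how leveled size targets grow. The names `compact`, `levelTargets`, and `DELETED` are hypothetical, and the `Map`-based merge stands in for the streaming k-way merge a real engine would run over sorted files. Tombstones are dropped only when compacting into the bottom level, because an older level could otherwise still hold the value they shadow.

```javascript
const DELETED = Symbol('deleted');

// Merge SSTables (sorted [key, value] arrays) given newest first.
function compact(tablesNewestFirst, { bottomLevel = false } = {}) {
  const latest = new Map();
  for (const table of tablesNewestFirst) {
    for (const [key, value] of table) {
      // The first occurrence comes from the newest table; older versions are shadowed.
      if (!latest.has(key)) latest.set(key, value);
    }
  }
  return [...latest]
    .filter(([, value]) => !(bottomLevel && value === DELETED))
    .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
}

// Leveled compaction: each level's target is `multiplier` times the previous one.
function levelTargets(baseSize, levels, multiplier = 10) {
  return Array.from({ length: levels }, (_, i) => baseSize * multiplier ** i);
}

const newer = [['a', 3], ['b', DELETED]];
const older = [['a', 1], ['b', 2], ['c', 9]];

console.log(compact([newer, older], { bottomLevel: true })); // [ [ 'a', 3 ], [ 'c', 9 ] ]
console.log(levelTargets(256, 4)); // [ 256, 2560, 25600, 256000 ]
```

Leveled compaction rewrites data each time it moves down a level, which is where its higher write amplification comes from; tiered compaction merges less often and so leaves more overlapping SSTables for reads to check.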

B-Tree page structure

[ Page header | Key1 → Pointer1 | Key2 → Pointer2 | ... ]

Page size: typically 8KB (Postgres) / 16KB (MySQL InnoDB).
Fillfactor: e.g. 80%, which leaves free space in each page for UPDATEs (enables HOT updates in Postgres).
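
Page size and fillfactor together determine fanout, and fanout determines tree height. The back-of-the-envelope sketch below assumes a fixed entry size and the same fanout at every level; the 24-byte header and 16-byte entry are illustrative assumptions, not figures from this note.

```javascript
// Estimate entries per page and tree height for a given row count.
function btreeShape({ pageBytes, headerBytes, entryBytes, fillfactor, rowCount }) {
  const usableBytes = Math.floor((pageBytes - headerBytes) * fillfactor);
  const entriesPerPage = Math.floor(usableBytes / entryBytes);
  let levels = 1;
  let capacity = entriesPerPage; // keys addressable by a tree of `levels` levels
  while (capacity < rowCount) {
    capacity *= entriesPerPage;
    levels++;
  }
  return { entriesPerPage, levels };
}

console.log(
  btreeShape({ pageBytes: 8192, headerBytes: 24, entryBytes: 16, fillfactor: 0.9, rowCount: 10000000 })
); // { entriesPerPage: 459, levels: 3 }
```

With these assumed numbers, ten million keys fit in three levels, so a point lookup touches about three pages, and the upper levels are small enough to stay cached.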

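LSM SSTable layout

[ Header | Index | Bloom filter | Sorted key-value pairs | Footer ]

Index = sparse (every Nth key).
Bloom filter = a fast check for whether a key is definitely absent from this SSTable (false positives are possible, false negatives are not).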
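
The two SSTable lookup aids can be sketched as below. Both are illustrative only: `BloomFilter` uses seeded FNV-1a hashing as a stand-in for whatever hash functions a production engine uses, and `findBlock` assumes a hypothetical sparse index of `{ firstKey }` entries, one per data block.

```javascript
class BloomFilter {
  constructor(bits = 1024, hashes = 3) {
    this.bits = bits;
    this.hashes = hashes;
    this.bitmap = new Uint8Array(Math.ceil(bits / 8));
  }

  // One bit position per hash function (FNV-1a with a different seed each time).
  positions(key) {
    const out = [];
    for (let seed = 0; seed < this.hashes; seed++) {
      let h = (0x811c9dc5 ^ seed) >>> 0;
      for (let i = 0; i < key.length; i++) {
        h ^= key.charCodeAt(i);
        h = Math.imul(h, 0x01000193) >>> 0;
      }
      out.push(h % this.bits);
    }
    return out;
  }

  add(key) {
    for (const p of this.positions(key)) this.bitmap[p >> 3] |= 1 << (p & 7);
  }

  // false → key is definitely absent; true → key may be present.
  mightContain(key) {
    return this.positions(key).every((p) => (this.bitmap[p >> 3] & (1 << (p & 7))) !== 0);
  }
}

// Sparse index: binary-search for the last block whose first key is <= key.
// Returns -1 when the key sorts before every block.
function findBlock(sparseIndex, key) {
  let lo = 0;
  let hi = sparseIndex.length - 1;
  let found = -1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (sparseIndex[mid].firstKey <= key) {
      found = mid;
      lo = mid + 1;
    } else {
      hi = mid - 1;
    }
  }
  return found;
}

const filter = new BloomFilter();
filter.add('apple');
filter.add('banana');
console.log(filter.mightContain('apple')); // true

const sparseIndex = [{ firstKey: 'a' }, { firstKey: 'm' }, { firstKey: 't' }];
console.log(findBlock(sparseIndex, 'p')); // 1
```

A read consults the Bloom filter first; only when it answers "maybe" does the engine use the sparse index to pick a block and scan it.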

Write amplification in practice

Insert 1 byte → N bytes written to disk. Rough figures; they vary with workload and configuration.

B-Tree: typically 2-10x (page write + WAL).
LSM (leveled): 10-30x (compaction).
LSM (tiered): 5-15x.

Read amplification

Get key X →

B-Tree: O(log N) pages (the upper levels are usually served from cache).
LSM:    memtable plus several levels. Bloom filters help skip SSTables.

Space amplification

1GB of data →

B-Tree: 1GB + indexes. About 1.5x.
LSM:    1GB, reduced by compression, increased by tombstones and stale versions. About 1.1-2x, depending on how far compaction has caught up.

When to use which

B-Tree:
- OLTP (random reads + updates + deletes)
- Consistent read latency
- Frequent range queries
- Postgres / MySQL / SQLite

LSM:
- Write-heavy workloads (time series, logs)
- Fast ingestion
- Range scans are also fine
- Cassandra / RocksDB / LevelDB / ScyllaDB

Hybrid

Postgres (heap + WAL): B-Tree indexes, but the WAL adds a log-structured aspect.
ZFS / Btrfs: copy-on-write file systems with LSM-like aspects.

Tuning: Postgres B-Tree

-- Table fillfactor (for UPDATE-heavy tables)
ALTER TABLE x SET (fillfactor = 80);

-- Index fillfactor
CREATE INDEX ON x (col) WITH (fillfactor = 90);

-- Vacuum more often (prevents bloat)
ALTER TABLE x SET (autovacuum_vacuum_scale_factor = 0.05);

Tuning: RocksDB LSM

write_buffer_size:                    memtable size
max_write_buffer_number:              maximum number of memtables held in memory
level0_file_num_compaction_trigger:   number of L0 files that triggers compaction
target_file_size_base:                SSTable size
compression_per_level:                compression per level
bloom_filter_bits_per_key:            speeds up reads

Libraries: Node

// LevelDB via the `level` package (ES module, so top-level await works)
import { Level } from 'level';
const db = new Level('./db', { valueEncoding: 'json' });
await db.put('key', { value: 42 });
const v = await db.get('key');

// Range scan: keys come back in sorted order
for await (const [key, value] of db.iterator({ gte: 'a', lte: 'z' })) {
  console.log(key, value);
}

Sorted vs unsorted

B-Tree:  inherently sorted by key.
LSM:     sorted by key, so range scans work.
Hash:    unsorted (point lookups only, no range scans), e.g. Memcached, hash indexes.
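
Why sort order matters for range scans, as a small sketch. A sorted array stands in for a B-Tree leaf chain or an SSTable, and a `Map` stands in for a hash index; the function names are hypothetical.

```javascript
// First index whose key is >= target (binary search over sorted keys).
function lowerBound(sortedKeys, target) {
  let lo = 0;
  let hi = sortedKeys.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (sortedKeys[mid] < target) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

// Sorted structure: seek to the start of the range, then read sequentially.
function rangeScanSorted(sortedKeys, gte, lte) {
  const out = [];
  for (let i = lowerBound(sortedKeys, gte); i < sortedKeys.length && sortedKeys[i] <= lte; i++) {
    out.push(sortedKeys[i]);
  }
  return out;
}

// Hash index: no order to exploit, so every key must be examined.
function rangeScanHash(hashIndex, gte, lte) {
  return [...hashIndex.keys()].filter((k) => k >= gte && k <= lte).sort();
}

const sortedKeys = ['apple', 'banana', 'cherry', 'grape', 'melon'];
const hashIndex = new Map([['melon', 1], ['apple', 2], ['grape', 3], ['cherry', 4], ['banana', 5]]);

console.log(rangeScanSorted(sortedKeys, 'b', 'h')); // [ 'banana', 'cherry', 'grape' ]
console.log(rangeScanHash(hashIndex, 'b', 'h'));    // [ 'banana', 'cherry', 'grape' ]
```

Both return the same rows, but the sorted scan touches only the keys in range after one logarithmic seek, while the hash scan is a full pass over the index.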

Cache hierarchy

RAM (page cache / memtable) → SSD (data) → older SSD / HDD (cold data).

Postgres shared_buffers: 25% of RAM is the usual recommendation.
RocksDB block_cache: depends on the workload.

Algorithm walkthrough

B-Tree insertion:
1. Find the leaf
2. If it is full → split it and push the median key up
3. Repeat up the tree while the parent overflows
LSM compaction:
1. L0 file count > threshold → merge into L1
2. L1 size > target → merge oldest into L2
...
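
Both walkthroughs can be sketched in a few lines. These are simplified illustrations with hypothetical names: `insertIntoNode` shows a classic B-Tree split where the median moves up (a B+-Tree would also keep a copy in the right leaf), and `nextCompaction` applies the two trigger rules listed above to a hypothetical array of per-level statistics.

```javascript
// Insert a numeric key into a node; split when it overflows.
// Returns { keys } when it fits, or { left, median, right } when the parent must absorb the median.
function insertIntoNode(keys, key, maxKeys) {
  const next = [...keys, key].sort((a, b) => a - b);
  if (next.length <= maxKeys) return { keys: next };
  const mid = next.length >> 1;
  return { left: next.slice(0, mid), median: next[mid], right: next.slice(mid + 1) };
}

// Decide which compaction to run, checking L0 file count first, then level sizes.
// levels[0] = { files }, deeper levels = { bytes, targetBytes }.
function nextCompaction(levels, l0FileTrigger) {
  if (levels[0].files > l0FileTrigger) return { from: 0, to: 1 };
  for (let i = 1; i < levels.length - 1; i++) {
    if (levels[i].bytes > levels[i].targetBytes) return { from: i, to: i + 1 };
  }
  return null; // nothing to compact
}

console.log(insertIntoNode([10, 20, 30, 40], 25, 4));
// { left: [ 10, 20 ], median: 25, right: [ 30, 40 ] }

const levelStats = [
  { files: 2 },
  { bytes: 300, targetBytes: 256 },
  { bytes: 100, targetBytes: 2560 },
];
console.log(nextCompaction(levelStats, 4)); // { from: 1, to: 2 }
```

If absorbing the median overflows the parent, the same split repeats one level up, which is step 3 of the B-Tree walkthrough.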

Modern variants

Fractal Tree: B-Tree plus write buffers in internal nodes (TokuDB).
Bw-Tree:      lock-free B-Tree variant (Hekaton, Microsoft).
Adaptive Radix Tree (ART): for in-memory databases.
LSM with Bloom filters per level.

🤔 Decision criteria

Workload                  Engine
OLTP (banking, orders)    B-Tree (Postgres / InnoDB)
Time-series / logs        LSM (Cassandra)
Write-heavy + range       LSM (RocksDB)
Mostly read               B-Tree
Embedded                  LevelDB (LSM) / SQLite (B-Tree)
Distributed write         LSM (Cassandra / ScyllaDB)

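Anti-patterns

  • Heavy random-key inserts into a B-Tree: page splits explode. Prefer time-ordered keys such as UUID v7.
  • Frequent overwrites of small values on an LSM: high write amplification. Consider a different storage engine.
  • LSM with compaction disabled: read amplification explodes.
  • B-Tree (Postgres) with vacuum disabled: bloat.
  • LSM with Bloom filters disabled: every read checks every SSTable.
  • Ignoring cache size: frequent disk hits.
  • Using an LSM database under B-Tree assumptions: the trade-offs go unnoticed.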

🤖 LLM usage hints

  • Postgres / MySQL = B-Tree (in most cases).
  • Cassandra / RocksDB = LSM (write-heavy).
  • Knowing which engine is underneath makes tuning precise.

🔗 Related documents