| id | title | category | status | source_trust_level | verification_status | created_at | updated_at | tags | tech_stack | applied_in | aliases |
|---|---|---|---|---|---|---|---|---|---|---|---|
| cs-quorum-consensus | Quorum / Consensus - Paxos / Raft / Dynamo style | Coding | draft | B | conceptual | 2026-05-09 | 2026-05-09 | | | | |
Quorum / Consensus
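The core algorithms of distributed systems: Paxos (foundational), Raft (modern), Dynamo (eventually consistent). Quorum = agreement from floor(N/2) + 1 nodes (a majority).
📖 Key concepts
- N: total number of replicas.
- W: write quorum (W replicas must commit).
- R: read quorum (R replicas must respond).
- R + W > N → every read quorum overlaps every write quorum (strong consistency).
💻 Code patterns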
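Quorum formula (R + W > N)
N = 5 (5 replicas).
Strong:
- W = 3, R = 3 → R + W = 6 > 5. Strong consistency.
- 2 replicas down is OK (3 still form a quorum).
Eventual:
- W = 1, R = 1 → a read may miss the latest write.
- Fastest.
Trade-offs:
- W = 3, R = 1 (read-heavy): fast reads, but R + W = 4 ≤ 5, so reads can be stale. Keeping the overlap with R = 1 requires W = 5.
- W = 1, R = 3 (write-heavy): fast writes, with the same caveat. Keeping the overlap with W = 1 requires R = 5.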
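The arithmetic above can be checked mechanically. The helper below is a minimal sketch; `analyzeQuorum` and its field names are hypothetical and not part of any library.

```javascript
// Hypothetical helper: classify a quorum configuration for n replicas.
function analyzeQuorum(n, w, r) {
  const valid = [n, w, r].every(Number.isInteger) && w >= 1 && r >= 1 && w <= n && r <= n;
  if (!valid) throw new RangeError('expected integers with 1 <= w, r <= n');
  return {
    // True when every read quorum shares at least one replica with every write quorum.
    overlaps: r + w > n,
    // Replicas that may be down while writes still reach W acks.
    writeTolerates: n - w,
    // Replicas that may be down while reads still reach R replies.
    readTolerates: n - r,
    // Size of a simple majority.
    majority: Math.floor(n / 2) + 1,
  };
}

console.log(analyzeQuorum(5, 3, 3)); // overlapping quorums, 2 replicas may be down
console.log(analyzeQuorum(5, 3, 1)); // no overlap: R + W = 4
```

With N = 5, only configurations such as (W=3, R=3), (W=5, R=1), or (W=1, R=5) report `overlaps: true`.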
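Dynamo (Cassandra style)
// The coordinator sends the write to all replicas (5 here).
// The write succeeds once at least W (= 3) replicas acknowledge it.
// `replicas` and `W` are assumed to be defined by the surrounding code.
async function write(key, value) {
  // Note: allSettled waits for every replica to answer or fail, not just the first W.
  const responses = await Promise.allSettled(
    replicas.map(r => r.write(key, value))
  );
  const success = responses.filter(r => r.status === 'fulfilled').length;
  if (success < W) throw new Error(`write failed: ${success}/${W} acks`);
  // Replicas that missed the write catch up in the background.
  return success;
}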
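A coordinator usually does not need to wait for the slowest replica. The following is a minimal sketch of returning as soon as W acknowledgements arrive; `waitForAcks` and `quorumWrite` are hypothetical names, and each replica is assumed to expose a `write(key, value)` method that returns a promise.

```javascript
// Resolve once `needed` promises have fulfilled; reject as soon as that
// can no longer happen. Hypothetical helper, not a library API.
function waitForAcks(pending, needed) {
  return new Promise((resolve, reject) => {
    if (needed <= 0) return resolve(0);
    if (pending.length < needed) return reject(new Error('quorum unreachable'));
    let acks = 0;
    let failures = 0;
    for (const p of pending) {
      Promise.resolve(p).then(
        () => {
          acks += 1;
          if (acks === needed) resolve(acks);
        },
        () => {
          failures += 1;
          // Too many failures: the remaining replicas cannot reach the quorum.
          if (failures > pending.length - needed) reject(new Error('quorum unreachable'));
        }
      );
    }
  });
}

// Assumed replica interface: write(key, value) returns a promise.
async function quorumWrite(replicas, w, key, value) {
  return waitForAcks(replicas.map(r => r.write(key, value)), w);
}
```

The remaining writes keep running after the quorum is reached, which is what lets slower replicas catch up in the background.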
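Read repair
// `replicas`, `R`, and `mostRecent` are assumed to be defined elsewhere.
async function read(key) {
  const targets = replicas.slice(0, R);
  const responses = await Promise.allSettled(targets.map(r => r.read(key)));
  // Keep only successful reads, remembering which replica answered.
  const replies = responses
    .map((res, i) => ({ res, replica: targets[i] }))
    .filter(({ res }) => res.status === 'fulfilled');
  if (replies.length < R) throw new Error('read failed');
  // Compare the versions returned by each replica (by vector clock / timestamp).
  // mostRecent is assumed to return one of the values it was given.
  const latest = mostRecent(replies.map(({ res }) => res.value));
  // Update stale replicas in the background (read repair).
  for (const { res, replica } of replies) {
    if (res.value !== latest) {
      replica.write(key, latest, { background: true }).catch(() => {});
    }
  }
  return latest;
}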
Hinted handoff
A write with W=3.
One replica is down.
The coordinator has another node store the write temporarily (a "hint").
When the down replica comes back up, the hint is transferred to it.
→ Availability ↑.
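The flow above can be sketched with an in-memory model. Everything here is hypothetical (`makeStorageNode`, `writeWithHints`, `deliverHints`). The sketch also assumes a Dynamo-style sloppy quorum in which a stored hint counts toward W; some systems do not count hints.

```javascript
// Hypothetical in-memory node: `data` is the real store, `hints` holds
// writes parked on behalf of other nodes.
function makeStorageNode(name) {
  return { name, up: true, data: new Map(), hints: [] };
}

// Write to every replica. For a replica that is down, park the write on a
// live fallback node as a hint addressed to the intended owner.
function writeWithHints(replicas, fallbacks, key, value, w) {
  let acks = 0;
  for (const replica of replicas) {
    if (replica.up) {
      replica.data.set(key, value);
      acks += 1;
      continue;
    }
    const standIn = fallbacks.find(n => n.up);
    if (standIn) {
      standIn.hints.push({ owner: replica.name, key, value });
      acks += 1; // assumption: sloppy quorum, the hint counts toward W
    }
  }
  if (acks < w) throw new Error('write failed');
  return acks;
}

// Once the owner is back up, the fallback replays its hints and drops them.
function deliverHints(fallback, owner) {
  if (!owner.up) return 0;
  const mine = fallback.hints.filter(h => h.owner === owner.name);
  for (const h of mine) owner.data.set(h.key, h.value);
  fallback.hints = fallback.hints.filter(h => h.owner !== owner.name);
  return mine.length;
}
```

A real implementation would also bound how long hints are kept and compare versions before replaying them.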
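Paxos (foundational)
Phase 1 (Prepare):
- The proposer sends prepare with proposal number n.
- An acceptor promises if n is greater than the highest number it has seen, and reports any value it has already accepted.
Phase 2 (Accept):
- The proposer sends accept(n, value), reusing the value of the highest-numbered accepted proposal reported in Phase 1, if any.
- Once a majority of acceptors accept, the value is chosen.
Phase 3 (Learn):
- Learners are told the chosen value.
→ Each round decides one value. Complex.
→ Multi-Paxos chains rounds to decide a series of values.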
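The two core phases can be sketched as a single-process, single-decree model. This is a simplification: there is no networking, persistence, or failure handling, and the `Acceptor` class and `propose` function are hypothetical names.

```javascript
class Acceptor {
  constructor() {
    this.promised = -1;        // highest proposal number promised so far
    this.acceptedN = -1;       // number of the accepted proposal, if any
    this.acceptedValue = undefined;
  }
  // Phase 1: promise to ignore anything numbered below n.
  prepare(n) {
    if (n <= this.promised) return { ok: false };
    this.promised = n;
    return { ok: true, acceptedN: this.acceptedN, acceptedValue: this.acceptedValue };
  }
  // Phase 2: accept unless a higher-numbered promise was made.
  accept(n, value) {
    if (n < this.promised) return { ok: false };
    this.promised = n;
    this.acceptedN = n;
    this.acceptedValue = value;
    return { ok: true };
  }
}

function propose(acceptors, n, value) {
  const majority = Math.floor(acceptors.length / 2) + 1;
  const promises = acceptors.map(a => a.prepare(n)).filter(p => p.ok);
  if (promises.length < majority) return { chosen: false };
  // Safety rule: adopt the value of the highest-numbered accepted proposal.
  const prior = promises
    .filter(p => p.acceptedN >= 0)
    .sort((x, y) => y.acceptedN - x.acceptedN)[0];
  const v = prior ? prior.acceptedValue : value;
  const accepted = acceptors.filter(a => a.accept(n, v).ok).length;
  return accepted >= majority ? { chosen: true, value: v } : { chosen: false };
}
```

Once a value has been chosen, a later proposal with a higher number still ends up choosing the same value. That is the property that makes the decision stable.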
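Raft (modern, simple)
3 roles:
- Leader: receives writes, replicates them to followers.
- Follower: appends the leader's entries.
- Candidate: runs for leader during an election.
An entry is committed once a majority of the N nodes (leader included) has stored it (W = floor(N/2) + 1).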
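Raft election
If the leader stops sending heartbeats:
1. A follower times out (randomized, 150-300 ms).
2. It becomes a candidate.
3. It sends RequestVote to the other nodes.
4. If a majority votes for it, it becomes leader.
Term: the number of each election; it increases with every election.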
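The election steps can be sketched as follows. This is a simplified, synchronous model under stated assumptions: the node shape and the names `makeRaftNode`, `handleRequestVote`, `runElection`, and `electionTimeoutMs` are hypothetical, and there are no timers, networking, or persistence.

```javascript
function makeRaftNode(id) {
  return { id, currentTerm: 0, votedFor: null, lastLogIndex: 0, lastLogTerm: 0 };
}

// Randomized timeout in [150, 300) ms so that followers rarely time out together.
function electionTimeoutMs(rand = Math.random) {
  return 150 + Math.floor(rand() * 150);
}

// A node grants at most one vote per term, and only to a candidate whose
// log is at least as up to date as its own.
function handleRequestVote(node, req) {
  if (req.term < node.currentTerm) return { term: node.currentTerm, granted: false };
  if (req.term > node.currentTerm) {
    node.currentTerm = req.term;
    node.votedFor = null;
  }
  const upToDate =
    req.lastLogTerm > node.lastLogTerm ||
    (req.lastLogTerm === node.lastLogTerm && req.lastLogIndex >= node.lastLogIndex);
  const granted = upToDate && (node.votedFor === null || node.votedFor === req.candidateId);
  if (granted) node.votedFor = req.candidateId;
  return { term: node.currentTerm, granted };
}

// The candidate bumps its term, votes for itself, and needs a majority.
function runElection(candidate, peers) {
  candidate.currentTerm += 1;
  candidate.votedFor = candidate.id;
  const req = {
    term: candidate.currentTerm,
    candidateId: candidate.id,
    lastLogIndex: candidate.lastLogIndex,
    lastLogTerm: candidate.lastLogTerm,
  };
  let votes = 1;
  for (const peer of peers) {
    if (handleRequestVote(peer, req).granted) votes += 1;
  }
  return votes >= Math.floor((peers.length + 1) / 2) + 1;
}
```

The one-vote-per-term rule is what prevents two leaders from being elected in the same term.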
Raft log replication
Client → Leader → AppendEntries(log) → Followers.
Followers ack → the leader commits.
Leader → the next heartbeat carries the commit index.
Followers apply the committed entries.
→ Every node applies the same sequence.
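How the leader decides what is committed can be sketched like this. The leader tracks, for each follower, the highest log index known to be stored there (`matchIndex`). The state shape and function names are hypothetical, and log indexes are 1-based as in the Raft paper.

```javascript
// Highest index stored on a majority of nodes (leader included).
function majorityMatchIndex(leaderLastIndex, followerMatchIndex) {
  const all = [leaderLastIndex, ...followerMatchIndex].sort((x, y) => y - x);
  const majority = Math.floor(all.length / 2) + 1;
  return all[majority - 1];
}

// Advance commitIndex only for entries of the leader's current term;
// older entries become committed indirectly once such an entry commits.
function advanceCommitIndex(leader) {
  const candidate = majorityMatchIndex(leader.log.length, leader.matchIndex);
  if (candidate > leader.commitIndex &&
      leader.log[candidate - 1].term === leader.currentTerm) {
    leader.commitIndex = candidate;
  }
  return leader.commitIndex;
}
```

For a 5-node cluster where the leader holds 4 entries and the followers hold 4, 3, 1, and 0, index 3 is the highest one present on three nodes, so it is the commit candidate.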
Raft implementations
- etcd (CoreOS / Kubernetes)
- Consul (HashiCorp)
- TiKV / CockroachDB / YugabyteDB
- RAFTKE (Rust)
- nuraft (C++)
When to use?
- Distributed lock (etcd)
- Service discovery (Consul)
- Distributed DB (CockroachDB, TiDB)
- Configuration store (ZooKeeper, etcd)
vs Paxos
Paxos: the earliest, complex.
Raft: equivalent guarantees, easier to understand.
→ Modern systems mostly use Raft.
Byzantine fault tolerance (BFT)
Ordinary fault: a node crashes.
Byzantine fault: a node lies (malicious).
→ Paxos / Raft do not handle this (crash faults only).
PBFT and Tendermint are BFT protocols.
Blockchains are usually Byzantine fault tolerant.
CAP theorem
Consistency vs Availability vs Partition tolerance.
Only two of the three can be guaranteed at once.
CP: Consistency + Partition tolerance (e.g. HBase, MongoDB).
AP: Availability + Partition tolerance (e.g. Cassandra, DynamoDB).
CA: not achievable in practice, because networks do partition.
During a network partition
CP: the minority partition rejects requests (no reads or writes).
AP: both partitions keep serving reads and writes, then reconcile afterwards (CRDTs etc.).
→ Real-world systems mix AP and CP.
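As one illustration of reconciling after a partition heals, here is a minimal sketch of a grow-only counter (G-Counter), one of the simplest CRDTs. The function names are hypothetical. Each node increments only its own slot, and merging takes the per-node maximum.

```javascript
// A counter is a plain object mapping node id -> that node's local count.
function increment(counter, nodeId, by = 1) {
  return { ...counter, [nodeId]: (counter[nodeId] || 0) + by };
}

// Merge is commutative, associative, and idempotent, so replicas can
// exchange state in any order and still converge.
function merge(a, b) {
  const out = { ...a };
  for (const [id, n] of Object.entries(b)) {
    out[id] = Math.max(out[id] || 0, n);
  }
  return out;
}

function value(counter) {
  return Object.values(counter).reduce((sum, n) => sum + n, 0);
}

// Two sides of a partition accept increments independently...
const left = increment(increment({}, 'a'), 'a'); // { a: 2 }
const right = increment({}, 'b', 3);             // { b: 3 }
// ...and agree on the total after exchanging state.
console.log(value(merge(left, right)));
```

Data types that need deletes or overwrites require richer CRDTs or an explicit conflict resolution policy.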
Strong vs eventual
Strong: every read sees the preceding writes.
Linearizable: real-time order is preserved.
Sequential: each process's order is preserved.
Causal: causality is preserved.
Eventual: replicas converge eventually (no time bound).
Read-your-writes
After a user writes, an immediate read by that user sees the write.
Implementation options:
- Sticky session (same replica).
- Cache the value after the write is acked.
- Eventually consistent store + tracking the latest version per user.
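The per-user tracking option can be sketched as follows. The `Session` class is hypothetical. Replicas are modeled as Maps from key to `{ version, value }`, and versions are assumed to increase monotonically per key.

```javascript
// Hypothetical session token: remembers the highest version this user wrote.
class Session {
  constructor() {
    this.lastWritten = new Map(); // key -> version
  }
  recordWrite(key, version) {
    const known = this.lastWritten.get(key) || 0;
    this.lastWritten.set(key, Math.max(known, version));
  }
  // Return the value from the first replica that is at least as new as
  // the user's own latest write to this key.
  read(key, replicas) {
    const minVersion = this.lastWritten.get(key) || 0;
    for (const replica of replicas) {
      const entry = replica.get(key);
      if (entry && entry.version >= minVersion) return entry.value;
    }
    throw new Error('no replica has caught up with this session yet');
  }
}
```

Other users may still see the older value. The guarantee is scoped to the session that performed the write.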
Quorum pitfalls
- W = N (all replicas): one replica down = writes fail. Brittle.
- W = 1: reads can be stale (unless R = N).
- W = R = 1, N = 3: fastest, weakest.
- W = 3, R = 3, N = 5: strong, and tolerates 2 replicas down.
Network partitions in practice
Split-brain: two partitions each elect their own leader.
- Raft prevents this (terms + majority votes).
- Manual recovery is sometimes still needed.
→ Production tip for Consul / etcd: run 5 or 7 nodes (an odd count).
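The reason for an odd count follows from majority arithmetic. A small sketch (`faultTolerance` is a hypothetical helper):

```javascript
// For a cluster of n voting nodes, a majority is floor(n/2) + 1, and the
// cluster stays available as long as a majority is reachable.
function faultTolerance(n) {
  const majority = Math.floor(n / 2) + 1;
  return { nodes: n, majority, tolerated: n - majority };
}

for (const n of [3, 4, 5, 6, 7]) {
  const { majority, tolerated } = faultTolerance(n);
  console.log(`${n} nodes: majority ${majority}, tolerates ${tolerated} failure(s)`);
}
```

Going from 3 to 4 nodes (or 5 to 6) raises the majority size without raising the number of tolerated failures, so the extra node adds replication cost but no extra fault tolerance.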
Single leader vs leaderless
Single leader (Raft, Paxos):
- Simple to reason about.
- Bottleneck (every write goes through the leader).
Leaderless (Dynamo):
- Any node can accept a write.
- Conflict resolution is required.
- High throughput.
→ Trade-off.
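CockroachDB / Spanner
Range = 64 MB (the default in older CockroachDB versions; newer releases use a larger default).
Each range has its own consensus group (Raft in CockroachDB, Paxos in Spanner).
1000 ranges = 1000 groups, each with its own leader (parallel writes).
→ This is the key to scaling.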
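Distributed lock (Raft style)
// etcd-style lock, written as pseudocode: the method names below are
// illustrative and do not match a specific client library.
const lease = await client.lease.grant(10); // lease with a 10 s TTL
// Create the key only if it does not exist yet; a plain put would overwrite the current holder.
const acquired = await client.kv.putIfAbsent('lock', 'value', { lease });
// If not acquired, wait until the key is deleted (release or lease expiry), then retry.
if (!acquired) await client.watch.untilDeleted('lock');
→ etcd supports leases, transactions, and watches natively.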
Failure modes
- Network slow → timeout / retry.
- Network partition → split-brain (rare).
- Node crash → leader re-election.
- Disk full → write fail.
- Clock skew → timestamp ordering and leases become unreliable (use HLC).
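For reference, a hybrid logical clock (HLC) pairs a physical timestamp `l` with a logical counter `c`, so that causally related events stay ordered even when wall clocks disagree. The class below is a minimal sketch of the usual update rules; the class and method names are hypothetical.

```javascript
class HybridLogicalClock {
  // physicalNow is injectable so the clock can be tested deterministically.
  constructor(physicalNow = () => Date.now()) {
    this.now = physicalNow;
    this.l = 0; // highest physical time seen so far
    this.c = 0; // logical counter to break ties within the same l
  }
  // Local event or message send.
  tick() {
    const prev = this.l;
    this.l = Math.max(prev, this.now());
    this.c = this.l === prev ? this.c + 1 : 0;
    return { l: this.l, c: this.c };
  }
  // Message receive: merge the remote timestamp into the local clock.
  receive(remote) {
    const prev = this.l;
    this.l = Math.max(prev, remote.l, this.now());
    if (this.l === prev && this.l === remote.l) this.c = Math.max(this.c, remote.c) + 1;
    else if (this.l === prev) this.c += 1;
    else if (this.l === remote.l) this.c = remote.c + 1;
    else this.c = 0;
    return { l: this.l, c: this.c };
  }
}

// Total order on timestamps: physical part first, then the counter.
function compareHlc(a, b) {
  return a.l - b.l || a.c - b.c;
}
```

A node whose wall clock is behind still produces timestamps greater than those of any message it has already received.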
Monitoring
- Leader changes (frequent changes indicate a problem).
- Log lag (how far followers are behind the leader).
- Quorum health (number of nodes down).
- Apply latency.
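Gossip protocol (a different mechanism)
Each node sends information to randomly chosen peers.
- Used by Cassandra / Consul / Riak.
- Information spreads exponentially: all N nodes are reached in roughly O(log N) rounds.
- Eventually consistent.
→ Used for membership / failure detection.
→ Not the same thing as consensus.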
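The spreading behavior can be observed with a small simulation. This is a sketch of push-style gossip in which every informed node tells one random peer per round; `gossipRounds` is a hypothetical name, and real protocols add fan-out, pull, and failure detection.

```javascript
// Simulate push gossip among n nodes, starting with node 0 informed.
// Returns how many rounds it took and how many nodes ended up informed.
function gossipRounds(n, rand = Math.random, maxRounds = 1000) {
  const informed = new Set([0]);
  let rounds = 0;
  while (informed.size < n && rounds < maxRounds) {
    const targets = [];
    for (const node of informed) {
      // Pick a uniformly random peer other than the node itself.
      let peer = Math.floor(rand() * (n - 1));
      if (peer >= node) peer += 1;
      targets.push(peer);
    }
    for (const peer of targets) informed.add(peer);
    rounds += 1;
  }
  return { rounds, informed: informed.size };
}

console.log(gossipRounds(64)); // the round count varies from run to run
```

Because the informed set can at most double per round, 64 nodes need at least 6 rounds. In practice a few more are needed, since some messages land on nodes that already know.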
Two-phase commit (2PC)
Coordinator + N participants.
Phase 1: prepare (lock + log).
Phase 2: commit / abort (every participant acks).
→ Cross-DB transactions.
"Commit only if every participant votes OK."
Pitfalls:
- Participants stay blocked if the coordinator goes down.
- Very slow.
- Rarely used in large systems.
→ Saga is the modern alternative.
→ Backend_Saga_Choreography_vs_Orchestration.
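For comparison with sagas, the 2PC coordinator logic can be sketched in a few lines. This is a synchronous, in-memory simplification: `twoPhaseCommit` and the participant interface (`prepare`, `commit`, `abort`) are hypothetical, and there is no durable log or timeout handling, which is exactly where real 2PC gets hard.

```javascript
// Each participant is assumed to expose prepare() -> boolean, commit(), abort().
function twoPhaseCommit(participants) {
  const prepared = [];
  // Phase 1: collect votes. A thrown error counts as a "no" vote.
  for (const p of participants) {
    let vote = false;
    try {
      vote = p.prepare() === true;
    } catch {
      vote = false;
    }
    if (!vote) {
      // Roll back everyone who already prepared (and holds locks).
      for (const q of prepared) q.abort();
      return 'aborted';
    }
    prepared.push(p);
  }
  // Phase 2: every participant voted yes, so the decision is commit.
  for (const p of participants) p.commit();
  return 'committed';
}
```

If the coordinator crashes between the two phases, prepared participants cannot decide on their own. That is the blocking problem listed in the pitfalls.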
Real-world
- etcd: the brain of K8s.
- Consul: service mesh.
- ZooKeeper: legacy (used by older Kafka versions).
- TiKV / CockroachDB: distributed SQL.
- Apache BookKeeper: log.
- Kafka: its own KRaft (replaces ZK).
🤔 Decision criteria
| Situation | Recommendation |
|---|---|
| Strong consistency | Raft (etcd, CockroachDB) |
| Eventually consistent | Dynamo / Cassandra |
| Distributed lock | etcd / Consul |
| Service discovery | Consul / etcd |
| BFT | Tendermint / blockchain |
| Small system | Single-node DB |
| Cross-DB transaction | Saga (NOT 2PC) |
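❌ Anti-patterns
- Even node count (4): tolerates no more failures than 3, and a 2/2 partition leaves neither side with a majority.
- W = N: one replica down = writes fail.
- Assuming synchronized wall clocks across nodes: use HLC.
- 2PC in a large system: use an alternative (saga).
- Manual leader election: breaks often.
- No monitoring: failures stay silent.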
🤖 LLM usage hints
- Raft is the modern (easier) counterpart of Paxos.
- Dynamo style = AP (eventual).
- R + W > N is the strong consistency rule.
- Odd node count (3, 5, 7).