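---
id: cs-quorum-consensus
title: Quorum / Consensus - Paxos / Raft / Dynamo style
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [cs, consensus, distributed, vibe-coding]
tech_stack: { language: "TS / Go", applicable_to: ["Backend", "CS"] }
applied_in: []
aliases: [quorum, consensus, Paxos, Raft, R+W>N, Dynamo, distributed agreement]
---

# Quorum / Consensus

> The core algorithms of distributed systems: **Paxos (foundational), Raft (modern), Dynamo (eventually consistent)**. A majority quorum means ⌊N/2⌋ + 1 nodes agree.

## 📖 Core concepts

- N: total number of replicas.
- W: write quorum (a write succeeds once W replicas acknowledge it).
- R: read quorum (a read consults R replicas).
- R + W > N → every read quorum overlaps every write quorum, so a read reaches at least one replica that holds the latest acknowledged write (strong consistency in the quorum sense).

## 💻 Code patterns

### Quorum rule (R + W > N)
```
N = 5 (5 replicas).

Strong:
- W = 3, R = 3 → R + W = 6 > 5. Strong consistency.
- 2 nodes down is OK (the remaining 3 still form a quorum).

Eventual:
- W = 1, R = 1 → R + W = 2 ≤ 5. A read may miss the latest write.
- Fastest.

Trade-offs (R + W = 4 ≤ 5, so stale reads are still possible):
- W = 3, R = 1 (read-heavy).
- W = 1, R = 3 (write-heavy).
```

### Dynamo (Cassandra style)
```ts
// Assumed for this sketch: `replicas` and `W` were not defined in the original.
interface Replica { write(key: string, value: string): Promise<void> }
declare const replicas: Replica[]; // the 5 replicas that own the key
declare const W: number;           // write quorum, e.g. 3

// The coordinator sends the write to all 5 replicas;
// it succeeds once at least W = 3 of them acknowledge.
async function write(key: string, value: string): Promise<void> {
  // Simplification: allSettled waits for every replica to answer or fail,
  // whereas a real coordinator returns as soon as W acks arrive.
  const responses = await Promise.allSettled(
    replicas.map(r => r.write(key, value))
  );
  const success = responses.filter(r => r.status === 'fulfilled').length;
  if (success < W) throw new Error(`write failed: ${success}/${W} acks`);
  // The remaining replicas catch up in the background.
}
```

### Read repair
```ts
// Assumed shapes for this sketch; `mostRecent` is a hypothetical helper that
// picks the newest version (by vector clock or timestamp).
interface Versioned { value: string; version: number }
interface Replica {
  read(key: string): Promise<Versioned>;
  write(key: string, v: Versioned, opts?: { background: boolean }): Promise<void>;
}
declare const replicas: Replica[];
declare const R: number; // read quorum
declare function mostRecent(values: Versioned[]): Versioned;

async function read(key: string): Promise<string> {
  const targets = replicas.slice(0, R);
  const responses = await Promise.allSettled(targets.map(r => r.read(key)));
  // Drop failed reads, keeping each answer paired with its replica.
  const answers = responses.flatMap((r, i) =>
    r.status === 'fulfilled' ? [{ replica: targets[i], data: r.value }] : []);
  if (answers.length < R) throw new Error('read quorum not reached');
  const latest = mostRecent(answers.map(a => a.data));
  // Read repair: update replicas that returned an older version (fire and forget).
  for (const { replica, data } of answers) {
    if (data.version === latest.version) continue;
    replica.write(key, latest, { background: true }).catch(() => {});
  }
  return latest.value;
}
```

### Hinted handoff
```
A write with W = 3 while 1 replica is down:
the coordinator has another node store the write temporarily (a "hint").
When the down replica comes back up, the hint is transferred to it.
→ Availability ↑.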
```

### Paxos (foundational)
```
Phase 1 (Prepare):
- The proposer sends Prepare with proposal number n.
- An acceptor promises if n is greater than any number it has seen,
  and reports any value it has already accepted.

Phase 2 (Accept):
- The proposer sends Accept(n, value), reusing the highest-numbered
  value reported in phase 1 if there is one.
- The value is chosen once a majority of acceptors accept it.

Phase 3 (Learn):
- Learners find out the chosen value.

→ Each round decides a single value. Complex.
```
→ Multi-Paxos chains rounds to agree on a series of values.

### Raft (modern, simple)
```
3 roles:
- Leader: receives writes and replicates them to followers.
- Follower: appends the leader's entries.
- Candidate: a follower running for leader during an election.

An entry is committed once a majority of nodes (leader included)
have stored it (W = ⌊N/2⌋ + 1).
```

### Raft election
```
If the leader stops sending heartbeats:
1. A follower times out (randomized, e.g. 150-300 ms).
2. It becomes a candidate and increments the term.
3. It sends RequestVote to the other nodes.
4. If a majority votes for it, it becomes leader.

Term: a number that increases with every election.
```

### Raft log replication
```
Client → Leader → AppendEntries(log) → Followers.
Followers ack → the leader commits.
Leader → the next heartbeat carries the commit index.
Followers apply the committed entries.
```
→ Every node applies the same sequence.

### Raft implementations
```
- etcd (CoreOS / Kubernetes)
- Consul (HashiCorp)
- TiKV / CockroachDB / YugabyteDB
- raft-rs (Rust, from the TiKV project)
- NuRaft (C++)
```

### When to use?
```
- Distributed lock (etcd)
- Service discovery (Consul)
- Distributed DB (CockroachDB, TiDB)
- Configuration store (etcd; ZooKeeper fills the same role with ZAB, a related protocol)
```

### vs Paxos
```
Paxos: the first, complex.
Raft: equivalent guarantees, easier to understand.
→ Modern systems mostly pick Raft.
```

### Byzantine fault tolerance (BFT)
```
Ordinary fault: a node crashes.
Byzantine fault: a node lies (malicious).

→ Paxos / Raft do not handle this (crash faults only).

PBFT and Tendermint are BFT.
Blockchains are (usually) BFT.
```

### CAP theorem
```
Consistency vs Availability vs Partition tolerance.
Pick 2 (all 3 at once is not possible).

CP: Consistency + Partition tolerance (e.g. HBase, MongoDB).
AP: Availability + Partition tolerance (e.g. Cassandra, DynamoDB).
CA: not realistic (networks do partition).
```

### During a network partition
```
CP: the minority partition rejects requests (no reads/writes).
AP: both partitions accept reads/writes. They reconcile afterwards (CRDTs, etc.).
```
→ Real-world systems mix AP and CP.

### Strong vs eventual
```
Strong: every read sees the preceding write.

Linearizable: preserves real-time order.
Sequential: preserves each process's order.
Causal: preserves causality.
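Eventual: replicas converge eventually (no time bound).
```

### Read-your-write
```
After writing, a user's immediate read sees their own write.

Implementation:
- Sticky session (same replica).
- Cache after the write is acknowledged.
- Eventually consistent + track the latest version per user.
```

### Quorum pitfalls
```
- W = N (all replicas): 1 node down = writes fail. Brittle.
- W = 1: reads can be stale (unless R = N).
- W = R = 1, N = 3: fastest, weakest.
- W = 3, R = 3, N = 5: strong + tolerates 2 nodes down.
```

### Network partitions in practice
```
Split-brain: two partitions each elect their own leader.
- Raft prevents this (term + majority).
- Manual recovery is sometimes still needed.
```
→ Production tip for Consul / etcd: 5 or 7 nodes, always an odd count.

### Single leader vs leaderless
```
Single leader (Raft, Multi-Paxos):
- Simple to reason about.
- Bottleneck (the leader handles every write).

Leaderless (Dynamo):
- Any node can accept a write.
- Needs conflict resolution.
- High throughput.

→ Trade-off.
```

### CockroachDB / Spanner
```
CockroachDB splits data into ranges (64 MB by default in older versions, 512 MiB in newer ones).
Each range has its own Raft group (Spanner does the same with Paxos groups).
1000 ranges = 1000 leaders (parallel writes).
→ The key to scaling.
```

### Distributed lock (Raft style)
```ts
// etcd, assuming the `etcd3` npm package (the original snippet named no client library).
import { Etcd3 } from 'etcd3';

const client = new Etcd3();
// Acquire 'lock' under a 10-second lease so it expires if the holder crashes.
const lock = client.lock('lock').ttl(10);
await lock.acquire(); // rejects if another client already holds the lock
// Other clients can wait by watching the key until it is deleted.
const watcher = await client.watch().key('lock').create();
watcher.on('delete', () => { /* retry acquire() here */ });
```
→ etcd supports this natively (leases and watches).

### Failure modes
```
- Slow network → timeouts / retries.
- Network partition → split-brain (rare).
- Node crash → leader re-election.
- Disk full → writes fail.
- Clock skew → timestamp ordering and leases become unreliable (use HLC).
```

### Monitoring
```
- Leader changes (frequent = a problem).
- Log lag (how far followers trail the leader).
- Quorum size (count of down nodes).
- Apply latency.
```

### Gossip protocol (a different thing)
```
Every node sends information to random peers.
- Used by Cassandra / Consul / Riak.
- Spreads exponentially: reaches all N nodes in roughly log N rounds.
- Eventually consistent.
→ Membership / failure detection.
```
→ Not the same as consensus.

### Two-phase commit (2PC)
```
Coordinator + N participants.

Phase 1: prepare (lock + log).
Phase 2: commit / abort (everyone acks).

→ Cross-DB transactions. "Commit only if every participant says OK."

Pitfalls:
- Participants are stuck if the coordinator goes down.
- Very slow.
- Often avoided in large systems.
```
→ Saga is the modern alternative.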
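
The two phases described above can be sketched as a tiny in-memory coordinator. This is a minimal illustration only, not a production protocol: `Participant`, `makeParticipant`, and `twoPhaseCommit` are hypothetical names, the calls are synchronous where a real system would make network RPCs, and there is no durable coordinator log (which is exactly why a coordinator crash leaves real participants stuck).

```typescript
type Vote = "yes" | "no";
type Outcome = "committed" | "aborted";

// Hypothetical participant interface; a real one would be a remote database.
interface Participant {
  prepare(txId: string): Vote; // phase 1: lock resources, log intent, vote
  commit(txId: string): void;  // phase 2: make the change durable
  abort(txId: string): void;   // phase 2: release locks, undo
}

// Test double that records every call in a shared event log.
function makeParticipant(name: string, vote: Vote, eventLog: string[]): Participant {
  return {
    prepare: (txId) => { eventLog.push(`${name}:prepare:${txId}`); return vote; },
    commit: (txId) => { eventLog.push(`${name}:commit:${txId}`); },
    abort: (txId) => { eventLog.push(`${name}:abort:${txId}`); },
  };
}

function twoPhaseCommit(txId: string, participants: Participant[]): Outcome {
  // Phase 1: collect votes, stopping at the first "no" (or failure).
  const prepared: Participant[] = [];
  let allYes = true;
  for (const p of participants) {
    let vote: Vote = "no";
    try { vote = p.prepare(txId); } catch { /* treat a failure as "no" */ }
    if (vote === "yes") {
      prepared.push(p);
    } else {
      allYes = false;
      break;
    }
  }
  // A real coordinator would durably log its decision at this point.
  // Phase 2: commit everywhere, or abort everyone who already prepared.
  if (allYes) {
    for (const p of participants) p.commit(txId);
    return "committed";
  }
  for (const p of prepared) p.abort(txId);
  return "aborted";
}

const eventLog: string[] = [];
const committedOutcome = twoPhaseCommit("tx1", [
  makeParticipant("orders", "yes", eventLog),
  makeParticipant("payments", "yes", eventLog),
]);
const abortedOutcome = twoPhaseCommit("tx2", [
  makeParticipant("orders", "yes", eventLog),
  makeParticipant("payments", "no", eventLog),
]);
console.log(committedOutcome, abortedOutcome);
```

A single "no" vote aborts the whole transaction, and only participants that already voted "yes" need an explicit abort, since they are the ones holding locks.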
→ [[Backend_Saga_Choreography_vs_Orchestration]].

### Real-world
```
- etcd: the brain of K8s.
- Consul: service discovery / service mesh.
- ZooKeeper: older (used by pre-KRaft Kafka).
- TiKV / CockroachDB: distributed KV / SQL storage.
- Apache BookKeeper: log storage.
- Kafka: its own KRaft (replaces ZK).
```

## 🤔 Decision criteria

| Situation | Recommendation |
|---|---|
| Strong consistency | Raft (etcd, CockroachDB) |
| Eventually consistent | Dynamo / Cassandra |
| Distributed lock | etcd / Consul |
| Service discovery | Consul / etcd |
| BFT | Tendermint / blockchain |
| Small system | Single-node DB |
| Cross-DB transaction | Saga (NOT 2PC) |

## ❌ Anti-patterns

- **Even node count (4)**: same fault tolerance as 3 nodes, and a 2-2 partition leaves neither side with a majority.
- **W = N**: one node down = writes fail.
- **Assuming wall clocks agree across nodes**: use HLC.
- **2PC in large systems**: use an alternative (saga).
- **Manual leader election**: breaks often.
- **No monitoring**: failures stay silent.

## 🤖 LLM usage hints

- Raft is the modern, easier-to-understand counterpart to Paxos.
- Dynamo style = AP (eventual).
- R + W > N is the strong-consistency rule.
- Odd node count (3, 5, 7).

## 🔗 Related documents

- [[CS_Distributed_Consensus]]
- [[CS_Eventual_Consistency]]
- [[CS_Vector_Clocks_Lamport]]