---
id: cs-distributed-consensus
title: Distributed Consensus - Raft / Paxos / Leader Election
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [cs, distributed, consensus, vibe-coding]
tech_stack: { language: "Concept", applicable_to: ["Backend"] }
applied_in: []
aliases: [Raft, Paxos, leader election, etcd, ZooKeeper, consensus, quorum]
---

# Distributed Consensus

> N nodes agree on the same decision (a leader, a value). **Raft (modern, understandable), Paxos (classic), Zab (ZooKeeper)**. etcd, Consul, and ZooKeeper are implementations. Related: the CAP theorem.

## 📖 Core concepts

- Consensus: all nodes agree on the same value.
- Quorum: a majority, i.e. floor(N/2) + 1 nodes.
- Leader election: one node is chosen to accept and order writes.
- Log replication: the leader copies its log to the followers.

## 💻 Code patterns

### Why consensus

```
In a distributed system:
- Which node is the primary?
- Which value is the latest?
- Has everyone agreed to a configuration change?

→ A consensus protocol answers these.
```

### Raft (modern, recommended)

```
Roles:
- Leader: accepts writes
- Follower: follows the leader
- Candidate: running for leader

Election:
1. A follower hears no leader heartbeat within its election timeout → becomes a candidate
2. It increments its term and votes for itself
3. It sends RequestVote RPCs to the other nodes
4. It wins a majority of votes → becomes leader
5. It starts sending AppendEntries (heartbeats)

Log replication:
1. Client → leader
2. Leader appends the entry to its log
3. Leader sends AppendEntries to the followers
4. Majority ack → the entry is committed
5. Leader applies the entry, then responds to the client
```

→ Raft was designed as an "understandable" alternative to Paxos.

### Raft term

```
Term = monotonically increasing counter. Every election starts a new term.

Term 0: start
Term 1: leader A
Term 2: leader B (A died)
...
```

### Quorum

```
N = 5 nodes. Majority = 3.

Write quorum: 3 nodes must persist the entry before it is committed
Read: from the leader alone (can be stale if the leader was just deposed),
      or from the leader after it confirms leadership with a majority (linearizable read)

→ During a network partition, the minority side cannot make progress.
```

### CAP theorem

```
Consistency: every node sees the same value.
Availability: every request gets a response.
Partition tolerance: the system survives a network partition.

→ During a network partition you must choose between C and A.
```

```
CP: ZooKeeper, etcd, MongoDB (default settings).
AP: Cassandra, DynamoDB.
CA: single node (no partition).
```

### etcd (Raft, the backing store of K8s)

```bash
# 3-node cluster: run on node1, then repeat on node2 / node3 with their own name and URLs
etcd \
  --name node1 \
  --listen-peer-urls http://10.0.0.1:2380 \
  --listen-client-urls http://10.0.0.1:2379 \
  --advertise-client-urls http://10.0.0.1:2379 \
  --initial-advertise-peer-urls http://10.0.0.1:2380 \
  --initial-cluster node1=http://10.0.0.1:2380,node2=http://10.0.0.2:2380,node3=http://10.0.0.3:2380 \
  --initial-cluster-state new
```

```ts
import { Etcd3 } from 'etcd3';

const client = new Etcd3({
  hosts: ['10.0.0.1:2379', '10.0.0.2:2379', '10.0.0.3:2379']
});

// Put
await client.put('/config/feature-x').value('enabled');

// Get
const value = await client.get('/config/feature-x').string();

// Watch: react to every change of the key
client.watch().key('/config/feature-x').create().then(watcher => {
  watcher.on('put', (v) => console.log('Changed:', v.value.toString()));
});

// Lease (TTL): keys attached to the lease disappear when it expires
const lease = client.lease(60); // 60s
await lease.put('/services/my-app/instance-1').value('healthy');
// The key is deleted once the lease stops being kept alive for 60s
// (for example, after this process dies).
```

→ Holds K8s cluster state. Also used for service discovery.

### Consul

```ts
import Consul from 'consul';

// Connects to the local Consul agent by default
const consul = new Consul();

// KV
await consul.kv.set('config/feature-x', 'enabled');
const value = await consul.kv.get('config/feature-x');

// Service registration with an HTTP health check
await consul.agent.service.register({
  name: 'my-app',
  id: 'my-app-1',
  address: '10.0.0.1',
  port: 3000,
  check: {
    http: 'http://10.0.0.1:3000/health',
    interval: '10s',
  },
});

// Find instances of a service, with their health status
const services = await consul.health.service('my-app');
```

→ Service discovery + KV. Multi-DC support.

### ZooKeeper (Zab)

```bash
# 3-node ZK ensemble.
# Java-based (older).

zkCli.sh
> create /myapp ""
> create /myapp/config "value"
> get /myapp/config
> ls /myapp
```

→ Historically the cluster coordinator for Kafka, HBase, and Hadoop.

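### RequestVote handling (sketch)

The Raft election steps listed earlier (a candidate increments its term and asks the other nodes for votes) leave open how a node decides whether to grant a vote. The block below is a minimal sketch of that decision, following the rules in the Raft paper: reject stale terms, grant at most one vote per term, and only vote for a candidate whose log is at least as up to date. The type names (`NodeState`, `RequestVote`, `VoteReply`) and the function `handleRequestVote` are hypothetical and not taken from any library. Persistence, timers, and networking are left out.

```typescript
// Hypothetical types for illustration; not the API of any Raft library.
interface RequestVote {
  term: number;         // candidate's current term
  candidateId: string;
  lastLogIndex: number; // index of the candidate's last log entry
  lastLogTerm: number;  // term of the candidate's last log entry
}

interface NodeState {
  currentTerm: number;
  votedFor: string | null; // who this node voted for in currentTerm
  lastLogIndex: number;
  lastLogTerm: number;
}

interface VoteReply {
  term: number;
  voteGranted: boolean;
}

function handleRequestVote(state: NodeState, req: RequestVote): VoteReply {
  // A candidate from an older term is rejected outright.
  if (req.term < state.currentTerm) {
    return { term: state.currentTerm, voteGranted: false };
  }

  // A newer term resets this node's vote for the new term.
  if (req.term > state.currentTerm) {
    state.currentTerm = req.term;
    state.votedFor = null;
  }

  // The candidate's log must be at least as up to date as ours:
  // compare the last entry's term first, then the log length.
  const logUpToDate =
    req.lastLogTerm > state.lastLogTerm ||
    (req.lastLogTerm === state.lastLogTerm &&
      req.lastLogIndex >= state.lastLogIndex);

  // At most one vote per term (re-granting to the same candidate is fine).
  const canVote =
    state.votedFor === null || state.votedFor === req.candidateId;

  if (canVote && logUpToDate) {
    state.votedFor = req.candidateId;
    return { term: state.currentTerm, voteGranted: true };
  }
  return { term: state.currentTerm, voteGranted: false };
}

// Usage: a follower in term 2 receives a request from a term-3 candidate.
const follower: NodeState = {
  currentTerm: 2, votedFor: null, lastLogIndex: 5, lastLogTerm: 2,
};
const reply = handleRequestVote(follower, {
  term: 3, candidateId: 'node-2', lastLogIndex: 5, lastLogTerm: 2,
});
console.log(reply.voteGranted, follower.currentTerm);
```

A real implementation would also have to persist `currentTerm` and `votedFor` to disk before replying. Otherwise a restarted node could vote twice in the same term.
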
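### Leader election (Raft / etcd)

```ts
import { Etcd3 } from 'etcd3';

const client = new Etcd3();
const election = client.election('my-leader');

// Campaign under the name 'node-1'; 'elected' fires once this node wins
const campaign = election.campaign('node-1');

campaign.on('elected', () => {
  console.log('I am leader');
  startLeaderWork(); // your own function
});

campaign.on('error', (err) => {
  console.error(err);
});
```

→ Only one node is the leader. The rest are followers.

### Use case: distributed cron

```
A cron job across N nodes that must run only once:

1. Leader election
2. Only the leader schedules the cron job
3. If the leader dies → new election

→ ZooKeeper / etcd / Redis lock.
```

```ts
// Resolves once this node has won the election; it waits, it does not fail fast.
// `election` is the etcd3 election object from the previous example.
async function waitForLeadership(): Promise<boolean> {
  await election.campaign('cron-leader').wait();
  return true;
}

if (await waitForLeadership()) {
  scheduleCron(); // your own function
}
```

### Distributed lock (etcd / Redis)

```ts
// etcd lock primitive (`client` is an Etcd3 instance)
const lock = client.lock('my-resource');
await lock.acquire();
try {
  await doWork();
} finally {
  await lock.release();
}
```

```ts
// Redis (Redlock)
import Redis from 'ioredis';
import Redlock from 'redlock';

// Three independent Redis instances (host names are placeholders)
const redisA = new Redis({ host: 'redis-a' });
const redisB = new Redis({ host: 'redis-b' });
const redisC = new Redis({ host: 'redis-c' });

const redlock = new Redlock([redisA, redisB, redisC]);

// Hold the lock for at most 30 seconds
const lock = await redlock.acquire(['locks:my-resource'], 30_000);
try {
  await doWork();
} finally {
  await lock.release();
}
```

→ [[DB_Distributed_Locks]].

### Linearizability vs eventual

```
Linearizable: to an outside observer the system behaves like a single node.
- etcd, ZooKeeper
- Spanner

Eventual: replicas converge to the same value eventually.
- Cassandra
- DynamoDB

→ Trade-off. CP vs AP.
```

### Two Generals / Byzantine

```
Two Generals: the network may lose messages, so agreement cannot be guaranteed.
Byzantine: nodes may lie, which is harder still.

Solutions:
- Raft / Paxos: assume honest nodes (crash faults only).
- BFT (Byzantine Fault Tolerance): tolerates adversarial nodes, as in Bitcoin / Ethereum.
- HotStuff, Tendermint: modern BFT.
```

### Bitcoin consensus (PoW)

```
Bitcoin = Byzantine consensus:
- Voting power is proportional to hash power (proof of work), not to identities.
- The chain with the most accumulated work wins (usually the longest).
- Probabilistic finality (6 confirmations is the common rule of thumb).

Energy-expensive. Ethereum moved to PoS in 2022.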
```

### etcd vs Consul vs ZooKeeper

```
etcd:
+ K8s native
+ HTTP / gRPC
+ Modern
- Narrow scope (single purpose)

Consul:
+ Strong service discovery
+ Multi-DC
+ Health checks
- Heavier dependency

ZooKeeper:
+ Mature (Hadoop / Kafka)
+ Very stable
- Java
- Less modern API
```

### Cluster size

```
N = 2: quorum is 2, so no failure is tolerated (worse than a single node).
N = 3: 1 failure OK.
N = 5: 2 failures OK (recommended for larger clusters).
N = 7: 3 failures OK.

Avoid even N (2, 4, 6): same fault tolerance as N - 1, but a larger quorum.

→ Usually 3 or 5.
```

### Multi-region (cross-DC)

```
ZooKeeper / etcd are latency-sensitive (every write goes through consensus).
Cross-region round trips are 100 ms+, so writes become very slow.

Mitigation:
- Keep the quorum in a single region
- Other regions = read replicas (eventually consistent)
```

### Operation

```
- Backup (regular snapshots)
- Disaster recovery (restore from snapshot / config)
- Monitoring (leader changes, replication lag)
- Upgrade (rolling restart)
- Compaction (clean up old log entries)
```

### Failure scenarios

```
1. Leader dies → new election (takes roughly the election timeout; see Time below)
2. Network partition → the minority side cannot make progress
3. Majority of nodes down → cluster unavailable
4. Disk full → writes fail
5. Clock drift or long pauses → spurious elections
```

### Real-world apps

```
K8s: etcd
Consul: service mesh / discovery
ZK: Kafka (legacy mode), Hadoop, HBase
Apache Kafka: its own Raft (KRaft, production-ready since 3.3; ZooKeeper mode removed in 4.0)
CockroachDB: its own Raft
TiDB: TiKV (its own Raft) + PD (embeds etcd)
```

### Implementing Raft (for learning)

```
Raft paper: https://raft.github.io
Visualization: https://thesecretlivesofdata.com/raft/

Implementing it yourself is for learning. Do not run a homegrown implementation in production.
hashicorp/raft (Go), MIT 6.824 lab.
```

### When NOT to use

```
- A single node is enough (small app)
- Stateless service (no consensus needed)
- You only need a simple leader: a Redis lock is enough
- Strong consistency is not required: eventual consistency is OK
```

### Saga (an alternative that is not consensus)

```
Distributed transaction:
- 2PC: blocking, slow
- Saga: compensating actions, fast

→ [[Backend_Saga_Patterns]].
```

### Modern: KRaft (Kafka)

```
Kafka used to depend on ZooKeeper → KRaft (its own Raft; ZooKeeper mode removed in Kafka 4.0).
Single binary. Simpler ops.
```

### Time

```
Leader election: roughly the election timeout
  (150-300 ms suggested in the Raft paper; etcd defaults to 1 s),
  so failover usually completes within seconds.
Write commit: 1-10 ms (single DC).
Cross-DC: 100 ms+.

→ For fast writes, keep the quorum in one DC.
```

### Use cases

```
✅ Service discovery
✅ Configuration store
✅ Leader election (distributed cron)
✅ Distributed lock
✅ Coordination (cluster size)
✅ K8s state

❌ High-throughput data (use Cassandra)
❌ Big files (use S3)
❌ Cache (use Redis)
```

### Failure tolerance

```
3-node etcd: 1 failure OK.
2 failures = quorum lost, cluster unavailable.
All 3 lost = risk of data loss without a backup.

→ 3+ nodes recommended. 5 is stable.
```

### Learning resources

```
- Raft paper (raft.github.io)
- "The Secret Lives of Data" (visual)
- Designing Data-Intensive Applications (book)
- Distributed Systems by Tanenbaum
- etcd / Consul docs
```

## 🤔 Decision criteria

| Task | Recommendation |
|---|---|
| K8s | etcd (built-in) |
| Service discovery | Consul |
| Java ecosystem | ZooKeeper |
| Distributed lock | etcd / Redis Redlock |
| Cluster state | etcd / Consul |
| Small + simple | Redis lock |

## ❌ Anti-patterns

- **2-node consensus**: quorum is 2, so a single failure stops the cluster.
- **Even N**: same fault tolerance as N - 1, with a larger quorum.
- **Single quorum across regions**: writes become very slow.
- **No disk-full monitoring**: the leader gets stuck when writes fail.
- **No backups**: losing the data without a snapshot means losing the cluster.
- **Putting everything in etcd**: not suited to high-throughput data.

## 🤖 LLM usage hints

- Use 3 or 5 nodes.
- Raft is the modern default.
- etcd / Consul are the standard choices.
- Cross-region = single-region quorum + read replicas.

## 🔗 Related documents

- [[CS_Eventual_Consistency]]
- [[DB_Distributed_Locks]]
- [[Backend_Service_Discovery]]
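
### Quorum arithmetic (sketch)

The cluster-size and failure-tolerance rules above (majority quorum, "3 nodes tolerate 1 failure", "avoid even N") all follow from one formula. The sketch below computes it so the numbers can be checked. The function names `quorumSize` and `faultTolerance` are hypothetical helpers, not part of etcd, Consul, or ZooKeeper.

```typescript
// Smallest number of nodes that forms a majority of an n-node cluster.
function quorumSize(n: number): number {
  if (!Number.isInteger(n) || n < 1) {
    throw new RangeError('cluster size must be a positive integer');
  }
  return Math.floor(n / 2) + 1;
}

// How many nodes can fail while a majority is still reachable.
function faultTolerance(n: number): number {
  return n - quorumSize(n);
}

// Print the quorum and tolerated failures for common cluster sizes.
for (const n of [2, 3, 4, 5, 6, 7]) {
  console.log(
    `N=${n}: quorum=${quorumSize(n)}, tolerates ${faultTolerance(n)} failure(s)`
  );
}
```

Comparing N = 3 with N = 4 shows why even sizes are discouraged: both tolerate one failure, but the 4-node cluster needs three acknowledgements per write instead of two.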