---
id: cs-distributed-consensus
title: Distributed Consensus — Raft / Paxos / Leader Election
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [cs, distributed, consensus, vibe-coding]
tech_stack: { language: "Concept", applicable_to: ["Backend"] }
applied_in: []
aliases: [Raft, Paxos, leader election, etcd, ZooKeeper, consensus, quorum]
---
# Distributed Consensus
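> N nodes agree on the same decision (which node is leader, which value is current). **Raft (modern, understandable), Paxos (classic), Zab (ZooKeeper)**. etcd, Consul, and ZooKeeper are production implementations. Closely tied to the CAP theorem.
## 📖 Core Concepts
- Consensus: all nodes agree on the same value.
- Quorum: a majority, i.e. floor(N/2) + 1 nodes.
- Leader election: nodes pick a single leader to order writes.
- Log replication: the leader copies its log of commands to followers.
## 💻 Code Patterns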
### Why consensus
```
In a distributed system:
- Which node is the primary?
- Which value is the latest?
- Has everyone agreed to a configuration change?
→ A consensus protocol answers these questions.
```
### Raft (modern, recommended)
```
Roles:
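- Leader: accepts writes
- Follower: replicates from the leader
- Candidate: running for leader during an election
Election:
1. A follower hears no leader heartbeat before its election timeout → becomes a candidate
2. It increments its term and votes for itself
3. It sends RequestVote RPCs to the other nodes
4. Votes from a majority → it becomes leader
5. It starts sending AppendEntries (also used as heartbeats)
Log replication:
1. Client sends a command to the leader
2. Leader appends it to its log
3. Leader sends AppendEntries to followers
4. Acks from a majority → entry is committed
5. Leader applies the entry to its state machine and responds to the client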
```
→ Raft was designed as a more understandable alternative to Paxos.
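
Step 4 of log replication (majority ack → committed) can be sketched as a small function. This is an illustration only, not code from any Raft library: `commitIndexFor` and `matchIndex` are hypothetical names, and the sketch leaves out the Raft rule that a leader only commits entries from its own term this way.

```typescript
// Hypothetical helper: highest log index stored on a majority of nodes.
// `matchIndex` holds, for every node including the leader itself,
// the highest log index known to be replicated on that node.
function commitIndexFor(matchIndex: number[]): number {
  const sorted = [...matchIndex].sort((a, b) => b - a); // descending
  const quorum = Math.floor(matchIndex.length / 2) + 1;
  // The quorum-th highest value is held by at least `quorum` nodes.
  return sorted[quorum - 1];
}

// 5 nodes: leader at index 7, followers at 7, 5, 4, 2.
console.log(commitIndexFor([7, 7, 5, 4, 2])); // 5
```

Entries up to index 5 are on three of five nodes, so they are safe to apply; indexes 6 and 7 still wait for more acks.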
### Raft term
```
Term = monotonic counter.
Every election starts a new term.
Term 0: initial state
Term 1: leader A
Term 2: leader B (A died)
...
```
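
How terms interact with voting can be sketched as a RequestVote handler. This is a simplified, in-memory illustration under stated assumptions: `NodeState`, `VoteRequest`, and `handleRequestVote` are hypothetical names, and the check that the candidate's log is at least as up to date as the voter's is omitted for brevity.

```typescript
// Hypothetical in-memory node state; field names follow the Raft paper.
interface NodeState {
  currentTerm: number;
  votedFor: string | null;
}

interface VoteRequest {
  term: number;
  candidateId: string;
}

// Decide whether to grant a vote (log up-to-date check omitted).
function handleRequestVote(state: NodeState, req: VoteRequest): boolean {
  if (req.term < state.currentTerm) return false; // stale candidate
  if (req.term > state.currentTerm) {
    // A newer term resets this node's vote.
    state.currentTerm = req.term;
    state.votedFor = null;
  }
  if (state.votedFor === null || state.votedFor === req.candidateId) {
    state.votedFor = req.candidateId; // at most one vote per term
    return true;
  }
  return false;
}

const voter: NodeState = { currentTerm: 1, votedFor: null };
console.log(handleRequestVote(voter, { term: 2, candidateId: "B" })); // true
console.log(handleRequestVote(voter, { term: 2, candidateId: "C" })); // false
console.log(handleRequestVote(voter, { term: 1, candidateId: "A" })); // false
```

Granting at most one vote per term is what prevents two leaders from being elected in the same term.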
### Quorum
```
N = 5 nodes.
Majority = 3.
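Write quorum: 3 nodes must persist the entry before it commits
Read: leader alone (fast, may be stale if the leader was just deposed) or leader confirmed by a majority (linearizable read)
→ During a network partition, the minority side cannot make progress.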
```
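
The quorum rule and its effect during a partition can be checked with a few lines. A minimal sketch; `quorumSize` and `canMakeProgress` are illustrative names, not part of any library.

```typescript
// Smallest number of nodes that forms a majority of `n`.
function quorumSize(n: number): number {
  return Math.floor(n / 2) + 1;
}

// A partition side can elect a leader and commit writes only with a quorum.
function canMakeProgress(reachable: number, clusterSize: number): boolean {
  return reachable >= quorumSize(clusterSize);
}

// 5-node cluster split 3 / 2: only the 3-node side keeps working.
console.log(quorumSize(5));         // 3
console.log(canMakeProgress(3, 5)); // true
console.log(canMakeProgress(2, 5)); // false
```

Because two disjoint groups can never both hold a majority, at most one side of a partition accepts writes.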
### CAP theorem
```
Consistency: every node returns the same value.
Availability: every request gets a response.
Partition tolerance: the system keeps working despite a network partition.
→ During a network partition you must pick either C or A.
```
```
CP: ZooKeeper, etcd, MongoDB (default).
AP: Cassandra, DynamoDB.
CA: single node (no partition possible).
```
### etcd (Raft, the backing store for K8s)
```bash
# node1 of a 3-node cluster; run the same on node2/node3 with their own name and IPs
etcd \
  --name node1 \
  --listen-peer-urls http://10.0.0.1:2380 \
  --listen-client-urls http://10.0.0.1:2379 \
  --advertise-client-urls http://10.0.0.1:2379 \
  --initial-advertise-peer-urls http://10.0.0.1:2380 \
  --initial-cluster node1=http://10.0.0.1:2380,node2=http://10.0.0.2:2380,node3=http://10.0.0.3:2380 \
  --initial-cluster-state new
```
```ts
import { Etcd3 } from 'etcd3';
const client = new Etcd3({ hosts: ['10.0.0.1:2379', '10.0.0.2:2379', '10.0.0.3:2379'] });
// Put
await client.put('/config/feature-x').value('enabled');
// Get (resolves to null if the key does not exist)
const value = await client.get('/config/feature-x').string();
// Watch for changes to a single key
const watcher = await client.watch().key('/config/feature-x').create();
watcher.on('put', (kv) => console.log('Changed:', kv.value.toString()));
// Lease (TTL): keys attached to the lease are deleted when it expires
const lease = client.lease(60); // 60s
await lease.put('/services/my-app/instance-1').value('healthy');
// The key is removed about 60s after keepalives stop (e.g. the process dies)
```
→ Stores K8s cluster state. Also used for service discovery.
### Consul
```ts
import Consul from 'consul';
const consul = new Consul();
// KV
await consul.kv.set('config/feature-x', 'enabled');
const value = await consul.kv.get('config/feature-x');
// Service registration
await consul.agent.service.register({
  name: 'my-app',
  id: 'my-app-1',
  address: '10.0.0.1',
  port: 3000,
  check: {
    http: 'http://10.0.0.1:3000/health',
    interval: '10s',
  },
});
// Find service
const services = await consul.health.service('my-app');
```
→ Service discovery + KV. Multi-DC.
### ZooKeeper (Zab)
```bash
# 3-node ZK ensemble.
# Java-based (older).
zkCli.sh
> create /myapp ""
> create /myapp/config "value"
> get /myapp/config
> ls /myapp
```
→ Cluster coordination for Kafka (pre-KRaft), HBase, and Hadoop.
### Leader election (Raft / etcd)
```ts
import { Etcd3 } from 'etcd3';
const client = new Etcd3();
const election = client.election('my-leader');
// Campaign under this node's ID; the returned Campaign emits events
const campaign = election.campaign('node-1');
campaign.on('elected', () => {
  console.log('I am leader');
  startLeaderWork(); // placeholder for application-specific leader logic
});
campaign.on('error', (err) => {
  console.error(err);
});
```
→ Only one node is leader at a time; the rest wait as followers.
### Use case: distributed cron
```
A cron job deployed on N nodes that must run only once:
1. Run a leader election
2. Only the leader schedules the cron job
3. If the leader dies → a new election picks a replacement
→ Backed by ZooKeeper, etcd, or a Redis lock.
```
```ts
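// `election.campaign()` returns an event emitter, not a promise, so wrap it.
// Reuses `client` from the leader-election example above.
const election = client.election('cron-leader');
function becomeLeader(nodeId: string): Promise<void> {
  return new Promise((resolve, reject) => {
    const campaign = election.campaign(nodeId);
    campaign.on('elected', () => resolve()); // fires once this node wins
    campaign.on('error', reject);
  });
}
await becomeLeader('node-1'); // followers wait here until they win
scheduleCron(); // placeholder for the application's scheduler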
```
### Distributed lock (etcd / Redis)
```ts
// etcd lock primitive (reuses `client` from the etcd examples above)
const lock = client.lock('my-resource');
await lock.acquire();
try {
  await doWork(); // placeholder for the critical section
} finally {
  await lock.release();
}
```
```ts
// Redis (Redlock); redisA/B/C are independent Redis clients (e.g. ioredis) created elsewhere
import Redlock from 'redlock';
const redlock = new Redlock([redisA, redisB, redisC]);
const resource = await redlock.acquire(['locks:my-resource'], 30_000); // 30s TTL
try {
  await doWork();
} finally {
  await resource.release();
}
```
→ [[DB_Distributed_Locks]].
### Linearizability vs eventual
```
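Linearizable: to an outside observer the system behaves like a single node.
- etcd, ZooKeeper (writes; reads require `sync`)
- Spanner
Eventual: replicas converge to the same value eventually.
- Cassandra (by default)
- DynamoDB (by default)
→ Trade-off: CP vs AP.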
```
### Two Generals / Byzantine
```
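Two Generals: messages can be lost, so guaranteed agreement over an unreliable link is impossible.
Byzantine: nodes may lie or act maliciously, which is harder still.
Solutions:
- Raft / Paxos: assume honest nodes (crash faults only).
- BFT (Byzantine Fault Tolerance): tolerates adversarial nodes; used by Bitcoin / Ethereum.
- HotStuff, Tendermint: modern BFT protocols.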
```
### Bitcoin consensus (PoW)
```
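Bitcoin = Byzantine consensus in an open network:
- Voting power is proportional to hash power (proof of work), not to identities.
- The chain with the most accumulated work wins (usually the longest).
- Probabilistic finality (6 confirmations by convention).
Energy cost is high, which is one reason Ethereum moved to PoS (2022).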
```
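
Proof of work itself is simple to sketch: keep trying nonces until the hash meets a difficulty target. The toy below assumes Node.js and simplifies heavily (real Bitcoin double-hashes a block header and compares against a numeric target); `mine` and the `data:nonce` input format are made up for illustration.

```typescript
import { createHash } from "node:crypto";

// Toy proof of work: find a nonce whose SHA-256 hash of `data:nonce`
// starts with `difficulty` zero hex digits.
function mine(data: string, difficulty: number): { nonce: number; hash: string } {
  const prefix = "0".repeat(difficulty);
  for (let nonce = 0; ; nonce++) {
    const hash = createHash("sha256").update(`${data}:${nonce}`).digest("hex");
    if (hash.startsWith(prefix)) return { nonce, hash };
  }
}

const result = mine("block-1", 3);
console.log(result.nonce, result.hash); // hash starts with "000"
```

Each extra zero digit multiplies the expected work by 16, while verifying a result takes a single hash. That asymmetry is what makes rewriting history expensive.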
### Etcd vs Consul vs ZooKeeper
```
etcd:
+ K8s native
+ HTTP / gRPC
+ Modern
- Narrow scope (KV store only)
Consul:
+ Strong service discovery
+ Multi-DC
+ Health check
- Heavier dependency
ZooKeeper:
+ Mature (Hadoop / Kafka)
+ Very stable
- Java
- Less modern API
```
### Cluster size
```
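N = 2: quorum is 2, so a single failure halts the cluster.
N = 3: tolerates 1 failure.
N = 5: tolerates 2 failures (recommended for larger deployments).
N = 7: tolerates 3 failures.
Avoid even N (2, 4, 6): same fault tolerance as N - 1, but a larger quorum.
→ Usually 3 or 5.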
```
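
The figures above follow directly from the majority rule, and a short loop shows why even sizes do not help. A minimal sketch; `faultTolerance` is an illustrative name.

```typescript
// Failures a majority-quorum cluster of `n` nodes survives.
function faultTolerance(n: number): number {
  return Math.floor((n - 1) / 2);
}

for (const n of [2, 3, 4, 5, 6, 7]) {
  console.log(`N=${n} quorum=${Math.floor(n / 2) + 1} tolerates=${faultTolerance(n)}`);
}
// N=2 quorum=2 tolerates=0
// N=3 quorum=2 tolerates=1
// N=4 quorum=3 tolerates=1
// N=5 quorum=3 tolerates=2
// N=6 quorum=4 tolerates=2
// N=7 quorum=4 tolerates=3
```

Going from 3 to 4 nodes raises the quorum from 2 to 3 without tolerating any additional failure.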
### Multi-region (cross-DC)
```
ZooKeeper / etcd are latency-sensitive (every write goes through consensus).
Cross-region round trips are 100ms+, so writes become very slow.
Mitigations:
- Keep the quorum within a single region
- Treat other regions as read replicas (eventually consistent)
```
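
Why node placement matters can be estimated with a rough model: a write commits once the leader has acks from enough followers to form a majority, so latency is set by the slowest follower in that fastest majority. The sketch below ignores disk fsync and processing time; `commitLatencyMs` and the round-trip values are hypothetical.

```typescript
// Estimated commit latency for one write, given the leader-to-follower
// round-trip times in milliseconds. The leader counts as one ack itself,
// so it waits for the (quorum - 1) fastest followers.
function commitLatencyMs(followerRttMs: number[]): number {
  const clusterSize = followerRttMs.length + 1;
  const acksNeeded = Math.floor(clusterSize / 2); // quorum - 1
  if (acksNeeded === 0) return 0; // single node: no network wait
  const sorted = [...followerRttMs].sort((a, b) => a - b);
  return sorted[acksNeeded - 1];
}

// 3 nodes in one region: followers about 1-2 ms away.
console.log(commitLatencyMs([1, 2])); // 1
// 3 nodes, one per region: nearest follower is 100 ms away.
console.log(commitLatencyMs([100, 150])); // 100
// 5 nodes, 3 local + 2 remote: the quorum is reached locally.
console.log(commitLatencyMs([1, 2, 100, 120])); // 2
```

The last case illustrates the single-region quorum idea: remote nodes still receive the log, but they are not on the critical path of a write.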
### Operation
```
- Backup (regular snapshot)
- Disaster recovery (config restore)
- Monitoring (leader change, lag)
- Upgrade (rolling restart)
- Compaction (prune old log entries)
```
### Failure scenarios
```
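1. Leader dies → new election (roughly one election timeout; etcd defaults to 1000 ms)
2. Network partition → the minority side cannot make progress
3. Majority of nodes down → cluster unavailable
4. Disk full → writes fail
5. Clock drift → can disturb election timeouts and leases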
```
### Real-world apps
```
K8s: etcd
Consul: service mesh / discovery
ZK: Kafka, Hadoop, HBase
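Apache Kafka: own Raft (KRaft, production-ready since 3.3; ZooKeeper mode removed in 4.0)
CockroachDB: Raft (one consensus group per range)
TiDB: TiKV uses Raft; PD embeds etcd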
```
### Implementing Raft (학습)
```
Raft paper: https://raft.github.io
Visualization: https://thesecretlivesofdata.com/raft/
Implementing Raft yourself is for learning; do not run a homegrown implementation in production.
References: hashicorp/raft (Go), MIT 6.824 labs.
```
### When NOT to use
```
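- A single node is enough (small app)
- Stateless services (nothing to agree on)
- Only a simple leader is needed and occasional overlap is tolerable: a Redis lock is enough
- Strong consistency is not required: eventual consistency is OK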
```
### Saga (an alternative, not consensus)
```
Distributed transaction:
- 2PC: blocking, slow
- Saga: compensating, fast
→ [[Backend_Saga_Patterns]].
```
### Modern: KRaft (Kafka)
```
Kafka used to depend on ZooKeeper → KRaft (its own Raft; ZooKeeper mode removed in Kafka 4.0).
Single binary. Simpler ops.
```
### Time
```
Leader election: roughly one election timeout (150-300 ms in the Raft paper; etcd defaults to 1000 ms).
Write commit: 1-10ms (single DC).
Cross-DC: 100ms+.
→ For speed, keep the quorum in one DC.
```
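
Election time is dominated by the election timeout, which Raft randomizes per node so that followers rarely time out together and split the vote. A minimal sketch: `electionTimeoutMs` is an illustrative name, and 150-300 ms is the example range from the Raft paper (production systems often configure larger values).

```typescript
// Pick an election timeout uniformly from [minMs, maxMs).
// `rng` is injectable so the behaviour can be tested deterministically.
function electionTimeoutMs(
  minMs: number,
  maxMs: number,
  rng: () => number = Math.random,
): number {
  return minMs + Math.floor(rng() * (maxMs - minMs));
}

// Each follower draws its own timeout.
const timeouts = ["A", "B", "C"].map((id) => ({ id, ms: electionTimeoutMs(150, 300) }));
// The node with the smallest timeout becomes a candidate first.
const first = timeouts.reduce((a, b) => (a.ms <= b.ms ? a : b));
console.log(timeouts, "first candidate:", first.id);
```

With a wide enough range, the first candidate usually collects a majority before any other follower times out.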
### Use cases
```
✅ Service discovery
✅ Configuration store
✅ Leader election (distributed cron)
✅ Distributed lock
✅ Coordination (cluster size)
✅ K8s state
❌ High-throughput data (Cassandra)
❌ Big files (S3)
❌ Cache (Redis)
```
### Failure tolerance
```
3-node etcd: tolerates 1 failure.
If all 3 fail, data loss is a real risk.
→ Use at least 3 nodes; 5 is more robust.
```
### Learning resources
```
- Raft paper (raft.github.io)
- "The Secret Lives of Data" (visual)
- Designing Data-Intensive Applications (book)
- Distributed Systems by Tanenbaum
- etcd / Consul docs
```
## 🤔 Decision Criteria
| Task | Recommendation |
|---|---|
| K8s | etcd (built-in) |
| Service discovery | Consul |
| Java ecosystem | ZooKeeper |
| Distributed lock | etcd / Redis Redlock |
| Cluster state | etcd / Consul |
| Small and simple | Redis lock |
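## ❌ Anti-patterns
- **2-node consensus**: quorum is 2, so any single failure halts the cluster.
- **Even N**: same fault tolerance as N - 1, but a larger quorum.
- **Single quorum spanning regions**: writes become very slow.
- **No disk-usage monitoring**: a full disk leaves the leader stuck.
- **No backups**: losing the snapshot means losing the cluster.
- **Putting everything in etcd**: not suited to high-throughput data.
## 🤖 LLM Usage Hints
- Use 3 or 5 nodes.
- Raft is the modern default.
- etcd / Consul are the standard choices.
- Cross-region: single-region quorum plus read replicas.
## 🔗 Related Documents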
- [[CS_Eventual_Consistency]]
- [[DB_Distributed_Locks]]
- [[Backend_Service_Discovery]]