---
id: cs-distributed-consensus
title: Distributed Consensus - Raft / Paxos / Leader Election
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [cs, distributed, consensus, vibe-coding]
tech_stack: { language: "Concept", applicable_to: ["Backend"] }
applied_in: []
aliases: [Raft, Paxos, leader election, etcd, ZooKeeper, consensus, quorum]
---

# Distributed Consensus

> N nodes agree on the same decision (a leader, a value). **Raft (modern, understandable), Paxos (classic), Zab (ZooKeeper)**. etcd, Consul, and ZooKeeper are the common implementations. Related: the CAP theorem.

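## 📖 Core Concepts

- Consensus: every node agrees on the same value.
- Quorum: a majority, floor(N/2) + 1 nodes.
- Leader election: exactly one node coordinates writes at a time.
- Log replication: the leader copies an ordered log of commands to the followers.
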
## 💻 Code Patterns

### Why consensus

```
In a distributed system:
- Which node is the primary?
- Which value is the latest?
- Has everyone agreed on a configuration change?

→ A consensus protocol answers these questions.
```

### Raft (modern, recommended)

```
Roles:
- Leader: accepts writes
- Follower: replicates from the leader
- Candidate: running for leader

Election:
1. A follower hears no leader heartbeat before its election timeout → becomes a candidate
2. Increments its term and votes for itself
3. Sends RequestVote RPCs to the other nodes
4. Wins votes from a majority → becomes leader
5. Starts sending AppendEntries (also used as heartbeats)

Log replication:
1. Client → leader
2. Leader appends the entry to its log
3. AppendEntries → followers
4. Majority ack → entry is committed
5. Leader applies the entry to its state machine, then responds to the client
```

→ Raft was designed as an understandable alternative to Paxos.

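The vote-granting side of the election steps can be sketched as below. This is a minimal sketch, not a full Raft node: the `VoteRequest` and `NodeState` shapes and the function name are illustrative assumptions, and persistence, timers, and RPC transport are left out.

```typescript
// Illustrative shapes; field names follow the Raft paper loosely.
interface VoteRequest {
  term: number;         // candidate's term
  candidateId: string;
  lastLogIndex: number; // index of the candidate's last log entry
  lastLogTerm: number;  // term of the candidate's last log entry
}

interface NodeState {
  currentTerm: number;
  votedFor: string | null;
  lastLogIndex: number;
  lastLogTerm: number;
}

// Returns true if this node grants its vote; mutates state like a real node would.
function handleRequestVote(state: NodeState, req: VoteRequest): boolean {
  // Reject candidates from an older term.
  if (req.term < state.currentTerm) return false;

  // Seeing a newer term moves us to that term and clears our vote.
  if (req.term > state.currentTerm) {
    state.currentTerm = req.term;
    state.votedFor = null;
  }

  // The candidate's log must be at least as up to date as ours.
  const logOk =
    req.lastLogTerm > state.lastLogTerm ||
    (req.lastLogTerm === state.lastLogTerm &&
      req.lastLogIndex >= state.lastLogIndex);

  // At most one vote per term.
  const canVote =
    state.votedFor === null || state.votedFor === req.candidateId;

  if (logOk && canVote) {
    state.votedFor = req.candidateId;
    return true;
  }
  return false;
}
```

The one-vote-per-term rule is what prevents two candidates from both collecting a majority in the same term.
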
### Raft term

```
Term = monotonically increasing counter.
Every election starts a new term.

Term 0: initial state
Term 1: leader A
Term 2: leader B (A died)
...
```

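Terms act as a logical clock: a node that sees a higher term adopts it and steps down, and messages from lower terms are ignored. A minimal sketch of that rule, with the `TermState` type and function name as illustrative assumptions:

```typescript
type Role = "follower" | "candidate" | "leader";

interface TermState {
  role: Role;
  currentTerm: number;
}

// Returns true if the message should be processed, false if it is stale.
function observeTerm(state: TermState, messageTerm: number): boolean {
  if (messageTerm > state.currentTerm) {
    // A newer term exists: adopt it and step down to follower.
    state.currentTerm = messageTerm;
    state.role = "follower";
    return true;
  }
  // Same term is fine; an older term means the sender is out of date.
  return messageTerm === state.currentTerm;
}
```

This is why an old leader that comes back after a partition cannot keep acting as leader: the first message it sees from term 2 demotes it.
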
### Quorum

```
N = 5 nodes.
Majority = 3.

Write quorum: 3 nodes must persist the entry before it is committed.
Read: leader-local (fast, can be stale if the leader was just deposed)
      or confirmed with a majority (linearizable read).

→ During a network partition, the minority side cannot make progress.
```

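The "majority ack → committed" rule can be computed from how far each node's log has been replicated. A minimal sketch, assuming a `matchIndex` array that includes the leader itself; real Raft additionally requires the entry to belong to the leader's current term, which is omitted here.

```typescript
// matchIndex[i] = highest log index known to be stored on node i (leader included).
// Returns the highest index stored on a majority of nodes.
function majorityMatchIndex(matchIndex: number[]): number {
  if (matchIndex.length === 0) throw new Error("cluster must have at least one node");
  const sorted = [...matchIndex].sort((a, b) => b - a); // descending
  const quorum = Math.floor(matchIndex.length / 2) + 1;
  // The quorum-th largest value is stored on at least `quorum` nodes.
  return sorted[quorum - 1];
}
```

With five nodes at indices `[7, 7, 5, 4, 2]`, entries up to index 5 are on three nodes and therefore safe to commit; indices 6 and 7 are not yet.
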
### CAP theorem

```
Consistency: every node returns the same value.
Availability: every request gets a response.
Partition tolerance: the system keeps working through a network partition.

→ During a network partition you must choose between C and A.
```

```
CP: ZooKeeper, etcd, MongoDB (default).
AP: Cassandra, DynamoDB.
CA: single node (no partition possible).
```

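The C-versus-A choice shows up when a node on one side of a partition receives a write. The sketch below is a deliberately simplified model, with the `Mode` type and function name as assumptions: a CP system refuses the write without a majority, while an AP system accepts it and reconciles later.

```typescript
type Mode = "CP" | "AP";

// reachable = number of nodes this side of the partition can talk to, itself included.
function handleWriteDuringPartition(
  mode: Mode,
  reachable: number,
  clusterSize: number,
): "accepted" | "rejected" {
  const hasMajority = reachable >= Math.floor(clusterSize / 2) + 1;
  if (mode === "CP") {
    // Consistency first: without a majority the write could diverge, so refuse it.
    return hasMajority ? "accepted" : "rejected";
  }
  // Availability first: accept locally and resolve conflicts after the partition heals.
  return "accepted";
}
```
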
### etcd (Raft, the backing store of K8s)

```bash
# 3-node cluster; run once per node with that node's name and IP
etcd \
  --name node1 \
  --listen-peer-urls http://10.0.0.1:2380 \
  --listen-client-urls http://10.0.0.1:2379 \
  --advertise-client-urls http://10.0.0.1:2379 \
  --initial-advertise-peer-urls http://10.0.0.1:2380 \
  --initial-cluster node1=http://10.0.0.1:2380,node2=http://10.0.0.2:2380,node3=http://10.0.0.3:2380 \
  --initial-cluster-state new
```

```ts
import { Etcd3 } from 'etcd3';

const client = new Etcd3({ hosts: ['10.0.0.1:2379', '10.0.0.2:2379', '10.0.0.3:2379'] });

// Put
await client.put('/config/feature-x').value('enabled');

// Get
const value = await client.get('/config/feature-x').string();

// Watch: fires on every later put to the key
client.watch().key('/config/feature-x').create().then(watcher => {
  watcher.on('put', (v) => console.log('Changed:', v.value.toString()));
});

// Lease (TTL): keys attached to the lease are deleted when it expires
const lease = client.lease(60); // 60 s TTL
await lease.put('/services/my-app/instance-1').value('healthy');
// The key is removed about 60 s after keepalives stop (e.g. the process dies)
```

→ Stores K8s cluster state. Also used for service discovery.

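The lease behaviour above (a key that disappears unless its owner keeps refreshing it) can be illustrated without a running etcd. The following is an in-memory sketch only; `LeaseStore` and its methods are hypothetical names, not part of the etcd3 API, and the clock is injectable so expiry can be exercised deterministically.

```typescript
// In-memory illustration of TTL leases; not the etcd3 client API.
class LeaseStore {
  private entries = new Map<string, { value: string; expiresAt: number }>();
  private now: () => number;

  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  // Stores a key that expires ttlMs from now unless kept alive.
  put(key: string, value: string, ttlMs: number): void {
    this.entries.set(key, { value, expiresAt: this.now() + ttlMs });
  }

  // Extends the lease; returns false if it has already expired.
  keepAlive(key: string, ttlMs: number): boolean {
    const entry = this.entries.get(key);
    if (!entry || entry.expiresAt <= this.now()) {
      this.entries.delete(key);
      return false;
    }
    entry.expiresAt = this.now() + ttlMs;
    return true;
  }

  // Returns the value, or undefined once the lease has expired.
  get(key: string): string | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }
}
```

For service discovery this means a crashed instance drops out of the registry on its own, with no explicit deregistration call.
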
### Consul

```ts
import Consul from 'consul';

const consul = new Consul(); // defaults to the local agent on 127.0.0.1:8500

// KV
await consul.kv.set('config/feature-x', 'enabled');
const entry = await consul.kv.get('config/feature-x'); // object with Key, Value, ...

// Service registration with an HTTP health check
await consul.agent.service.register({
  name: 'my-app',
  id: 'my-app-1',
  address: '10.0.0.1',
  port: 3000,
  check: {
    http: 'http://10.0.0.1:3000/health',
    interval: '10s',
  },
});

// Find service
const services = await consul.health.service('my-app');
```

→ Service discovery + KV. Multi-DC.

### ZooKeeper (Zab)

```bash
# 3-node ZK ensemble.
# Java-based (older).

zkCli.sh
> create /myapp ""
> create /myapp/config "value"
> get /myapp/config
> ls /myapp
```

→ Cluster coordination for HBase, Hadoop, and Kafka (before KRaft).

### Leader election (Raft / etcd)

```ts
import { Etcd3 } from 'etcd3';

const client = new Etcd3();
const election = client.election('my-leader');
const campaign = election.campaign('node-1'); // value identifies this candidate

campaign.on('elected', () => {
  console.log('I am leader');
  startLeaderWork(); // application-defined
});

campaign.on('error', (err) => {
  console.error(err);
});
```

→ Only one node is leader at a time. The rest wait as followers.

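### Use case: distributed cron

```
A cron job deployed on N nodes must run only once:

1. Leader election
2. Only the leader runs the cron schedule
3. If the leader dies → new election

→ ZooKeeper / etcd / Redis lock.
```

```ts
import { Etcd3 } from 'etcd3';

const election = new Etcd3().election('cron-leader');

// Resolves once this node wins the election; rejects on a campaign error.
function becomeLeader(nodeId: string): Promise<void> {
  return new Promise((resolve, reject) => {
    const campaign = election.campaign(nodeId);
    campaign.on('elected', () => resolve());
    campaign.on('error', reject);
  });
}

await becomeLeader('node-1');
scheduleCron(); // application-defined; runs on the leader only
```
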
### Distributed lock (etcd / Redis)

```ts
// etcd lock primitive (client is an Etcd3 instance)
const lock = client.lock('my-resource');
await lock.acquire();
try {
  await doWork();
} finally {
  await lock.release();
}
```

```ts
// Redis (Redlock)
import Redlock from 'redlock';

// redisA, redisB, redisC: clients for three independent Redis instances
const redlock = new Redlock([redisA, redisB, redisC]);
const lock = await redlock.acquire(['locks:my-resource'], 30_000); // 30 s TTL
try {
  await doWork();
} finally {
  await lock.release();
}
```

→ [[DB_Distributed_Locks]].

### Linearizability vs eventual

```
Linearizable: to an outside observer the system behaves like a single node.
- etcd, ZooKeeper
- Spanner

Eventual: replicas converge to the same value over time.
- Cassandra
- DynamoDB

→ Trade-off. CP vs AP.
```

### Two Generals / Byzantine

```
Two Generals: the network can lose messages, so guaranteed agreement is impossible.
Byzantine: nodes can lie, which is harder still.

Solutions:
- Raft / Paxos: assume honest nodes (crash faults only).
- BFT (Byzantine Fault Tolerance): tolerates adversarial nodes. Examples: Bitcoin, Ethereum.
- HotStuff, Tendermint: modern BFT.
```

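One concrete way to see why Byzantine faults are harder: crash-fault protocols such as Raft need a simple majority, while classical BFT protocols need more than two thirds of the nodes to be honest (n ≥ 3f + 1). A small sketch of the two bounds, with function names chosen for illustration:

```typescript
// Crash faults (Raft / Paxos): a majority must survive, so n >= 2f + 1.
function maxCrashFaults(n: number): number {
  return Math.floor((n - 1) / 2);
}

// Byzantine faults (classical BFT): n >= 3f + 1.
function maxByzantineFaults(n: number): number {
  return Math.floor((n - 1) / 3);
}

for (const n of [3, 4, 5, 7]) {
  console.log(`n=${n} crash=${maxCrashFaults(n)} byzantine=${maxByzantineFaults(n)}`);
}
```

So a 3-node cluster survives one crashed node but no lying node; tolerating a single Byzantine node takes at least four.
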
### Bitcoin consensus (PoW)

```
Bitcoin = Byzantine consensus:
- Voting power is proportional to hash power (proof of work), not to identities.
- The chain with the most accumulated work wins (usually the longest).
- Probabilistic finality (6 confirmations by convention).

Energy-expensive, which is one reason Ethereum moved to PoS.
```

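Proof of work boils down to searching for a nonce whose hash meets a difficulty target. The sketch below is a toy version under stated simplifications: it uses a single SHA-256 over a string and counts leading zero hex digits, whereas Bitcoin hashes a binary block header twice and compares against a numeric target. `mine` and `verify` are illustrative names.

```typescript
import { createHash } from "node:crypto";

function sha256Hex(input: string): string {
  return createHash("sha256").update(input).digest("hex");
}

// Tries nonces until the hash starts with `difficulty` zero hex digits.
function mine(blockData: string, difficulty: number): { nonce: number; hash: string } {
  const prefix = "0".repeat(difficulty);
  for (let nonce = 0; ; nonce++) {
    const hash = sha256Hex(`${blockData}:${nonce}`);
    if (hash.startsWith(prefix)) return { nonce, hash };
  }
}

// Checking a claimed nonce costs one hash, no matter how long mining took.
function verify(blockData: string, nonce: number, difficulty: number): boolean {
  return sha256Hex(`${blockData}:${nonce}`).startsWith("0".repeat(difficulty));
}
```

Each extra hex digit of difficulty multiplies the expected mining work by 16 while verification stays constant, which is the asymmetry the scheme relies on.
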
### Etcd vs Consul vs ZooKeeper

```
etcd:
+ K8s native
+ HTTP / gRPC
+ Modern
- Small scope (single purpose)

Consul:
+ Strong service discovery
+ Multi-DC
+ Health checks
- Heavier dependency

ZooKeeper:
+ Mature (Hadoop / Kafka)
+ Very stable
- Java
- Less modern API
```

### Cluster size

```
N = 2: tolerates 0 failures (majority = 2, so losing either node stops the cluster).
N = 3: tolerates 1 failure.
N = 5: tolerates 2 failures (recommended for larger clusters).
N = 7: tolerates 3 failures.

Avoid even N (2, 4, 6): same fault tolerance as N - 1, but a larger quorum.

→ Usually 3 or 5.
```

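The numbers in the table follow directly from the majority rule. A short sketch that reproduces them (function names are illustrative):

```typescript
// Smallest number of nodes that forms a majority of n.
function quorumSize(n: number): number {
  return Math.floor(n / 2) + 1;
}

// How many nodes can fail while a majority is still reachable.
function faultTolerance(n: number): number {
  return n - quorumSize(n);
}

for (const n of [2, 3, 4, 5, 6, 7]) {
  console.log(`N=${n} quorum=${quorumSize(n)} tolerates=${faultTolerance(n)}`);
}
```

Going from 3 to 4 nodes raises the quorum from 2 to 3 while the tolerated failures stay at 1, which is why even sizes add cost without adding safety.
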
### Multi-region (cross-DC)

```
ZooKeeper / etcd are latency-sensitive (every write goes through consensus).
Cross-region round trips are 100 ms+, so writes become very slow.

Mitigation:
- Keep the quorum inside a single region
- Other regions act as read replicas (eventually consistent)
```

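A rough way to reason about this: the leader can commit as soon as enough followers to form a majority have acknowledged, so commit latency tracks the round trip to the slowest follower inside that majority. The sketch below models only network round trips (disk sync and processing time are ignored), and the function name is an assumption.

```typescript
// followerRttsMs: round-trip time from the leader to each follower, in ms.
function estimateCommitLatencyMs(followerRttsMs: number[]): number {
  const clusterSize = followerRttsMs.length + 1; // followers plus the leader
  const quorum = Math.floor(clusterSize / 2) + 1;
  const acksNeeded = quorum - 1; // the leader counts toward the quorum
  if (acksNeeded === 0) return 0;
  const sorted = [...followerRttsMs].sort((a, b) => a - b);
  // Commit completes when the acksNeeded-th fastest follower has replied.
  return sorted[acksNeeded - 1];
}
```

With followers at `[1, 2, 100, 120]` ms the estimate is 2 ms because two local followers complete the majority; with `[100, 120]` every write pays at least 100 ms.
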
### Operation

```
- Backup (regular snapshots)
- Disaster recovery (restore from snapshot and config)
- Monitoring (leader changes, replication lag)
- Upgrade (rolling restart)
- Compaction (clean up old log entries)
```

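### Failure scenarios

```
1. Leader dies → new election (roughly the election timeout: sub-second to a few seconds)
2. Network partition → the minority side cannot make progress
3. Majority of nodes lost → cluster unavailable for writes
4. Disk full → writes fail
5. Clock drift or long pauses → spurious elections; lease-based reads may go stale
```
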
### Real-world apps

```
K8s: etcd
Consul: service mesh / discovery
ZK: Hadoop, HBase, older Kafka
Apache Kafka: built-in Raft (KRaft; production-ready since 3.3, ZooKeeper removed in 4.0)
CockroachDB: built-in Raft
TiDB: TiKV (Raft) + PD (embedded etcd)
```

### Implementing Raft (for learning)

```
Raft paper: https://raft.github.io
Visualization: https://thesecretlivesofdata.com/raft/

Implementing it yourself is for learning; do not run a homegrown implementation in production.
hashicorp/raft (Go), MIT 6.824 lab.
```

### When NOT to use

```
- A single node is enough (small app)
- Stateless service (no consensus needed)
- Only simple leader election is needed: a Redis lock is enough
- Strong consistency is not required: eventual consistency is OK
```

### Saga (an alternative to consensus)

```
Distributed transaction:
- 2PC: blocking, slow
- Saga: compensating actions, fast
```

→ [[Backend_Saga_Patterns]].

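The "compensating" idea means each step carries its own undo, and a failure triggers the undos of the completed steps in reverse order. The sketch below is synchronous to stay short (real steps would be async service calls), and `SagaStep` and `runSaga` are illustrative names.

```typescript
interface SagaStep {
  name: string;
  action: () => void;     // performs the step; throws on failure
  compensate: () => void; // undoes the step
}

// Runs steps in order; on failure, compensates completed steps in reverse.
function runSaga(steps: SagaStep[]): "committed" | "rolled-back" {
  const done: SagaStep[] = [];
  for (const step of steps) {
    try {
      step.action();
      done.push(step);
    } catch {
      for (const completed of done.reverse()) {
        completed.compensate();
      }
      return "rolled-back";
    }
  }
  return "committed";
}
```

Unlike 2PC, nothing is held locked while waiting on other participants; the trade-off is that intermediate states are visible until compensation finishes.
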
### Modern: KRaft (Kafka)

```
Kafka used to depend on ZooKeeper → replaced by KRaft (built-in Raft; ZooKeeper mode removed in Kafka 4.0).
Single binary. Simpler ops.
```

### Time

```
Leader election: sub-second to a few seconds
  (election timeout: 150-300 ms in the Raft paper, 1000 ms default in etcd).
Write commit: 1-10 ms (single DC).
Cross-DC: 100 ms+.

→ Fast consensus means staying in the same DC.
```

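Election time depends mostly on the election timeout, which Raft randomizes per node so that followers rarely time out together and split the vote. A minimal sketch, assuming a base value with the timeout drawn from [base, 2 × base); the function name and the injectable `random` parameter are illustrative.

```typescript
// Picks a timeout in [baseMs, 2 * baseMs). `random` is injectable for testing.
function randomElectionTimeoutMs(
  baseMs: number,
  random: () => number = Math.random,
): number {
  return baseMs + Math.floor(random() * baseMs);
}

// Example: each follower draws its own timeout, e.g. within 150-300 ms.
const timeouts = [1, 2, 3].map(() => randomElectionTimeoutMs(150));
console.log(timeouts);
```

The node that draws the shortest timeout usually becomes a candidate first and wins before the others time out.
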
### Use cases

```
✅ Service discovery
✅ Configuration store
✅ Leader election (distributed cron)
✅ Distributed lock
✅ Coordination (cluster membership)
✅ K8s state

❌ High-throughput data (use Cassandra)
❌ Big files (use S3)
❌ Cache (use Redis)
```

### Failure tolerance

```
3-node etcd: tolerates 1 failure.
2 failures = no quorum (unavailable). All 3 failing = risk of data loss.

→ Use at least 3 nodes. 5 is the stable choice.
```

### Learning resources

```
- Raft paper (raft.github.io)
- "The Secret Lives of Data" (visual)
- Designing Data-Intensive Applications (book)
- Distributed Systems by Tanenbaum
- etcd / Consul docs
```

## 🤔 Decision Criteria

| Task | Recommendation |
|---|---|
| K8s | etcd (built-in) |
| Service discovery | Consul |
| Java ecosystem | ZooKeeper |
| Distributed lock | etcd / Redis Redlock |
| Cluster state | etcd / Consul |
| Small + simple | Redis lock |

## ❌ Anti-patterns

- **2-node consensus**: majority is 2, so there is no fault tolerance.
- **Even N**: same fault tolerance as N - 1, with a larger quorum.
- **Single quorum across regions**: writes become very slow.
- **No disk-full monitoring**: the leader gets stuck.
- **No backups**: losing the snapshot means losing the cluster.
- **Everything in etcd**: it is not suited to high-throughput data.

## 🤖 LLM Usage Hints

- Use 3 or 5 nodes.
- Raft is the modern default.
- etcd / Consul are the standard choices.
- Cross-region: single-region quorum + read replicas.

## 🔗 Related Documents

- [[CS_Eventual_Consistency]]
- [[DB_Distributed_Locks]]
- [[Backend_Service_Discovery]]