2nd/10_Wiki/Topics/Coding/CS_Distributed_Consensus.md

---
id: cs-distributed-consensus
title: Distributed Consensus — Raft / Paxos / Leader Election
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [cs, distributed, consensus, vibe-coding]
tech_stack: { language: "Concept", applicable_to: ["Backend"] }
applied_in: []
aliases: [Raft, Paxos, leader election, etcd, ZooKeeper, consensus, quorum]
---

# Distributed Consensus

> N nodes must agree on the same decision (leader, value). **Raft (modern, understandable), Paxos (classic), Zab (ZooKeeper)**. etcd, Consul, and ZooKeeper are implementations. Related: CAP theorem.

## 📖 Key Concepts
- Consensus: all nodes agree on the same value.
- Quorum: majority (floor(N/2) + 1).
- Leader election.
- Log replication.

## 💻 Code Patterns

### Why consensus
```
Distributed system:
- Which node is the primary?
- Which value is the latest?
- Has everyone agreed to a configuration change?

→ A consensus protocol answers these.
```

### Raft (modern, recommended)
```
Roles:
- Leader: accepts writes
- Follower: follows the leader
- Candidate: running for leader

Election:
1. Follower hears no leader heartbeat within its election timeout → becomes candidate
2. Increments term and votes for itself
3. Sends RequestVote RPC to the other nodes
4. Majority of votes → becomes leader
5. Starts sending AppendEntries (heartbeat)

Log replication:
1. Client → leader
2. Leader appends the entry to its log
3. AppendEntries → followers
4. Majority ack → committed
5. Leader applies the entry and responds to the client
```

→ "Understandable" Paxos.
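
The "majority ack → committed" step can be sketched as below. This is a minimal illustration under stated assumptions, not a full Raft implementation: `LogEntry`, `advanceCommitIndex`, and the array-based log are hypothetical names chosen for this note. It also shows the Raft rule that a leader only commits by counting replicas for entries from its own term.

```typescript
// Hypothetical sketch: how a Raft leader decides which log index is committed.
interface LogEntry {
  term: number;    // term in which the entry was created
  command: string; // opaque state-machine command
}

/**
 * Returns the leader's new commit index.
 * log[i - 1] holds the entry at 1-based index i.
 * matchIndex holds the highest replicated index per follower (leader excluded).
 */
function advanceCommitIndex(
  log: LogEntry[],
  matchIndex: number[],
  currentTerm: number,
  commitIndex: number,
): number {
  // The leader always holds its own entries, so count it at its last index.
  const replicated = [...matchIndex, log.length].sort((a, b) => b - a);
  const quorum = Math.floor(replicated.length / 2) + 1;
  // The quorum-th largest value is the highest index stored on a majority.
  const majorityIndex = replicated[quorum - 1];
  // Only entries from the current term are committed by counting replicas.
  if (majorityIndex > commitIndex && log[majorityIndex - 1].term === currentTerm) {
    return majorityIndex;
  }
  return commitIndex;
}

const log: LogEntry[] = [
  { term: 1, command: 'set x=1' },
  { term: 2, command: 'set x=2' },
  { term: 2, command: 'set x=3' },
];

// 5-node cluster: leader plus four followers that have replicated up to 3, 2, 1, 0.
const newCommit = advanceCommitIndex(log, [3, 2, 1, 0], 2, 1);
console.log(newCommit); // 2: index 2 is on three of five nodes, index 3 only on two
```

Sorting the replicated indexes and taking the quorum-th largest avoids scanning every candidate index; a real implementation would also persist state and apply committed entries to the state machine.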

### Raft term
```
Term = monotonically increasing counter.
Every election starts a new term.

Term 0: initial state
Term 1: leader A
Term 2: leader B (A died)
...
```
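
How terms interact with voting can be sketched as a RequestVote handler. This is a simplified sketch, assuming in-memory state only; `NodeState`, `RequestVote`, and `handleRequestVote` are hypothetical names, and a real node would persist `currentTerm` and `votedFor` before replying.

```typescript
// Hypothetical sketch of a node handling a RequestVote RPC.
interface NodeState {
  currentTerm: number;
  votedFor: string | null;
  lastLogIndex: number;
  lastLogTerm: number;
}

interface RequestVote {
  term: number;
  candidateId: string;
  lastLogIndex: number;
  lastLogTerm: number;
}

function handleRequestVote(state: NodeState, req: RequestVote): boolean {
  // A candidate from an older term is stale: reject.
  if (req.term < state.currentTerm) return false;

  // A newer term always wins: adopt it and forget the vote from the old term.
  if (req.term > state.currentTerm) {
    state.currentTerm = req.term;
    state.votedFor = null;
  }

  // At most one vote per term.
  const canVote = state.votedFor === null || state.votedFor === req.candidateId;

  // The candidate's log must be at least as up to date as ours.
  const logOk =
    req.lastLogTerm > state.lastLogTerm ||
    (req.lastLogTerm === state.lastLogTerm && req.lastLogIndex >= state.lastLogIndex);

  if (canVote && logOk) {
    state.votedFor = req.candidateId;
    return true;
  }
  return false;
}

const follower: NodeState = { currentTerm: 1, votedFor: null, lastLogIndex: 3, lastLogTerm: 1 };

const first = handleRequestVote(follower, { term: 2, candidateId: 'B', lastLogIndex: 3, lastLogTerm: 1 });
const second = handleRequestVote(follower, { term: 2, candidateId: 'C', lastLogIndex: 3, lastLogTerm: 1 });
const stale = handleRequestVote(follower, { term: 1, candidateId: 'A', lastLogIndex: 3, lastLogTerm: 1 });

console.log(first, second, stale); // true false false
```

The single-vote-per-term rule is what guarantees at most one leader per term: two candidates cannot both collect a majority in the same term.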

### Quorum
```
N = 5 nodes.
Majority = 3.

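Write quorum: 3 nodes must persist the entry before it is committed
Read: leader only (fast, may be stale if the leader was deposed) or leader confirmed by a majority (linearizable read)

→ During a network partition, the minority side cannot make progress.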
```
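
The partition rule above reduces to a one-line check. A minimal sketch; `quorumSize` and `canMakeProgress` are hypothetical helper names, not part of any library.

```typescript
// Majority quorum for a cluster of n voting members.
function quorumSize(n: number): number {
  return Math.floor(n / 2) + 1;
}

// A partition side can elect a leader and commit writes only if it holds a quorum.
function canMakeProgress(reachableNodes: number, clusterSize: number): boolean {
  return reachableNodes >= quorumSize(clusterSize);
}

// 5-node cluster split 3 / 2 by a network partition.
console.log(quorumSize(5));         // 3
console.log(canMakeProgress(3, 5)); // true  (majority side keeps working)
console.log(canMakeProgress(2, 5)); // false (minority side stalls)
```

Because two disjoint groups can never both reach a majority, at most one side of any partition accepts writes, which is what prevents split brain.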

### CAP theorem
```
Consistency: every node returns the same value.
Availability: every request gets a response.
Partition tolerance: the system survives a network partition.

→ During a network partition, choose either C or A.
```

```
CP:  ZooKeeper, etcd, MongoDB (default).
AP:  Cassandra, DynamoDB.
CA:  single node (no partitions possible).
```

### etcd (Raft, backing store for K8s)
```bash
# First member of a 3-node cluster (repeat on node2 / node3 with their own name and URLs)
etcd \
  --name node1 \
  --listen-peer-urls http://10.0.0.1:2380 \
  --listen-client-urls http://10.0.0.1:2379 \
  --advertise-client-urls http://10.0.0.1:2379 \
  --initial-advertise-peer-urls http://10.0.0.1:2380 \
  --initial-cluster node1=http://10.0.0.1:2380,node2=http://10.0.0.2:2380,node3=http://10.0.0.3:2380 \
  --initial-cluster-state new
```

```ts
import { Etcd3 } from 'etcd3';

const client = new Etcd3({ hosts: ['10.0.0.1:2379', '10.0.0.2:2379', '10.0.0.3:2379'] });

// Put
await client.put('/config/feature-x').value('enabled');

// Get
const value = await client.get('/config/feature-x').string();

// Watch
client.watch().key('/config/feature-x').create().then(watcher => {
  watcher.on('put', (v) => console.log('Changed:', v.value.toString()));
});

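// Lease (TTL)
const lease = client.lease(60);  // 60s TTL
await lease.put('/services/my-app/instance-1').value('healthy');
// etcd3 renews the lease automatically while this process runs; the key is
// deleted once renewals stop (e.g. a crash) and the TTL expires, or on lease.revoke().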
```

→ Stores K8s cluster state. Also used for service discovery.

### Consul
```ts
import Consul from 'consul';

const consul = new Consul();

// KV
await consul.kv.set('config/feature-x', 'enabled');
const entry = await consul.kv.get('config/feature-x');  // result object; the string is in entry.Value

// Service registration
await consul.agent.service.register({
  name: 'my-app',
  id: 'my-app-1',
  address: '10.0.0.1',
  port: 3000,
  check: {
    http: 'http://10.0.0.1:3000/health',
    interval: '10s',
  },
});

// Find healthy instances of a service
const services = await consul.health.service({ service: 'my-app', passing: true });
```

→ Service discovery + KV. Multi-DC.

### ZooKeeper (Zab)
```bash
# 3-node ZK ensemble.
# Java-based (older).

zkCli.sh
> create /myapp ""
> create /myapp/config "value"
> get /myapp/config
> ls /myapp
```

→ Cluster coordination for Kafka (before KRaft), HBase, and Hadoop.

### Leader election (Raft / etcd)
```ts
import { Etcd3 } from 'etcd3';

const client = new Etcd3();
const election = client.election('my-leader');
const campaign = election.campaign('node-1');

campaign.on('elected', () => {
  console.log('I am leader');
  startLeaderWork();  // application-specific; defined elsewhere
});

campaign.on('error', (err) => {
  console.error(err);
});
```

→ Only one node is the leader; the rest are followers.

### Use case: distributed cron
```
Cron job across N nodes, executed only once per schedule:

1. Leader election
2. Only the leader schedules the cron job
3. If the leader dies → new election

→ ZooKeeper / etcd / Redis lock.
```

```ts
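import { Etcd3 } from 'etcd3';

const election = new Etcd3().election('cron-leader');

// campaign() returns an event emitter, not a promise, and it keeps waiting
// until leadership is won, so wrap the 'elected' event in a promise.
function waitForLeadership(nodeId: string): Promise<void> {
  return new Promise((resolve, reject) => {
    const campaign = election.campaign(nodeId);
    campaign.on('elected', () => resolve());
    campaign.on('error', reject);
  });
}

await waitForLeadership('node-1');
scheduleCron();  // application-specific; defined elsewhere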
```

### Distributed lock (etcd / Redis)
```ts
// etcd lock primitive
const lock = client.lock('my-resource');
await lock.acquire();
try {
  await doWork();
} finally {
  await lock.release();
}
```

```ts
// Redis (Redlock)
import Redlock from 'redlock';

// redisA / redisB / redisC: ioredis clients for independent Redis instances, created elsewhere
const redlock = new Redlock([redisA, redisB, redisC]);
const resource = await redlock.acquire(['locks:my-resource'], 30_000);
try {
  await doWork();
} finally {
  await resource.release();
}
```

→ [[DB_Distributed_Locks]].

### Linearizability vs eventual
```
Linearizable: to an outside observer, behaves like a single node.
- etcd, ZooKeeper
- Spanner

Eventual: replicas converge eventually.
- Cassandra
- DynamoDB

→ Trade-off. CP vs AP.
```

### Two Generals / Byzantine
```
Two Generals: messages can be lost, so two parties can never be certain they agree.
Byzantine: nodes may lie, which is harder still.

Solutions:
- Raft / Paxos: assume honest nodes (crash faults only).
- BFT (Byzantine Fault Tolerance): tolerates adversarial nodes, e.g. Bitcoin / Ethereum.
- HotStuff, Tendermint: modern BFT.
```
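
One way to see why Byzantine faults are harder: the cluster has to be larger for the same number of tolerated faults. A small sketch, assuming the standard bounds of 2f + 1 nodes for crash-fault protocols such as Raft / Paxos and 3f + 1 for classic BFT protocols such as PBFT; the function names are hypothetical.

```typescript
// Minimum cluster size needed to tolerate f faulty nodes.
function minNodesCrashFault(f: number): number {
  return 2 * f + 1; // Raft / Paxos: faulty nodes only stop responding
}

function minNodesByzantine(f: number): number {
  return 3 * f + 1; // PBFT-style BFT: faulty nodes may send conflicting messages
}

for (const f of [1, 2, 3]) {
  console.log(`f=${f}: crash-fault needs ${minNodesCrashFault(f)}, Byzantine needs ${minNodesByzantine(f)}`);
}
```

Tolerating one faulty node therefore takes 3 nodes under the crash-fault model but 4 under the Byzantine model.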

### Bitcoin consensus (PoW)
```
Bitcoin = Byzantine consensus:
- Voting power = hash power (proof of work), not identities.
- The chain with the most accumulated work wins (usually the longest).
- Probabilistic finality (commonly 6 confirmations).

Energy-expensive; Ethereum moved to PoS (2022).
```
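
"Probabilistic finality" can be made concrete with the catch-up calculation from the Bitcoin whitepaper (section 11), ported here to TypeScript as a sketch. `q` is the attacker's share of hash power and `z` the number of confirmations; the function name is chosen for this note.

```typescript
// Probability that an attacker with hash-power share q ever overtakes the
// honest chain after z confirmations (port of the whitepaper's calculation).
function attackerSuccessProbability(q: number, z: number): number {
  const p = 1 - q;            // honest share of hash power
  const lambda = z * (q / p); // expected attacker progress while z blocks are mined
  let sum = 1;
  for (let k = 0; k <= z; k++) {
    // Poisson probability that the attacker has mined exactly k blocks.
    let poisson = Math.exp(-lambda);
    for (let i = 1; i <= k; i++) poisson *= lambda / i;
    // Subtract the cases where the attacker fails to catch up from k blocks.
    sum -= poisson * (1 - Math.pow(q / p, z - k));
  }
  return sum;
}

// Attacker with 10% of the hash power.
for (const z of [0, 1, 2, 6]) {
  console.log(`z=${z}: ${attackerSuccessProbability(0.1, z).toFixed(7)}`);
}
```

With a 10% attacker the probability drops from certainty at 0 confirmations to well under 0.1% at 6, which is why a transaction is treated as final in practice but never with absolute certainty.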

### Etcd vs Consul vs ZooKeeper
```
etcd:
+ K8s native
+ HTTP / gRPC
+ Modern
- Narrow scope (single purpose)

Consul:
+ Strong service discovery
+ Multi-DC
+ Health check
- Heavier dependency

ZooKeeper:
+ Mature (Hadoop / Kafka)
+ Very stable
- Java
- Less modern API
```

### Cluster size
```
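N = 2: no fault tolerance (quorum = 2, so one failure halts the cluster).
N = 3: tolerates 1 failure.
N = 5: tolerates 2 failures (recommended for larger clusters).
N = 7: tolerates 3 failures.

Avoid even N (2, 4, 6): same fault tolerance as N - 1, but a larger quorum.

→ Usually 3 or 5.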
```
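
The sizing table follows directly from the majority rule. A short sketch that derives it; `faultTolerance` and `rows` are hypothetical names.

```typescript
// Number of node failures a majority-quorum cluster of size n survives.
function faultTolerance(n: number): number {
  const quorum = Math.floor(n / 2) + 1;
  return n - quorum;
}

const rows = [1, 2, 3, 4, 5, 6, 7].map((n) => ({
  nodes: n,
  quorum: Math.floor(n / 2) + 1,
  tolerated: faultTolerance(n),
}));

for (const row of rows) {
  console.log(`N=${row.nodes} quorum=${row.quorum} tolerated failures=${row.tolerated}`);
}
// N=4 tolerates 1 failure, the same as N=3, while needing 3 acks instead of 2.
```

Going from an odd size to the next even size adds a node that must be kept healthy and raises the quorum without buying any extra failure tolerance.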

### Multi-region (cross-DC)
```
ZooKeeper / etcd are latency-sensitive (every write goes through consensus).
Cross-region round trip = 100ms+, so writes become very slow.

Mitigation:
- Keep the quorum in a single region
- Other regions = read replicas (eventually consistent)
```
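
A rough model of why quorum placement matters: the leader commits once enough followers have acknowledged to form a majority, so commit latency is set by the slowest follower it still has to wait for. The sketch below assumes network round trips dominate (disk fsync is ignored) and uses made-up RTT values; `commitLatencyMs` is a hypothetical helper.

```typescript
// Estimated commit latency given the leader's round-trip time to each follower.
function commitLatencyMs(followerRttMs: number[]): number {
  const clusterSize = followerRttMs.length + 1; // followers plus the leader
  const quorum = Math.floor(clusterSize / 2) + 1;
  const acksNeeded = quorum - 1;                // the leader counts toward the quorum
  if (acksNeeded === 0) return 0;
  const sorted = [...followerRttMs].sort((a, b) => a - b);
  // The leader waits for the acksNeeded-th fastest follower.
  return sorted[acksNeeded - 1];
}

// Illustrative RTTs in milliseconds (assumed values, not measurements).
console.log(commitLatencyMs([1, 2]));           // 3 nodes in one DC: 1
console.log(commitLatencyMs([80, 150]));        // 3 nodes in 3 regions: 80
console.log(commitLatencyMs([1, 2, 80, 150]));  // 5 nodes, majority local: 2
console.log(commitLatencyMs([1, 80, 90, 150])); // 5 nodes, majority remote: 80
```

As long as a majority sits close to the leader, distant members do not slow down writes; once the quorum has to cross regions, every write pays the cross-region round trip.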

### Operation
```
- Backup (regular snapshot)
- Disaster recovery (config restore)
- Monitoring (leader change, lag)
- Upgrade (rolling restart)
- Compaction (clean up old log entries)
```

### Failure scenarios
```
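1. Leader dies → election (about 1s or more with etcd defaults: 100ms heartbeat, 1000ms election timeout)
2. Network partition → minority side cannot make progress
3. Majority of nodes down → cluster unavailable
4. Disk full → writes fail
5. Clock skew → election / lease timing issues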
```

### Real-world apps
```
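K8s:        etcd
Consul:     service mesh / discovery
ZK:         Kafka (before KRaft), Hadoop, HBase
Apache Kafka: built-in Raft (KRaft; production-ready since 3.3, ZooKeeper removed in 4.0)
CockroachDB: built-in Raft
TiDB:       TiKV (built-in Raft), PD (embedded etcd)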
```

### Implementing Raft (for learning)
```
Raft paper: https://raft.github.io
Visualization: https://thesecretlivesofdata.com/raft/

Implementing it yourself = for learning (do not run it in production).
hashicorp/raft (Go), MIT 6.824 lab.
```

### When NOT to use
```
- A single node is enough (small app)
- Stateless service (no consensus needed)
- Only a simple leader is needed: a Redis lock may be enough
- Strong consistency not required: eventual is OK
```

### Saga (an alternative, not consensus)
```
Distributed transaction:
- 2PC: blocking, slow
- Saga: compensating, fast

→ [[Backend_Saga_Patterns]].
```

### Modern: KRaft (Kafka)
```
Kafka depended on ZooKeeper → KRaft (built-in Raft; production-ready since Kafka 3.3 in 2022, ZooKeeper mode removed in 4.0).
Single binary. Simpler ops.
```

### Time
```
Leader election: roughly 1s+ with etcd defaults (the Raft paper suggests 150-300ms election timeouts).
Write commit: 1-10ms (single DC).
Cross-DC: 100ms+.

→ Fast = same DC.
```
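
Election time depends largely on the election timeout, which Raft randomizes per node so that followers rarely time out together and split the vote. A minimal sketch, assuming the 150-300ms range suggested in the Raft paper; `electionTimeoutMs` and the injectable `random` parameter are hypothetical and exist only to keep the example testable.

```typescript
// Pick a randomized election timeout in [minMs, maxMs).
function electionTimeoutMs(
  minMs: number,
  maxMs: number,
  random: () => number = Math.random,
): number {
  return minMs + Math.floor(random() * (maxMs - minMs));
}

// Each follower draws its own timeout; the first one to expire becomes candidate.
const timeouts = ['A', 'B', 'C'].map((id) => ({ id, timeout: electionTimeoutMs(150, 300) }));
const firstCandidate = timeouts.reduce((a, b) => (b.timeout < a.timeout ? b : a));
console.log(timeouts, 'first candidate:', firstCandidate.id);
```

Production systems usually pick larger values than the paper (etcd defaults to a 1000ms election timeout) to avoid spurious elections on slow disks or busy networks.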

### Use cases
```
✅ Service discovery
✅ Configuration store
✅ Leader election (distributed cron)
✅ Distributed lock
✅ Coordination (cluster size)
✅ K8s state

❌ High-throughput data (Cassandra)
❌ Big files (S3)
❌ Cache (Redis)
```

### Failure tolerance
```
3-node etcd: tolerates 1 failure.
2 failures = quorum lost (cluster unavailable); all 3 failing = risk of data loss.

→ 3+ nodes recommended. 5 is stable.
```

### Learning resources
```
- Raft paper (raft.github.io)
- "The Secret Lives of Data" (visual)
- Designing Data-Intensive Applications (book)
- Distributed Systems by Tanenbaum
- etcd / Consul docs
```

## 🤔 Decision Criteria
| Task | Recommendation |
|---|---|
| K8s | etcd (built-in) |
| Service discovery | Consul |
| Java ecosystem | ZooKeeper |
| Distributed lock | etcd / Redis Redlock |
| Cluster state | etcd / Consul |
| Small + simple | Redis lock |

## ❌ Anti-patterns
- **2-node consensus**: no fault tolerance (quorum is 2 of 2).
- **Even N**: same fault tolerance, larger quorum.
- **Single quorum spanning regions**: writes become very slow.
- **No disk-full monitoring**: leader gets stuck.
- **No backups**: snapshot lost = cluster lost.
- **Everything in etcd**: not suited to high-throughput data.

## 🤖 LLM Usage Hints
- 3 or 5 nodes.
- Raft is the modern choice.
- etcd / Consul = standard.
- Cross-region = single-region quorum + read replicas.

## 🔗 Related Documents
- [[CS_Eventual_Consistency]]
- [[DB_Distributed_Locks]]
- [[Backend_Service_Discovery]]