2nd/10_Wiki/Topics/Coding/CS_Distributed_Consensus.md
2026-05-09 22:47:42 +09:00

id: cs-distributed-consensus
title: Distributed Consensus: Raft / Paxos / Leader Election
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: cs, distributed, consensus, vibe-coding
tech_stack: language = Concept, applicable_to = Backend
applied_in:
aliases: Raft, Paxos, leader election, etcd, ZooKeeper, consensus, quorum

Distributed Consensus

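N nodes agree on the same decision (a leader, a value). The main protocols are Raft (modern, designed to be understandable), Paxos (the classic), and Zab (ZooKeeper); etcd, Consul, and ZooKeeper are widely used implementations. The CAP theorem frames the trade-offs.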

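📖 Core concepts

  • Consensus: all nodes agree on the same value.
  • Quorum: a majority, floor(N/2) + 1 nodes.
  • Leader election: the nodes pick a single leader to order writes.
  • Log replication: the leader replicates an ordered log of writes to the followers.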

💻 Code patterns

Why consensus

In a distributed system:
- Which node is the primary?
- Which value is the latest?
- Has everyone agreed on a configuration change?

→ A consensus protocol answers these.
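
Raft roles:
- Leader: accepts writes
- Follower: follows the leader
- Candidate: running for leader during an election

Election:
1. A follower hears no leader heartbeat within its election timeout → becomes a candidate
2. It increments its term and votes for itself
3. It sends RequestVote RPCs to the other nodes
4. Votes from a majority → it becomes leader
5. It starts sending AppendEntries (heartbeats)

Log replication:
1. Client → leader
2. Leader appends the entry to its log
3. AppendEntries → followers
4. Acks from a majority → the entry is committed
5. Leader applies the entry and responds to the client

→ Raft is often described as an "understandable" Paxos.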
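
Step 4 of log replication (an entry is committed once a majority holds it) can be sketched as a small function. This is only an illustration under stated assumptions, not code from any Raft library: `majorityMatchIndex` and its parameter are hypothetical names, and the sketch leaves out Raft's additional rule that a leader only commits entries from its own term by counting replicas.

```typescript
// Hypothetical helper: the highest log index stored on a majority of nodes.
// matchIndexes holds, for every node including the leader, the highest
// log index known to be replicated on that node.
function majorityMatchIndex(matchIndexes: number[]): number {
  if (matchIndexes.length === 0) {
    throw new Error("cluster must have at least one node");
  }
  const quorum = Math.floor(matchIndexes.length / 2) + 1;
  // Sort descending: the value at position quorum - 1 is held by at least `quorum` nodes.
  const sorted = [...matchIndexes].sort((a, b) => b - a);
  return sorted[quorum - 1];
}

// 5 nodes: leader at index 7, followers at 7, 5, 4, 2.
console.log(majorityMatchIndex([7, 7, 5, 4, 2])); // 5
```

Three of the five nodes hold index 5 or higher, so everything up to index 5 can be treated as committed even though two followers lag behind.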

Raft term

A term is a monotonically increasing counter.
Every election starts a new term.

Term 0: start
Term 1: leader A
Term 2: leader B (A died)
...
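
Terms are what let a node tell a stale candidate from a current one. The sketch below shows how a node might answer a RequestVote RPC following the rules in the Raft paper; the type and function names (`NodeState`, `VoteRequest`, `handleRequestVote`) are hypothetical, and persistence and networking are left out.

```typescript
interface NodeState {
  currentTerm: number;
  votedFor: string | null;
  lastLogTerm: number;
  lastLogIndex: number;
}

interface VoteRequest {
  term: number;
  candidateId: string;
  lastLogTerm: number;
  lastLogIndex: number;
}

// Returns true if this node grants its vote to the candidate.
function handleRequestVote(state: NodeState, req: VoteRequest): boolean {
  // A candidate from an older term is always rejected.
  if (req.term < state.currentTerm) return false;

  // A newer term moves this node forward and clears its previous vote.
  if (req.term > state.currentTerm) {
    state.currentTerm = req.term;
    state.votedFor = null;
  }

  // The candidate's log must be at least as up to date as ours.
  const logOk =
    req.lastLogTerm > state.lastLogTerm ||
    (req.lastLogTerm === state.lastLogTerm && req.lastLogIndex >= state.lastLogIndex);

  // At most one vote per term.
  const canVote = state.votedFor === null || state.votedFor === req.candidateId;
  if (logOk && canVote) {
    state.votedFor = req.candidateId;
    return true;
  }
  return false;
}

const node: NodeState = { currentTerm: 1, votedFor: null, lastLogTerm: 1, lastLogIndex: 3 };
console.log(handleRequestVote(node, { term: 2, candidateId: "B", lastLogTerm: 1, lastLogIndex: 3 })); // true
console.log(handleRequestVote(node, { term: 2, candidateId: "C", lastLogTerm: 1, lastLogIndex: 5 })); // false
```

The second request is refused because the node already voted for B in term 2, which is what keeps two leaders from being elected in the same term.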

Quorum

N = 5 nodes.
Majority = 3.

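Write quorum: 3 nodes must persist the entry before it commits.
Read: served by the leader alone (fast, but possibly stale if the leader has been deposed), or confirmed with a majority first (linearizable read).

→ During a network partition, the minority side cannot make progress.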
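
The partition rule above is simple arithmetic. A minimal sketch, with hypothetical function names:

```typescript
// Majority quorum for a cluster of n voting members.
function quorumSize(n: number): number {
  return Math.floor(n / 2) + 1;
}

// One side of a partition can commit writes only if it still reaches a quorum.
function canCommit(reachableNodes: number, clusterSize: number): boolean {
  return reachableNodes >= quorumSize(clusterSize);
}

// A 5-node cluster split 3 / 2 by a network partition.
console.log(canCommit(3, 5)); // true: the majority side keeps working
console.log(canCommit(2, 5)); // false: the minority side must refuse writes
```

Because two disjoint groups can never both hold a majority, at most one side of a partition can accept writes.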

CAP theorem

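Consistency: every node sees the same value.
Availability: every request gets a response.
Partition tolerance: the system keeps working through a network partition.

→ During a network partition you must choose between C and A.
CP:  ZooKeeper, etcd, MongoDB (default).
AP:  Cassandra, DynamoDB.
CA:  a single node (no partition possible).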

etcd (Raft, the backing store for K8s)

# Start node1 of a 3-node cluster (run the equivalent on node2 and node3)
etcd \
  --name node1 \
  --listen-peer-urls http://10.0.0.1:2380 \
  --listen-client-urls http://10.0.0.1:2379 \
  --advertise-client-urls http://10.0.0.1:2379 \
  --initial-advertise-peer-urls http://10.0.0.1:2380 \
  --initial-cluster node1=http://10.0.0.1:2380,node2=http://10.0.0.2:2380,node3=http://10.0.0.3:2380 \
  --initial-cluster-state new

import { Etcd3 } from 'etcd3';

const client = new Etcd3({ hosts: ['10.0.0.1:2379', '10.0.0.2:2379', '10.0.0.3:2379'] });

// Put
await client.put('/config/feature-x').value('enabled');

// Get
const value = await client.get('/config/feature-x').string();

// Watch
client.watch().key('/config/feature-x').create().then(watcher => {
  watcher.on('put', (v) => console.log('Changed:', v.value.toString()));
});

// Lease (TTL)
const lease = client.lease(60);  // 60 s TTL
await lease.put('/services/my-app/instance-1').value('healthy');
// The etcd3 client keeps the lease alive automatically; the key is removed
// once the lease expires (for example, if this process dies) or is revoked.

→ Stores K8s cluster state; also used for service discovery.

Consul

import Consul from 'consul';

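// Assumes a consul client version whose methods return promises.
const consul = new Consul(); // defaults to the local agent at 127.0.0.1:8500

// KV
await consul.kv.set('config/feature-x', 'enabled');
// Resolves to an entry object; the stored string is in entry.Value
const entry = await consul.kv.get('config/feature-x');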

// Service registration
await consul.agent.service.register({
  name: 'my-app',
  id: 'my-app-1',
  address: '10.0.0.1',
  port: 3000,
  check: {
    http: 'http://10.0.0.1:3000/health',
    interval: '10s',
  },
});

// Find service
const services = await consul.health.service('my-app');

→ Service discovery + KV. Multi-DC.

ZooKeeper (Zab)

# 3-node ZK ensemble.
# Java-based (older).

zkCli.sh
> create /myapp ""
> create /myapp/config "value"
> get /myapp/config
> ls /myapp

→ Cluster coordination for Kafka (before KRaft), HBase, and Hadoop.

Leader election (Raft / etcd)

import { Etcd3 } from 'etcd3';

const client = new Etcd3();
const election = client.election('my-leader');
const campaign = election.campaign('node-1');

campaign.on('elected', () => {
  console.log('I am leader');
  startLeaderWork();
});

campaign.on('error', (err) => {
  console.error(err);
});

→ Only one node is leader at a time; the rest are followers.

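Use case: distributed cron

A cron job deployed on N nodes should run only once:

1. Leader election
2. Only the leader schedules the cron job
3. If the leader dies → a new election

→ ZooKeeper / etcd / Redis lock.

// Reuses `election` from the leader-election example above.
// 'cron-node-1' is this node's identity in the election, not the election name.
// wait() resolves once this node becomes leader; it never resolves to false.
async function becomeLeader(): Promise<void> {
  await election.campaign('cron-node-1').wait();
}

await becomeLeader();
scheduleCron();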

Distributed lock (etcd / Redis)

// etcd lock primitive
const lock = client.lock('my-resource');
await lock.acquire();
try {
  await doWork();
} finally {
  await lock.release();
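}

// Redis (Redlock). redisA, redisB, and redisC are assumed to be clients
// (e.g. ioredis) connected to three independent Redis instances.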
import Redlock from 'redlock';

const redlock = new Redlock([redisA, redisB, redisC]);
const resource = await redlock.acquire(['locks:my-resource'], 30_000);
try {
  await doWork();
} finally {
  await resource.release();
}

→ [[DB_Distributed_Locks]].

Linearizability vs eventual

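Linearizable: to an outside observer the system behaves like a single node.
- etcd, ZooKeeper (writes; ZooKeeper reads are sequentially consistent unless preceded by sync)
- Spanner

Eventual: replicas converge to the same value eventually.
- Cassandra
- DynamoDB

→ A trade-off: CP vs AP.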

Two Generals / Byzantine

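Two Generals: messages can be lost on the network, so guaranteed agreement over an unreliable channel is impossible.
Byzantine: nodes may lie, which is harder still.

Solutions:
- Raft / Paxos: assume honest nodes (crash faults only).
- BFT (Byzantine Fault Tolerance): tolerates adversarial nodes, as in Bitcoin / Ethereum.
- HotStuff, Tendermint: modern BFT.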
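
The cost of tolerating lying nodes shows up in cluster sizing. Crash-fault protocols such as Raft and Paxos need n ≥ 2f + 1 nodes to survive f failures, while classic BFT protocols need n ≥ 3f + 1. A small sketch of that arithmetic (function names are hypothetical):

```typescript
// Crash faults only (Raft, Paxos): n >= 2f + 1.
function maxCrashFaults(n: number): number {
  return Math.floor((n - 1) / 2);
}

// Byzantine faults (classic BFT protocols): n >= 3f + 1.
function maxByzantineFaults(n: number): number {
  return Math.floor((n - 1) / 3);
}

for (const n of [3, 4, 5, 7]) {
  console.log(`n=${n} crash=${maxCrashFaults(n)} byzantine=${maxByzantineFaults(n)}`);
}
// n=3 crash=1 byzantine=0
// n=4 crash=1 byzantine=1
// n=5 crash=2 byzantine=1
// n=7 crash=3 byzantine=2
```

A 3-node cluster that comfortably survives one crash cannot survive even one Byzantine node; that takes at least four.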

Bitcoin consensus (PoW)

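Bitcoin reaches consensus among untrusted (Byzantine) nodes:
- Voting power is proportional to hash power (proof of work), not to identity.
- The chain with the most accumulated work wins (usually the longest).
- Finality is probabilistic (6 confirmations is the common rule of thumb).

The energy cost is high; Ethereum moved to PoS in 2022.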
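
To make "proof of work" concrete, here is a toy sketch using Node's built-in `node:crypto` module. It is a simplification, not Bitcoin's actual scheme: Bitcoin double-hashes a binary block header against a numeric target, whereas this sketch hashes a string once and counts leading zero hex digits. `mine`, `sha256Hex`, and the block data are hypothetical.

```typescript
import { createHash } from "node:crypto";

function sha256Hex(input: string): string {
  return createHash("sha256").update(input).digest("hex");
}

// Try nonces until the hash starts with `difficulty` zero hex digits.
// Each extra digit makes the search about 16 times more expensive.
function mine(blockData: string, difficulty: number): { nonce: number; hash: string } {
  const prefix = "0".repeat(difficulty);
  for (let nonce = 0; ; nonce++) {
    const hash = sha256Hex(`${blockData}:${nonce}`);
    if (hash.startsWith(prefix)) return { nonce, hash };
  }
}

const result = mine("hypothetical block", 3); // about 4096 attempts on average
console.log(result.nonce, result.hash);
```

Finding the nonce is expensive, but anyone can verify it with a single hash, which is the asymmetry the protocol relies on.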

Etcd vs Consul vs ZooKeeper

etcd:
+ K8s native
+ HTTP / gRPC
+ Modern
- Narrow scope (single purpose)

Consul:
+ Strong service discovery
+ Multi-DC
+ Health checks
- Heavier dependency

ZooKeeper:
+ Mature (Hadoop / Kafka)
+ Very stable
- Java
- Less modern API

Cluster size

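N = 2: no fault tolerance (quorum is 2, so losing either node stops writes).
N = 3: tolerates 1 failure.
N = 5: tolerates 2 failures (recommended for larger clusters).
N = 7: tolerates 3 failures.

Avoid even N (2, 4, 6): same fault tolerance as N - 1, but a larger quorum.

→ Usually 3 or 5.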
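
The numbers above follow directly from the majority rule. A short sketch (hypothetical function names) that prints the quorum and tolerated failures for each cluster size:

```typescript
function majority(n: number): number {
  return Math.floor(n / 2) + 1;
}

// Nodes that can fail while a majority is still reachable.
function toleratedFailures(n: number): number {
  return n - majority(n);
}

for (let n = 2; n <= 7; n++) {
  console.log(`N=${n} quorum=${majority(n)} tolerates=${toleratedFailures(n)}`);
}
// N=2 quorum=2 tolerates=0
// N=3 quorum=2 tolerates=1
// N=4 quorum=3 tolerates=1
// N=5 quorum=3 tolerates=2
// N=6 quorum=4 tolerates=2
// N=7 quorum=4 tolerates=3
```

Going from 3 to 4 nodes raises the quorum without raising the number of tolerated failures, which is why odd sizes are preferred.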

Multi-region (cross-DC)

ZooKeeper / etcd are latency-sensitive (every write goes through consensus).
Cross-region round trips are 100 ms+, so writes become very slow.

Mitigation:
- Keep the quorum within a single region
- Treat other regions as read replicas (eventually consistent)
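
Why quorum placement matters can be estimated with a rough model: a write commits when the leader has acks from enough followers to form a majority, so its latency is about the round-trip time to the slowest follower in that majority. The sketch below assumes this simplified model (it ignores disk sync and processing time); the function name and the RTT values are hypothetical.

```typescript
// followerRttMs: round-trip time from the leader to each follower, in ms.
function estimateCommitLatencyMs(followerRttMs: number[]): number {
  const clusterSize = followerRttMs.length + 1; // followers + leader
  const quorum = Math.floor(clusterSize / 2) + 1;
  const acksNeeded = quorum - 1; // the leader counts toward the quorum
  if (acksNeeded === 0) return 0;
  const sorted = [...followerRttMs].sort((a, b) => a - b);
  // The commit waits for the acksNeeded-th fastest follower.
  return sorted[acksNeeded - 1];
}

// 5 nodes, two followers in the leader's region: the quorum stays local.
console.log(estimateCommitLatencyMs([1, 2, 110, 150])); // 2
// 5 nodes, only one follower nearby: every write waits on a remote region.
console.log(estimateCommitLatencyMs([2, 110, 120, 150])); // 110
```

In the first layout the remote nodes still receive every entry, but they are not on the critical path of a commit.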

Operation

- Backup (regular snapshot)
- Disaster recovery (config restore)
- Monitoring (leader change, lag)
- Upgrade (rolling restart)
- Compaction (clean up old log entries)

Failure scenarios

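1. Leader dies → new election (roughly one election timeout plus a round of voting; etcd defaults to a 1 s election timeout, though clients may see a longer interruption)
2. Network partition → the minority side cannot make progress
3. A majority of nodes down → cluster unavailable
4. Disk full → writes fail
5. Clock skew → can disturb timeouts and lease-based reads (Raft safety itself does not rely on synchronized clocks)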

Real-world apps

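K8s:          etcd
Consul:       service mesh / discovery
ZK:           Kafka (before KRaft), Hadoop, HBase
Apache Kafka: built-in Raft (KRaft; production-ready since 3.3, ZooKeeper removed in 4.0)
CockroachDB:  built-in Raft
TiDB:         TiKV replicates data with Raft; PD embeds etcd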

Implementing Raft (for learning)

Raft paper: https://raft.github.io
Visualization: https://thesecretlivesofdata.com/raft/

Implementing it yourself is for learning (do not run a homegrown implementation in production).
References: hashicorp/raft (Go), MIT 6.824 lab.

When NOT to use

- A single node is enough (small app)
- Stateless service (no consensus needed)
- You only need a simple leader: a Redis lock is enough
- Strong consistency is not required: eventual consistency is OK

Saga (an alternative that is not consensus)

Distributed transactions:
- 2PC: blocking, slow
- Saga: compensating actions, fast

→ [[Backend_Saga_Patterns]].
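
The "compensating" idea can be sketched in a few lines: run each step in order and, if one fails, undo the steps that already succeeded in reverse order. This is a minimal in-process illustration with hypothetical names (`SagaStep`, `runSaga`); a real saga would also persist its progress so that it can resume after a crash.

```typescript
interface SagaStep {
  name: string;
  action: () => Promise<void>;
  compensate: () => Promise<void>;
}

// Runs steps in order; on failure, compensates the completed steps in reverse.
async function runSaga(steps: SagaStep[]): Promise<boolean> {
  const completed: SagaStep[] = [];
  for (const step of steps) {
    try {
      await step.action();
      completed.push(step);
    } catch {
      for (const done of completed.reverse()) {
        await done.compensate();
      }
      return false;
    }
  }
  return true;
}

const log: string[] = [];
const steps: SagaStep[] = [
  {
    name: "reserve",
    action: async () => { log.push("reserve"); },
    compensate: async () => { log.push("undo reserve"); },
  },
  {
    name: "charge",
    action: async () => { throw new Error("card declined"); },
    compensate: async () => { log.push("undo charge"); },
  },
];

runSaga(steps).then((ok) => console.log(ok, log)); // false [ 'reserve', 'undo reserve' ]
```

No participant is ever blocked waiting for a coordinator, which is the main contrast with 2PC; the price is that intermediate states are briefly visible.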

Modern: KRaft (Kafka)

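Kafka used to depend on ZooKeeper → KRaft (built-in Raft; production-ready since Kafka 3.3 in 2022, and the only mode from Kafka 4.0 in 2025).
Single binary. Simpler ops.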

Time

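Leader election: on the order of the election timeout (1 s default in etcd; the Raft paper suggests 150-300 ms), often longer as seen by clients.
Write commit: 1-10 ms (single DC).
Cross-DC: 100 ms+.

→ For speed, keep the cluster in one DC.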
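
Election time is governed mostly by the election timeout, and Raft randomizes that timeout per node so that followers rarely become candidates at the same moment and split the vote. A minimal sketch, assuming the 150-300 ms range suggested in the Raft paper (the function name and the injectable `random` parameter are hypothetical):

```typescript
// Pick a timeout in [minMs, maxMs) so nodes rarely time out together.
function randomElectionTimeoutMs(
  minMs: number,
  maxMs: number,
  random: () => number = Math.random,
): number {
  return minMs + Math.floor(random() * (maxMs - minMs));
}

// Each node draws its own timeout; the node with the smallest value
// usually becomes a candidate first and wins before the others time out.
const timeouts = ["A", "B", "C"].map((id) => ({
  id,
  timeoutMs: randomElectionTimeoutMs(150, 300),
}));
console.log(timeouts);
```

Production systems usually pick larger values than the paper's range to ride out network jitter, at the cost of slower failover.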

Use cases

✅ Service discovery
✅ Configuration store
✅ Leader election (distributed cron)
✅ Distributed lock
✅ Coordination (cluster size)
✅ K8s state

❌ High-throughput data (use Cassandra or similar)
❌ Big files (use S3 or similar)
❌ Cache (use Redis or similar)

Failure tolerance

3-node etcd: tolerates 1 failure.
Losing 2 nodes loses quorum; losing all 3 risks data loss.

→ 3+ nodes recommended. 5 is more robust.

Learning resources

- Raft paper (raft.github.io)
- "The Secret Lives of Data" (visual)
- Designing Data-Intensive Applications (book)
- Distributed Systems (book, van Steen and Tanenbaum)
- etcd / Consul docs

🤔 Decision criteria

| Task | Recommendation |
|---|---|
| K8s | etcd (built-in) |
| Service discovery | Consul |
| Java ecosystem | ZooKeeper |
| Distributed lock | etcd / Redis Redlock |
| Cluster state | etcd / Consul |
| Small + simple | Redis lock |

Anti-patterns

  • 2-node consensus: no fault tolerance (quorum is 2).
  • Even N: same fault tolerance as N - 1, with a larger quorum.
  • Single quorum spanning regions: writes are very slow.
  • No disk-full monitoring: the leader gets stuck.
  • No backups: a lost snapshot means a lost cluster.
  • Putting everything in etcd: not suited to high-throughput data.

🤖 LLM usage hints

  • 3 or 5 nodes.
  • Raft is the modern choice.
  • etcd / Consul = standard.
  • Cross-region = single-region quorum + read replicas.

🔗 Related documents