| id | title | category | status | source_trust_level | verification_status | created_at | updated_at | tags | tech_stack | applied_in | aliases |
|---|---|---|---|---|---|---|---|---|---|---|---|
| db-replica-operations | Replica Operations: Streaming / Lag / Failover | Coding | draft | B | conceptual | 2026-05-09 | 2026-05-09 | | | | |
Replica Operations
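Running read replicas in production takes three things: lag monitoring, automated failover, and WAL retention management. Patroni, repmgr, RDS, and Aurora automate most of this.
📖 Core Concepts
- Streaming replication: the primary streams WAL to the standby.
- Synchronous: a commit waits for the replica to acknowledge (safer but slower).
- Asynchronous: the primary does not wait (the usual choice).
- Hot standby: the standby accepts read-only queries.
💻 Code Patterns
Primary configuration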
# postgresql.conf
wal_level = replica # or logical
max_wal_senders = 10
wal_keep_size = 1GB # or use a replication slot instead
hot_standby = on # read by standbys; has no effect on the primary itself
# pg_hba.conf
host replication replicator <standby-ip>/32 md5
Replication slot (WAL retention)
SELECT pg_create_physical_replication_slot('standby1');
→ WAL is retained even while the standby is disconnected.
⚠️ If a standby stays down permanently, WAL accumulates without bound. Drop unused slots.
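The warning above can be turned into a routine check. The following is a minimal sketch, not a production tool: `findStaleSlots` is a hypothetical helper, the 10 GiB threshold is an arbitrary assumption, and the input rows are assumed to come from a query against `pg_replication_slots` like the one in the comment.

```javascript
// Hypothetical helper: flag replication slots that look abandoned.
// Rows are assumed to come from a query such as:
//   SELECT slot_name, active,
//          pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
//   FROM pg_replication_slots;
function findStaleSlots(slots, maxRetainedBytes) {
  return slots
    // inactive AND holding back more WAL than we tolerate
    .filter((s) => !s.active && Number(s.retained_bytes) > maxRetainedBytes)
    .map((s) => s.slot_name);
}

// Sample rows (made up for illustration).
const rows = [
  { slot_name: 'standby1', active: true, retained_bytes: '1048576' },
  { slot_name: 'old_standby', active: false, retained_bytes: '53687091200' },
  { slot_name: 'standby2', active: false, retained_bytes: '1024' },
];

const stale = findStaleSlots(rows, 10 * 1024 ** 3); // 10 GiB threshold (assumed)
console.log(stale); // candidates for pg_drop_replication_slot(...)
```

An inactive slot is only a candidate: confirm the standby is really retired before dropping its slot, because a standby that comes back after the drop needs a fresh base backup.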
Standby setup
# Take the base snapshot with pg_basebackup
pg_basebackup -h primary -D /var/lib/postgresql/data \
-U replicator -P -R -X stream -S standby1
# -R writes standby.signal and primary_conninfo automatically
→ The standby begins streaming as soon as it starts.
Lag monitoring
-- On the primary
SELECT
application_name, client_addr, state, sync_state,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn)) AS sent_lag,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag,
EXTRACT(EPOCH FROM (NOW() - reply_time)) AS reply_seconds_ago
FROM pg_stat_replication;
-- On the standby
SELECT
pg_is_in_recovery(),
pg_last_wal_replay_lsn(),
NOW() - pg_last_xact_replay_timestamp() AS lag;
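→ Lag above 5 s is a warning; above 1 min is critical. On an idle primary the timestamp-based figure keeps growing even though no data is missing, so check the LSN-based lag as well.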
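The thresholds above map directly onto a small classifier. This is a sketch under the assumption that the caller passes the standby's lag in seconds (for example, the `lag` column from the standby query converted to a number); `classifyLag` is a hypothetical name.

```javascript
// Classify replication lag using the 5 s / 1 min thresholds from this note.
// lagSeconds may be null when the standby has not replayed any transaction yet.
function classifyLag(lagSeconds) {
  if (lagSeconds == null || Number.isNaN(lagSeconds)) return 'unknown';
  if (lagSeconds > 60) return 'critical';
  if (lagSeconds > 5) return 'warning';
  return 'ok';
}

console.log(classifyLag(2));    // 'ok'
console.log(classifyLag(12.5)); // 'warning'
console.log(classifyLag(300));  // 'critical'
console.log(classifyLag(null)); // 'unknown'
```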
Lag alarm
- alert: ReplicationLagHigh
  expr: pg_replication_lag_seconds > 30
  for: 2m
- alert: ReplicationStopped
  expr: pg_replication_lag_seconds > 600
  for: 1m
  labels: { severity: critical }
Failover (automatic)
1. The primary dies.
2. An automation tool (Patroni / repmgr) detects the failure.
3. The most advanced standby is promoted.
4. The remaining standbys follow the new primary.
5. The app discovers the new primary (DNS / VIP / pgbouncer).
Patroni
# patroni.yml
scope: postgres-cluster
namespace: /service/
restapi:
  listen: 0.0.0.0:8008
etcd:
  hosts: etcd1:2379, etcd2:2379, etcd3:2379
bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576 # 1MB
postgresql:
  listen: 0.0.0.0:5432
  data_dir: /var/lib/postgresql/data
  authentication:
    replication:
      username: replicator
      password: ...
→ etcd / Consul handles leader election.
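To make step 3 of the failover sequence and the `maximum_lag_on_failover` setting more concrete, here is a simplified sketch of "promote the most advanced standby that is not too far behind". It is not Patroni's actual election logic, which also involves the DCS leader lock; `lsnToBigInt` and `pickPromotionCandidate` are hypothetical names, and the LSN values are made up.

```javascript
// Convert a PostgreSQL LSN such as '0/3000060' into a BigInt byte position.
// An LSN is two hex numbers: the high 32 bits, a slash, then the low 32 bits.
function lsnToBigInt(lsn) {
  const [hi, lo] = lsn.split('/');
  return (BigInt('0x' + hi) << 32n) + BigInt('0x' + lo);
}

// Pick the standby with the highest replay position, skipping any standby
// whose distance from the last known primary position exceeds maxLagBytes.
function pickPromotionCandidate(standbys, primaryLsn, maxLagBytes) {
  const limit = BigInt(maxLagBytes);
  const head = lsnToBigInt(primaryLsn);
  let best = null;
  for (const s of standbys) {
    const pos = lsnToBigInt(s.replayLsn);
    if (head - pos > limit) continue; // too far behind to be promoted
    if (best === null || pos > best.pos) best = { name: s.name, pos };
  }
  return best === null ? null : best.name;
}

const standbys = [
  { name: 'a', replayLsn: '0/4FFFF00' }, // 256 bytes behind
  { name: 'b', replayLsn: '0/4F00000' }, // exactly 1 MiB behind
  { name: 'c', replayLsn: '0/3000000' }, // 32 MiB behind, excluded
];
console.log(pickPromotionCandidate(standbys, '0/5000000', 1048576));
```

Returning `null` when every standby exceeds the limit mirrors the trade-off behind the setting: refusing to promote keeps the cluster down longer but avoids silently losing more than the configured amount of WAL.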
App-side endpoint
Patroni REST API
GET /master → current primary (newer Patroni releases name this /primary)
GET /replica → standby
Or HAProxy with Patroni health checks
// App: failover-friendly connections (node-postgres)
const { Pool } = require('pg');
const writer = new Pool({ connectionString: 'postgresql://primary:5432/...' });
const reader = new Pool({ connectionString: 'postgresql://replica:5432/...' });
// Or a single load-balancer endpoint
const pool = new Pool({ connectionString: 'postgresql://lb-haproxy:5000/...' });
Synchronous replication (optional)
# postgresql.conf
synchronous_commit = on
synchronous_standby_names = 'ANY 1 (standby1, standby2)'
# A commit waits until at least one of the listed replicas acknowledges it
→ Higher safety, higher latency.
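The quorum rule in `synchronous_standby_names` can be modelled in a few lines. This sketch only handles the `ANY n (...)` form shown above and only mimics the decision "may this commit return yet?"; `parseAnyQuorum` and `commitCanReturn` are hypothetical names, not PostgreSQL APIs.

```javascript
// Parse a quorum-style value such as 'ANY 1 (standby1, standby2)'.
// Only the ANY form is handled; FIRST and plain lists are out of scope here.
function parseAnyQuorum(value) {
  const m = /^\s*ANY\s+(\d+)\s*\(([^)]*)\)\s*$/i.exec(value);
  if (!m) throw new Error('unsupported synchronous_standby_names: ' + value);
  const names = m[2].split(',').map((n) => n.trim()).filter(Boolean);
  return { required: Number(m[1]), names };
}

// A commit may return once at least `required` of the listed standbys acked.
function commitCanReturn(value, ackedStandbys) {
  const { required, names } = parseAnyQuorum(value);
  const acks = names.filter((n) => ackedStandbys.includes(n)).length;
  return acks >= required;
}

const setting = 'ANY 1 (standby1, standby2)';
console.log(commitCanReturn(setting, ['standby2'])); // one listed standby acked
console.log(commitCanReturn(setting, []));           // nobody acked: commit keeps waiting
```

The second call is the failure mode to plan for: with a single listed standby, losing that standby makes every commit wait, which is why listing two or more candidates is the safer setup.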
Logical replication (different schema / partial)
-- Primary (publisher); requires wal_level = logical
CREATE PUBLICATION app_pub FOR TABLE orders, users;
-- Subscriber
CREATE SUBSCRIPTION app_sub
CONNECTION 'host=primary user=replicator dbname=app'
PUBLICATION app_pub;
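→ Tolerates schema differences between publisher and subscriber, supports cross-version migration, and replicates only the selected tables.
Read-after-write (working around replica lag)
// Right after a user's own write, read from the primary.
// `primary` and `replica` are pg Pools; Redis holds the user's last write time (ms).
async function getOrders(userId) {
  const recentWrite = await redis.get(`recent:${userId}`);
  const wroteRecently = recentWrite !== null && Date.now() - Number(recentWrite) < 5000;
  const db = wroteRecently ? primary : replica;
  return db.query('SELECT * FROM orders WHERE user_id = $1', [userId]);
}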
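The routing decision itself can be isolated from Redis and the database so that it is easy to test. In this sketch an in-memory `Map` stands in for the `recent:<userId>` key, and `markWrite` / `chooseTarget` are hypothetical names; the 5 s window matches the snippet above.

```javascript
const STICKY_MS = 5000; // same 5 s window as the snippet above

// In-memory stand-in for the Redis key `recent:<userId>` (demo assumption).
const lastWrite = new Map();

// Call this on the write path, right after a successful write.
function markWrite(userId, nowMs) {
  lastWrite.set(userId, nowMs);
}

// Route reads to the primary while the user's last write is still fresh.
function chooseTarget(userId, nowMs) {
  const ts = lastWrite.get(userId);
  return ts !== undefined && nowMs - ts < STICKY_MS ? 'primary' : 'replica';
}

markWrite('u1', 1000);
console.log(chooseTarget('u1', 3000)); // 'primary'  (2 s after the write)
console.log(chooseTarget('u1', 7000)); // 'replica'  (6 s after the write)
console.log(chooseTarget('u2', 3000)); // 'replica'  (no recent write)
```

A fixed window is a heuristic: if replica lag exceeds the window, the user can still read stale data, so the window should sit comfortably above the lag alert threshold.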
Backup from replica
# No impact on the primary
pg_basebackup -h replica -D backup/ -X stream
→ Large backups put no load on the primary.
Connection pool (PgBouncer / pgpool)
App → PgBouncer → Primary / Replica
- Connection multiplexing
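- Routing (primary for writes, replica for SELECTs). pgpool-II can split by statement; PgBouncer routes only by database name, as in the config below.
- Automatic reconnect on failover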
# pgbouncer.ini
[databases]
app = host=primary port=5432 dbname=app
app_ro = host=replica port=5432 dbname=app
[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25
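Because PgBouncer picks the backend from the database name, the read/write split has to happen in the application. The sketch below shows one deliberately naive way to choose between the `app` and `app_ro` entries from the config above; `databaseFor` is a hypothetical helper, and anything it cannot prove to be a plain read goes to the primary.

```javascript
// Choose the PgBouncer database alias for a statement.
// Naive by design: only plain SELECTs outside a transaction go to the replica.
function databaseFor(sql, { inTransaction = false } = {}) {
  const text = sql.trim();
  const isSelect = /^select\b/i.test(text);
  // Row-locking reads must run on the primary.
  const takesLocks = /\bfor\s+(update|share|no\s+key\s+update|key\s+share)\b/i.test(text);
  return isSelect && !takesLocks && !inTransaction ? 'app_ro' : 'app';
}

console.log(databaseFor('SELECT * FROM orders WHERE user_id = $1')); // 'app_ro'
console.log(databaseFor('SELECT * FROM orders FOR UPDATE'));         // 'app'
console.log(databaseFor('UPDATE orders SET status = $1'));           // 'app'
```

This check does not catch SELECTs that call functions with side effects, such as `nextval`. In practice, letting each call site state whether it reads or writes is more reliable than inspecting SQL text.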
RDS Multi-AZ vs Read Replica
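Multi-AZ (instance deployment): automatic failover to a standby in a different AZ of the same region. The standby does not serve reads.
Read Replica: serves reads and can be promoted during a failure, but promotion is a manual step.
→ In production, use Multi-AZ and Read Replicas together.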
Cross-region replica
Primary (us-east-1)
└── Replica (us-east-1, sync) <- HA
└── Replica (eu-west-1, async) <- DR + read close
└── Replica (ap-northeast-1, async) <- read close
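With a topology like the one above, the app still has to decide which node serves a given read. The following sketch prefers a replica in the client's region, falls back to the least-lagged healthy replica, and finally to the primary. The host names, lag figures, and `pickReadEndpoint` are all hypothetical.

```javascript
// Pick a read endpoint: local replica first, then the freshest healthy replica,
// then the primary. Assumes `nodes` always contains exactly one primary.
function pickReadEndpoint(clientRegion, nodes, maxLagSeconds) {
  const healthy = nodes.filter(
    (n) => n.role === 'replica' && n.lagSeconds <= maxLagSeconds
  );
  const local = healthy.find((n) => n.region === clientRegion);
  if (local) return local.host;
  if (healthy.length > 0) {
    return healthy.reduce((a, b) => (a.lagSeconds <= b.lagSeconds ? a : b)).host;
  }
  return nodes.find((n) => n.role === 'primary').host;
}

// Made-up inventory mirroring the diagram above.
const nodes = [
  { host: 'pg-use1-primary', role: 'primary', region: 'us-east-1', lagSeconds: 0 },
  { host: 'pg-use1-replica', role: 'replica', region: 'us-east-1', lagSeconds: 0.1 },
  { host: 'pg-euw1-replica', role: 'replica', region: 'eu-west-1', lagSeconds: 2 },
  { host: 'pg-apne1-replica', role: 'replica', region: 'ap-northeast-1', lagSeconds: 40 },
];

console.log(pickReadEndpoint('eu-west-1', nodes, 10));      // local replica is healthy
console.log(pickReadEndpoint('ap-northeast-1', nodes, 10)); // local replica lags too much
```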
🤔 Decision Criteria
| Situation | Recommendation |
|---|---|
| HA | Multi-AZ + automatic failover |
| Read scaling | Read replica |
| DR | Cross-region replica |
| Cross-version migration | Logical replication |
| Partial sync | Logical (publication) |
| Self-hosted | Patroni |
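❌ Anti-patterns
- No lag monitoring: an hour of lag goes unnoticed.
- Not dropping the slot of a retired standby: WAL accumulates without bound.
- Synchronous replication with a single standby: if it dies, production writes stall.
- Hardcoding the primary IP in the app: connections break on failover.
- Treating a replica as a substitute for backups: it is not. Keep separate backups.
- Ignoring read-after-write: users cannot see their own recent changes.
- Never testing failover: it fails during a real incident.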
🤖 LLM Usage Hints
- Patroni + etcd + HAProxy = self-hosted HA.
- RDS Multi-AZ + Read Replica = managed.
- Lag alarms + slot management + failover drills.