---
id: db-replica-operations
title: "Replica Operations: Streaming / Lag / Failover"
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [database, postgres, replication, vibe-coding]
tech_stack: { language: "Postgres", applicable_to: ["Backend"] }
applied_in: []
aliases: [streaming replication, replication lag, failover, hot standby, Patroni, repmgr]
---
# Replica Operations
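> Running read replicas in production takes **lag monitoring + automated failover + WAL retention management**. Patroni / repmgr (self-hosted) and RDS / Aurora (managed) automate most of it.
## 📖 Core Concepts
- Streaming replication: the primary streams WAL to the standby, which replays it continuously.
- Synchronous: commit waits for a replica to acknowledge (safer, slower).
- Asynchronous: the primary does not wait for replicas (the default).
- Hot standby: a standby that accepts read-only queries while it replays WAL.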
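## 💻 Code Patterns
### Primary configuration
```
# postgresql.conf
wal_level = replica        # or logical
max_wal_senders = 10
wal_keep_size = 1GB        # PostgreSQL 13+; alternatively use a replication slot
hot_standby = on           # takes effect on standbys; harmless on the primary
# pg_hba.conf
# (prefer scram-sha-256 over md5 where the clients support it)
host replication replicator <standby-ip>/32 md5
```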
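### Replication slot (WAL retention)
```sql
SELECT pg_create_physical_replication_slot('standby1');
```
→ The primary keeps the WAL the standby still needs, even while the standby is disconnected.
⚠️ If a standby never comes back, WAL accumulates without bound and can fill the primary's disk. Drop unused slots with `pg_drop_replication_slot('standby1')`, or cap retention with `max_slot_wal_keep_size` (PostgreSQL 13+).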
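One way to act on the warning above is to flag slots that are inactive and retaining a lot of WAL. The following is a minimal sketch, assuming rows shaped like the result of the query shown in the comment; the `SlotRow` type, the `findStaleSlots` function, and the 10 GiB limit are hypothetical names and values chosen for illustration.

```typescript
// Assumed input: rows from a query run on the primary, such as
//   SELECT slot_name, active,
//          pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
//   FROM pg_replication_slots;
// SlotRow and findStaleSlots are hypothetical names, not part of any library.
interface SlotRow {
  slotName: string;
  active: boolean;
  retainedBytes: number; // WAL the primary must keep for this slot
}

// Return the names of inactive slots that retain more WAL than the limit.
// These are candidates for review, not for blind deletion.
function findStaleSlots(rows: SlotRow[], maxRetainedBytes: number): string[] {
  return rows
    .filter((r) => !r.active && r.retainedBytes > maxRetainedBytes)
    .map((r) => r.slotName);
}

const GIB = 1024 * 1024 * 1024;
const staleSlots = findStaleSlots(
  [
    { slotName: "standby1", active: true, retainedBytes: 16 * 1024 * 1024 },
    { slotName: "old_standby", active: false, retainedBytes: 42 * GIB },
  ],
  10 * GIB
);
// staleSlots is ["old_standby"]
```

A slot flagged this way would still need a human check that the standby is really retired before it is dropped.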
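### Standby setup
```bash
# Take a base backup (snapshot) of the primary with pg_basebackup
pg_basebackup -h primary -D /var/lib/postgresql/data \
  -U replicator -P -R -X stream -S standby1
# -R writes standby.signal and primary_conninfo automatically (PostgreSQL 12+)
# -S uses the existing replication slot 'standby1'
```
→ The standby begins streaming as soon as it starts.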
### Lag monitoring
```sql
-- On the primary
SELECT
  application_name, client_addr, state, sync_state,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn)) AS sent_lag,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag,
  EXTRACT(EPOCH FROM (NOW() - reply_time)) AS reply_seconds_ago
FROM pg_stat_replication;
-- On the standby
SELECT
  pg_is_in_recovery(),
  pg_last_wal_replay_lsn(),
  NOW() - pg_last_xact_replay_timestamp() AS lag;
-- Note: this value also grows while the primary is idle (nothing new to replay).
```
→ Example thresholds: lag > 5 s = warning, > 1 min = critical.
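For dashboards that receive raw LSN strings rather than pre-computed sizes, the byte lag can be derived on the application side. The sketch below is an illustration under stated assumptions: `parseLsn`, `lsnDiffBytes`, and `classifyLag` are hypothetical helper names, and the 5 s / 60 s cut-offs simply mirror the example thresholds above.

```typescript
// Parse a textual LSN such as "16/B374D848" into its two 32-bit halves.
function parseLsn(lsn: string): { hi: number; lo: number } {
  const parts = lsn.split("/");
  if (parts.length !== 2) throw new Error("invalid LSN: " + lsn);
  const hi = parseInt(parts[0], 16);
  const lo = parseInt(parts[1], 16);
  if (isNaN(hi) || isNaN(lo)) throw new Error("invalid LSN: " + lsn);
  return { hi: hi, lo: lo };
}

// Byte distance a - b, the same quantity pg_wal_lsn_diff(a, b) reports.
// Exact as long as the difference stays below 2^53 bytes.
function lsnDiffBytes(a: string, b: string): number {
  const x = parseLsn(a);
  const y = parseLsn(b);
  return (x.hi - y.hi) * 4294967296 + (x.lo - y.lo);
}

type LagLevel = "ok" | "warning" | "critical";

// Map a time lag in seconds onto the example thresholds (5 s / 60 s).
function classifyLag(lagSeconds: number): LagLevel {
  if (lagSeconds > 60) return "critical";
  if (lagSeconds > 5) return "warning";
  return "ok";
}
```

For example, `lsnDiffBytes("0/3000060", "0/3000000")` evaluates to 96, and `classifyLag(12)` evaluates to `"warning"`.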
### Lag alarm
```yaml
- alert: ReplicationLagHigh
  expr: pg_replication_lag_seconds > 30
  for: 2m
- alert: ReplicationStopped
  expr: pg_replication_lag_seconds > 600
  for: 1m
  labels: { severity: critical }
```
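The `for:` clause means the expression has to hold continuously for that long before the alert fires, so a brief spike does not page anyone. A simplified model of that behavior, as a sketch: the `Sample` type and `firstFiringTime` function are hypothetical, and this is not how the alerting engine itself is implemented.

```typescript
// One evaluation of the alert expression (hypothetical shape).
interface Sample {
  t: number;          // evaluation time, in seconds
  lagSeconds: number; // value of the lag metric at that time
}

// Return the time at which the alert would first fire, or null if never.
// The alert is "pending" from the first sample above the threshold and
// fires once it has stayed above it for at least forSeconds.
function firstFiringTime(
  samples: Sample[],
  threshold: number,
  forSeconds: number
): number | null {
  let pendingSince: number | null = null;
  for (const s of samples) {
    if (s.lagSeconds > threshold) {
      if (pendingSince === null) pendingSince = s.t;
      if (s.t - pendingSince >= forSeconds) return s.t;
    } else {
      pendingSince = null; // condition cleared, pending state resets
    }
  }
  return null;
}
```

With a threshold of 30 and `forSeconds` of 120, samples at t = 0, 60, 120 that all exceed the threshold fire at t = 120, while a single dip below the threshold restarts the clock.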
### Failover (automatic)
```
1. The primary dies
2. An automation tool (Patroni / repmgr) detects it
3. The most advanced standby (highest replayed LSN) is promoted
4. The remaining standbys follow the new primary
5. The app discovers the new primary (DNS / VIP / pgbouncer)
```
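Step 3 above can be sketched as a small selection function. This is an illustration only, not the algorithm any particular tool uses: the `Standby` type and `pickPromotionCandidate` function are hypothetical, and the lag cap is modeled loosely on the idea behind Patroni's `maximum_lag_on_failover`.

```typescript
// Hypothetical view of a standby at failover time.
interface Standby {
  name: string;
  replayLsnBytes: number; // replayed WAL position as a byte offset
  healthy: boolean;
}

// Pick the healthy standby that has replayed the most WAL, but refuse to
// promote anything that trails the last known primary position by more
// than maxLagBytes.
function pickPromotionCandidate(
  standbys: Standby[],
  lastPrimaryLsnBytes: number,
  maxLagBytes: number
): Standby | null {
  let best: Standby | null = null;
  for (const s of standbys) {
    if (!s.healthy) continue;
    if (lastPrimaryLsnBytes - s.replayLsnBytes > maxLagBytes) continue;
    if (best === null || s.replayLsnBytes > best.replayLsnBytes) best = s;
  }
  return best;
}
```

Returning `null` when every standby is too far behind is deliberate: promoting a badly lagging replica would silently discard committed transactions, so stopping for a human decision is the safer default.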
### Patroni
```yaml
# patroni.yml
# Per-node settings such as name and connect_address are omitted here.
scope: postgres-cluster
namespace: /service/
restapi:
  listen: 0.0.0.0:8008
etcd:
  hosts: etcd1:2379, etcd2:2379, etcd3:2379
bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576 # 1MB
postgresql:
  listen: 0.0.0.0:5432
  data_dir: /var/lib/postgresql/data
  authentication:
    replication:
      username: replicator
      password: ...
```
→ Leader election goes through the DCS (etcd / Consul): Patroni nodes compete for a leader key stored there.
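The interaction between `ttl` and `loop_wait` is easier to see with a toy model of a leader key that expires. The sketch below is a simplified in-memory stand-in, not Patroni code: the `LeaseStore` class is hypothetical, and a real DCS performs these steps with atomic compare-and-set operations across nodes.

```typescript
interface LeaderKey {
  owner: string;
  expiresAt: number; // seconds
}

// In-memory stand-in for a DCS key with a TTL (hypothetical).
class LeaseStore {
  private key: LeaderKey | null = null;

  // Acquire or renew the leader key. Succeeds if the key is free, expired,
  // or already owned by this node. Returns true if `node` holds it afterwards.
  tryAcquire(node: string, now: number, ttlSeconds: number): boolean {
    const k = this.key;
    if (k === null || k.expiresAt <= now || k.owner === node) {
      this.key = { owner: node, expiresAt: now + ttlSeconds };
      return true;
    }
    return false;
  }

  // Current leader, or null if the key is missing or expired.
  leader(now: number): string | null {
    const k = this.key;
    return k !== null && k.expiresAt > now ? k.owner : null;
  }
}
```

With `ttl` 30 and a renewal every 10 seconds, a leader that last renewed at t = 20 and then crashed keeps the key until t = 50. Other nodes cannot take over before then, so in this model `ttl` bounds how long a dead primary goes unnoticed.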
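### App-side endpoint
```
Patroni REST API (health checks, answered per node)
GET /primary (/master in older releases) → 200 only if this node is the current primary
GET /replica → 200 only if this node is a running replica
Or: HAProxy with Patroni health checks
```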
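Because each node answers only for itself, a client or load balancer has to probe every node and look for the one that returns 200 on the primary health-check endpoint (`/primary`, or `/master` in older Patroni releases). A minimal sketch of that decision, assuming the status codes have already been collected; `findPrimary` is a hypothetical helper and the probing itself is left out.

```typescript
// statusByNode: HTTP status each node returned for its primary health check.
// Returns the single node that answered 200, or null when there is no
// unambiguous answer (no primary, or more than one node claiming the role).
function findPrimary(statusByNode: Record<string, number>): string | null {
  const primaries = Object.keys(statusByNode).filter(
    (node) => statusByNode[node] === 200
  );
  return primaries.length === 1 ? primaries[0] : null;
}

const currentPrimary = findPrimary({ pg1: 503, pg2: 200, pg3: 503 });
// currentPrimary is "pg2"
```

Treating two simultaneous 200 responses as "no answer" is a conservative choice: routing writes during a suspected split-brain is worse than refusing them briefly.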
```ts
import { Pool } from 'pg';

// App connections, failover-friendly: stable names (DNS / VIP), not node IPs
const writer = new Pool({ connectionString: 'postgresql://primary:5432/...' });
const reader = new Pool({ connectionString: 'postgresql://replica:5432/...' });
// Or a single load-balancer endpoint
const pool = new Pool({ connectionString: 'postgresql://lb-haproxy:5000/...' });
```
### Synchronous replication (optional)
```
# postgresql.conf
synchronous_commit = on
synchronous_standby_names = 'ANY 1 (standby1, standby2)'
# commit waits until at least 1 of the listed replicas acknowledges
```
→ Safety ↑, latency ↑.
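The quorum rule in `ANY 1 (standby1, standby2)` can be modeled in a few lines, which also shows why listing a single standby is risky. This is a sketch of the rule only; `canCommit` is a hypothetical function, and the real server tracks acknowledgements per WAL position rather than as a simple name list.

```typescript
// Model of synchronous_standby_names = 'ANY n (names...)':
// a commit may return once at least `required` of the listed standbys
// have acknowledged it.
function canCommit(
  required: number,
  listed: string[],
  acked: string[]
): boolean {
  let count = 0;
  for (const name of listed) {
    if (acked.indexOf(name) !== -1) count++;
  }
  return count >= required;
}
```

`canCommit(1, ["standby1", "standby2"], ["standby2"])` is `true`, so losing one of two standbys is survivable. `canCommit(1, ["standby1"], [])` is `false`: with only one listed standby, its outage blocks every commit on the primary.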
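### Logical replication (different schema / partial)
```sql
-- Publisher (primary); requires wal_level = logical
CREATE PUBLICATION app_pub FOR TABLE orders, users;
-- Subscriber; the tables must already exist here (DDL is not replicated)
CREATE SUBSCRIPTION app_sub
  CONNECTION 'host=primary user=replicator dbname=app'
  PUBLICATION app_pub;
```
→ The subscriber's schema may differ (e.g. extra columns or indexes). Useful for cross-version migration and for replicating selected tables only.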
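### Read-after-write (working around replica lag)
```ts
// Reads shortly after the same user's write go to the primary.
// Assumes `redis`, `primary`, and `replica` clients exist, and that the write
// path stores Date.now() under `recent:${userId}`.
async function getOrders(userId: string) {
  const recentWrite = await redis.get(`recent:${userId}`); // string | null
  const isRecent = recentWrite !== null && Date.now() - Number(recentWrite) < 5000;
  const db = isRecent ? primary : replica;
  return db.query('SELECT * FROM orders WHERE user_id = $1', [userId]);
}
```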
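The routing decision above can be isolated from Redis and the database so it is easy to unit test. The sketch below keeps the write marker in process memory and takes the clock as a parameter; the `StickyRouter` class and its method names are hypothetical. A shared store is still needed in practice when several app instances serve the same user.

```typescript
type Target = "primary" | "replica";

// Remembers each user's last write time and pins that user's reads to the
// primary for windowMs afterwards.
class StickyRouter {
  private lastWrite: Record<string, number> = {};

  constructor(private windowMs: number, private now: () => number) {}

  // Call from the write path after a successful write.
  recordWrite(userId: string): void {
    this.lastWrite[userId] = this.now();
  }

  // Call from the read path to choose a connection pool.
  targetFor(userId: string): Target {
    const t = this.lastWrite[userId];
    return t !== undefined && this.now() - t < this.windowMs
      ? "primary"
      : "replica";
  }
}
```

Usage would look like `const router = new StickyRouter(5000, () => Date.now())`, with `router.recordWrite(userId)` after each write and `router.targetFor(userId)` before each read. The window should exceed the replica lag you are willing to tolerate, which ties this pattern back to the lag thresholds being monitored.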
### Backup from replica
```bash
# No load on the primary
pg_basebackup -h replica -D backup/ -X stream
```
→ Large backups put no load on the primary.
### Connection pool (PgBouncer / pgpool)
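```
App → PgBouncer → Primary / Replica
- Connection multiplexing
- Routing: PgBouncer routes only by database alias (app vs app_ro); pgpool-II can split reads and writes per statement
- Failover: repoint the pooler's host (or place it behind a VIP / HAProxy), then clients reconnect
```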
```ini
# pgbouncer.ini
[databases]
app = host=primary port=5432 dbname=app
app_ro = host=replica port=5432 dbname=app
[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25
```
### RDS Multi-AZ vs Read Replica
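```
Multi-AZ (instance deployment): automatic failover to a standby in a different AZ of the same Region. The standby is not readable.
Read Replica: readable; can be promoted manually, but there is no automatic failover.
→ Production = Multi-AZ + Read Replica together.
```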
### Cross-region replica
```
Primary (us-east-1)
├── Replica (us-east-1, sync)        <- HA
├── Replica (eu-west-1, async)       <- DR + reads close to users
└── Replica (ap-northeast-1, async)  <- reads close to users
```
## 🤔 Decision Criteria
| Situation | Recommendation |
|---|---|
| HA | Multi-AZ + automatic failover |
| Read scaling | Read replica |
| DR | Cross-region replica |
| Cross-version migration | Logical replication |
| Partial sync | Logical (publication) |
| Self-host | Patroni |
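## ❌ Anti-patterns
- **No lag monitoring**: an hour of lag goes unnoticed.
- **Slot of a retired standby never dropped**: WAL accumulates without bound.
- **Synchronous replication with a single standby**: if it goes down, commits on the primary block and production stalls.
- **Primary IP hardcoded in the app**: after failover the app still points at the old primary.
- **Treating a replica as a backup**: it is not; a bad write or DROP replicates too. Keep separate backups.
- **Ignoring read-after-write**: users cannot see their own writes.
- **No failover testing**: failover then fails during a real incident.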
## 🤖 LLM Usage Hints
- Patroni + etcd + HAProxy = self-host HA.
- RDS Multi-AZ + Read Replica = managed.
- Lag alarm + slot management + failover drill.
## 🔗 Related Documents
- [[DB_Read_Replica_Patterns]]
- [[DevOps_Disaster_Recovery]]
- [[DB_Change_Data_Capture]]