---
id: db-replica-operations
title: Replica Operations - Streaming / Lag / Failover
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [database, postgres, replication, vibe-coding]
tech_stack: { language: "Postgres", applicable_to: ["Backend"] }
applied_in: []
aliases: [streaming replication, replication lag, failover, hot standby, Patroni, repmgr]
---

# Replica Operations

> Running read replicas in production takes three things: **lag monitoring, automated failover, and WAL retention management**. Patroni, repmgr, RDS, and Aurora automate most of this.
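
## 📖 Core Concepts
- Streaming replication: the primary streams WAL records to each standby, which replays them continuously.
- Synchronous: a commit waits for a replica's acknowledgment (safer, slower).
- Asynchronous: the primary does not wait for replicas (the usual setup).
- Hot standby: a standby that accepts read-only queries while it replays WAL.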
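
## 💻 Code Patterns

### Primary configuration
```
# postgresql.conf
wal_level = replica        # or logical (required for logical replication)
max_wal_senders = 10       # one per standby, plus headroom for pg_basebackup
wal_keep_size = 1GB        # PG 13+; alternatively rely on a replication slot
hot_standby = on           # takes effect on the standby (default: on); harmless on the primary

# pg_hba.conf
host replication replicator <standby-ip>/32 scram-sha-256   # md5 is the legacy method
```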
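
### Replication slot (WAL retention)
```sql
SELECT pg_create_physical_replication_slot('standby1');
```

→ The primary retains the WAL that `standby1` still needs, even while the standby is disconnected.

⚠️ If the standby never comes back, WAL accumulates until the primary's disk fills. Drop unused slots with `pg_drop_replication_slot()`, or cap retention with `max_slot_wal_keep_size` (PG 13+).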
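
The warning above can be turned into a routine check. The following is a minimal sketch, assuming the rows come from a query against `pg_replication_slots` like the one in the comment; the `SlotRow` shape, the `findRiskySlots` helper, and the 10 GiB budget are illustrative assumptions, not part of any Postgres tooling.

```typescript
// Row shape assumed to come from a query such as:
//   SELECT slot_name, active,
//          pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
//   FROM pg_replication_slots;
interface SlotRow {
  slotName: string;
  active: boolean;
  retainedBytes: number;
}

// Hypothetical helper: returns the names of slots that are inactive AND
// retain more WAL than the budget. These are candidates for
// pg_drop_replication_slot() once someone confirms the standby is gone.
function findRiskySlots(rows: SlotRow[], maxRetainedBytes: number): string[] {
  return rows
    .filter((r) => !r.active && r.retainedBytes > maxRetainedBytes)
    .map((r) => r.slotName);
}

const GiB = 1024 * 1024 * 1024;
const risky = findRiskySlots(
  [
    { slotName: "standby1", active: true, retainedBytes: 16 * 1024 * 1024 },
    { slotName: "old_standby", active: false, retainedBytes: 40 * GiB },
  ],
  10 * GiB,
);
console.log(risky); // ["old_standby"]
```

The helper only reports candidates; dropping a slot is left as a manual step because a slot that is merely inactive for a moment (a restarting standby) should not be removed.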
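
### Standby setup
```bash
# Take a base backup (snapshot) of the primary with pg_basebackup
pg_basebackup -h primary -D /var/lib/postgresql/data \
  -U replicator -P -R -X stream -S standby1
# -R         writes standby.signal and primary_conninfo automatically (PG 12+)
# -X stream  streams WAL while the backup runs
# -S         uses the existing replication slot standby1
```

→ When the standby starts, it begins streaming from the primary.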

### Lag monitoring
```sql
-- On the primary
-- (pg_stat_replication also has built-in interval columns write_lag / flush_lag / replay_lag, PG 10+)
SELECT
  application_name, client_addr, state, sync_state,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn)) AS sent_lag_size,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag_size,
  EXTRACT(EPOCH FROM (NOW() - reply_time)) AS reply_seconds_ago
FROM pg_stat_replication;

-- On the standby
SELECT
  pg_is_in_recovery(),
  pg_last_wal_replay_lsn(),
  -- Report zero when fully caught up; otherwise an idle primary looks like growing lag
  CASE WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn()
       THEN INTERVAL '0'
       ELSE NOW() - pg_last_xact_replay_timestamp()
  END AS lag;
```

→ Example thresholds: lag > 5 s = warning, > 1 min = critical. Tune them to the workload.
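
When lag is computed in application code rather than SQL, the LSN arithmetic is simple enough to reproduce. The sketch below assumes LSNs arrive as text in the usual `hi/lo` hexadecimal form; `parseLsn`, `lsnDiffBytes`, and `classifyLag` are hypothetical helper names, and the default thresholds simply reuse the 5 s / 60 s example values from this note.

```typescript
// Parse a Postgres LSN such as "0/3000060" into its two 32-bit halves.
function parseLsn(lsn: string): { hi: number; lo: number } {
  const m = /^([0-9A-Fa-f]{1,8})\/([0-9A-Fa-f]{1,8})$/.exec(lsn);
  if (!m) throw new Error(`invalid LSN: ${lsn}`);
  return { hi: parseInt(m[1], 16), lo: parseInt(m[2], 16) };
}

// Byte distance a - b, in the spirit of pg_wal_lsn_diff(a, b).
// Exact as long as the difference itself stays below 2^53 bytes.
function lsnDiffBytes(a: string, b: string): number {
  const x = parseLsn(a);
  const y = parseLsn(b);
  return (x.hi - y.hi) * 4294967296 + (x.lo - y.lo);
}

type LagLevel = "ok" | "warning" | "critical";

// Map a lag in seconds onto the example thresholds (5 s / 60 s by default).
function classifyLag(seconds: number, warnAt = 5, critAt = 60): LagLevel {
  if (seconds > critAt) return "critical";
  if (seconds > warnAt) return "warning";
  return "ok";
}

console.log(lsnDiffBytes("0/3000060", "0/3000000")); // 96
console.log(lsnDiffBytes("1/0", "0/FFFFFFF0"));      // 16
console.log(classifyLag(12));                        // "warning"
```

Subtracting the two halves separately keeps the result exact even when the absolute LSN values exceed what a JavaScript number can represent precisely.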

### Lag alarm
```yaml
# Prometheus alerting rules; the metric name depends on the exporter in use
- alert: ReplicationLagHigh
  expr: pg_replication_lag_seconds > 30
  for: 2m
  labels: { severity: warning }

- alert: ReplicationStopped
  expr: pg_replication_lag_seconds > 600
  for: 1m
  labels: { severity: critical }
```

### Failover (automatic)
```
1. The primary fails.
2. The HA tool (Patroni / repmgr) detects the failure.
3. It promotes the most advanced standby (highest replayed LSN).
4. The remaining standbys re-point to the new primary.
5. The app finds the new primary (DNS / VIP / PgBouncer).
```
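
Step 3 of the sequence above, choosing which standby to promote, can be sketched as a pure function. This is a simplified illustration and not how Patroni or repmgr actually implement it: the `Standby` shape, `pickPromotionCandidate`, and the byte-offset representation of LSNs are assumptions, and `maxLagBytes` is only loosely analogous to Patroni's `maximum_lag_on_failover`.

```typescript
interface Standby {
  node: string;
  replayLsnBytes: number; // replay position expressed as a byte offset
  healthy: boolean;
}

// Pick the healthy standby with the highest replay position, skipping any
// that trail the last known primary position by more than maxLagBytes.
// Returns undefined when nothing qualifies (no automatic promotion).
function pickPromotionCandidate(
  standbys: Standby[],
  lastPrimaryLsnBytes: number,
  maxLagBytes: number,
): Standby | undefined {
  const eligible = standbys.filter(
    (s) => s.healthy && lastPrimaryLsnBytes - s.replayLsnBytes <= maxLagBytes,
  );
  return eligible.reduce<Standby | undefined>(
    (best, s) => (best === undefined || s.replayLsnBytes > best.replayLsnBytes ? s : best),
    undefined,
  );
}

const chosen = pickPromotionCandidate(
  [
    { node: "standby1", replayLsnBytes: 9999000, healthy: true },
    { node: "standby2", replayLsnBytes: 9999900, healthy: false }, // most advanced, but down
    { node: "standby3", replayLsnBytes: 8000000, healthy: true },  // too far behind
  ],
  10000000,
  1048576,
);
console.log(chosen ? chosen.node : "none"); // "standby1"
```

Returning `undefined` instead of the least-bad node reflects the trade-off behind a lag limit: promoting a standby that is far behind trades availability for lost transactions.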
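
### Patroni
```yaml
# patroni.yml (partial; a real node also needs superuser credentials, among others)
scope: postgres-cluster
namespace: /service/
name: node1                         # placeholder; must be unique per node

restapi:
  listen: 0.0.0.0:8008
  connect_address: <node-ip>:8008   # placeholder

etcd:                               # use the etcd3 section for the etcd v3 API
  hosts: etcd1:2379,etcd2:2379,etcd3:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576  # 1 MiB

postgresql:
  listen: 0.0.0.0:5432
  connect_address: <node-ip>:5432   # placeholder
  data_dir: /var/lib/postgresql/data
  authentication:
    replication:
      username: replicator
      password: ...
```

→ etcd or Consul acts as the DCS (distributed configuration store): it holds the leader lock that the Patroni nodes compete for, so only one node can be primary at a time.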

### App-side endpoint
```
Patroni REST API (health checks, answered per node)
GET /primary → HTTP 200 only on the current primary (older releases: /master)
GET /replica → HTTP 200 only on a running standby

Or: HAProxy in front, using the Patroni health checks
```

```ts
import { Pool } from 'pg';

// App connections, failover-friendly: the host names must follow the current
// roles (DNS / VIP), never a fixed node IP
const writer = new Pool({ connectionString: 'postgresql://primary:5432/...' });
const reader = new Pool({ connectionString: 'postgresql://replica:5432/...' });

// Or a single load-balancer endpoint
const pool = new Pool({ connectionString: 'postgresql://lb-haproxy:5000/...' });
```

### Synchronous replication (optional)
```
# postgresql.conf
synchronous_commit = on
synchronous_standby_names = 'ANY 1 (standby1, standby2)'
# A commit waits until at least one of the listed standbys acknowledges it.
# The names must match each standby's application_name.
```

→ Higher durability, higher commit latency.
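
The `ANY 1 (...)` quorum rule can be modeled in a few lines to show why a single synchronous standby is fragile. This is a toy model of the rule only, not of Postgres internals; `quorumSatisfied` and its parameters are hypothetical names.

```typescript
// Toy model of quorum-based synchronous commit: 'ANY k (s1, s2, ...)'.
// A commit may return once at least k of the listed standbys have acknowledged it.
function quorumSatisfied(k: number, listed: string[], acked: string[]): boolean {
  const acks = listed.filter((standby) => acked.indexOf(standby) !== -1).length;
  return acks >= k;
}

// Two candidates: losing one still lets commits through.
console.log(quorumSatisfied(1, ["standby1", "standby2"], ["standby2"])); // true

// Single candidate and it is down: no ack can arrive, so commits block.
console.log(quorumSatisfied(1, ["standby1"], []));                       // false
```

Acknowledgments from standbys that are not in the list are ignored, which mirrors the requirement that the listed names match each standby's `application_name`.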

### Logical replication (different schema / partial)
```sql
-- Publisher (requires wal_level = logical)
CREATE PUBLICATION app_pub FOR TABLE orders, users;

-- Subscriber (the tables must already exist here; DDL is not replicated)
CREATE SUBSCRIPTION app_sub
  CONNECTION 'host=primary user=replicator dbname=app'
  PUBLICATION app_pub;
```

→ The subscriber's schema may differ (extra columns, different indexes). Useful for cross-version migration and for replicating selected tables only.
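
### Read-after-write (working around replica lag)
```ts
// Reads shortly after the same user's write go to the primary.
// Assumes `redis`, `primary`, and `replica` clients are created elsewhere, and that
// the write path stores Date.now() under `recent:${userId}`.
async function getOrders(userId: string) {
  const recentWrite = await redis.get(`recent:${userId}`); // string | null
  const wroteRecently = recentWrite !== null && Date.now() - Number(recentWrite) < 5000;
  const db = wroteRecently ? primary : replica;
  return db.query('SELECT * FROM orders WHERE user_id = $1', [userId]);
}
```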
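
The same routing rule can be isolated from Redis and the database so it is easy to test. The sketch below is an in-memory variant under stated assumptions: `ReadAfterWriteRouter` is a hypothetical class, the clock is injected for determinism, and a per-process store like this only works when one process serves a given user (a shared store such as Redis is needed otherwise).

```typescript
type Target = "primary" | "replica";

class ReadAfterWriteRouter {
  // userId -> timestamp (ms) of that user's most recent write
  private lastWriteAt: { [userId: string]: number } = {};

  constructor(
    private readonly stickyMs: number,
    private readonly now: () => number = Date.now,
  ) {}

  // Call after every successful write for this user.
  recordWrite(userId: string): void {
    this.lastWriteAt[userId] = this.now();
  }

  // Reads go to the primary while the user's last write is recent.
  targetForRead(userId: string): Target {
    const t = this.lastWriteAt[userId];
    return t !== undefined && this.now() - t < this.stickyMs ? "primary" : "replica";
  }
}

let clock = 0; // fake clock so the example is deterministic
const router = new ReadAfterWriteRouter(5000, () => clock);

router.recordWrite("u1");
clock = 3000;
console.log(router.targetForRead("u1")); // "primary"
clock = 6000;
console.log(router.targetForRead("u1")); // "replica"
console.log(router.targetForRead("u2")); // "replica"
```

The sticky window should be at least as long as the worst replica lag that is tolerated before alerting; otherwise a user can still read stale data right after the window closes.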

### Backup from replica
```bash
# Runs against the standby, so the primary carries no backup I/O
pg_basebackup -h replica -U replicator -D backup/ -X stream -P
```

→ Large backups do not load the primary.

### Connection pool (PgBouncer / Pgpool-II)
```
App → PgBouncer → Primary / Replica
- Connection multiplexing
- Routing by database name (app → primary, app_ro → replica); PgBouncer does not
  split reads from writes by itself, Pgpool-II can
- Server connections are re-established after failover, once the host entry
  points at the new primary
```

```ini
# pgbouncer.ini
[databases]
app = host=primary port=5432 dbname=app
app_ro = host=replica port=5432 dbname=app

[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25
```
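
The relationship between `max_client_conn` and `default_pool_size` is easier to see as arithmetic: many clients share a small number of server connections, and `default_pool_size` applies per (database, user) pair. The sketch below is a back-of-the-envelope check only; the helper names and the reserve of 10 connections are assumptions, and it ignores reserve pools and multiple PgBouncer instances.

```typescript
// Worst-case server connections PgBouncer may open against one Postgres host:
// default_pool_size applies per (database, user) pair.
function maxServerConnections(dbUserPairs: number, defaultPoolSize: number): number {
  return dbUserPairs * defaultPoolSize;
}

// True when that worst case fits under max_connections minus a reserve kept
// for superuser, replication, and monitoring sessions (the reserve is an assumption).
function fitsConnectionBudget(
  dbUserPairs: number,
  defaultPoolSize: number,
  pgMaxConnections: number,
  reserved: number,
): boolean {
  return maxServerConnections(dbUserPairs, defaultPoolSize) <= pgMaxConnections - reserved;
}

// One app user on the `app` database: up to 1000 clients share 25 server connections.
console.log(maxServerConnections(1, 25));          // 25
console.log(fitsConnectionBudget(1, 25, 100, 10)); // true
// Four distinct users on the same host would need up to 100 server connections.
console.log(fitsConnectionBudget(4, 25, 100, 10)); // false
```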
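
### RDS Multi-AZ vs Read Replica
```
Multi-AZ:     automatic failover to a synchronous standby in a different AZ
              (same region). The standby does not serve reads.
Read Replica: serves reads (async). Can be promoted manually; no automatic failover.

→ Production = Multi-AZ + Read Replica together.
```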

### Cross-region replica
```
Primary (us-east-1)
├── Replica (us-east-1, sync)        <- HA
├── Replica (eu-west-1, async)       <- DR + nearby reads
└── Replica (ap-northeast-1, async)  <- nearby reads
```

## 🤔 Decision Criteria
| Situation | Recommendation |
|---|---|
| HA | Multi-AZ + automatic failover |
| Read scaling | Read replica |
| DR | Cross-region replica |
| Cross-version migration | Logical replication |
| Partial sync | Logical replication (publication) |
| Self-hosted | Patroni |
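
## ❌ Anti-patterns
- **No lag monitoring**: a replica can sit an hour behind without anyone noticing.
- **Never dropping the slot of a retired standby**: WAL accumulates without bound.
- **Synchronous replication with a single standby**: if it goes down, commits on the primary block and production stalls.
- **Hardcoding the primary's IP in the app**: after failover the app keeps connecting to the old primary.
- **Treating a replica as a backup**: it is not; a bad `DELETE` replicates too. Keep separate backups.
- **Ignoring read-after-write**: users cannot see their own recent writes.
- **Never testing failover**: it then fails during a real incident.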

## 🤖 LLM Usage Hints
- Patroni + etcd + HAProxy = self-hosted HA.
- RDS Multi-AZ + Read Replica = managed.
- Lag alarms + slot management + failover drills.

## 🔗 Related Documents
- [[DB_Read_Replica_Patterns]]
- [[DevOps_Disaster_Recovery]]
- [[DB_Change_Data_Capture]]