---
id: db-replica-operations
title: Replica Operations - Streaming / Lag / Failover
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [database, postgres, replication, vibe-coding]
tech_stack: { language: "Postgres", applicable_to: ["Backend"] }
applied_in: []
aliases: [streaming replication, replication lag, failover, hot standby, Patroni, repmgr]
---

# Replica Operations

> Running read replicas in production takes three things: **lag monitoring, automated failover, and WAL retention management**. Patroni, repmgr, RDS, and Aurora automate much of this.

## 📖 Core concepts

- Streaming replication: the primary streams WAL to each standby.
- Synchronous: a commit waits for a replica to acknowledge it (safer, slower).
- Asynchronous: the primary does not wait for replicas (the usual choice).
- Hot standby: a standby that accepts read-only queries.

## 💻 Code patterns

### Primary configuration

```
# postgresql.conf
wal_level = replica        # or logical
max_wal_senders = 10
wal_keep_size = 1GB        # or use a replication slot
hot_standby = on           # takes effect on standbys: allows read-only queries

# pg_hba.conf
host replication replicator /32 md5
```

### Replication slot (WAL retention)

```sql
SELECT pg_create_physical_replication_slot('standby1');
```

→ The slot keeps WAL on the primary even while the standby is disconnected.
⚠️ If the standby never comes back, WAL accumulates without bound and can fill the disk. Drop unused slots, or cap retention with `max_slot_wal_keep_size` (PostgreSQL 13+).

### Standby setup

```bash
# Take a base backup (snapshot) of the primary with pg_basebackup
pg_basebackup -h primary -D /var/lib/postgresql/data \
  -U replicator -P -R -X stream -S standby1
# -R writes standby.signal and primary_conninfo automatically
```

→ The standby begins streaming as soon as it starts.

### Lag monitoring

```sql
-- On the primary
SELECT
  application_name,
  client_addr,
  state,
  sync_state,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn))   AS sent_lag,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag,
  EXTRACT(EPOCH FROM (NOW() - reply_time)) AS reply_seconds_ago
FROM pg_stat_replication;

-- On the standby
SELECT
  pg_is_in_recovery(),
  pg_last_wal_replay_lsn(),
  NOW() - pg_last_xact_replay_timestamp() AS lag;
```

→ Lag over 5 s is a warning; over 1 min is critical. The standby-side value measures time since the last replayed transaction, so it also grows while the primary is idle; cross-check it against the byte lag before paging anyone.
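
The byte and time figures returned by the queries above can be turned into a severity level in application code. The following is a minimal sketch, not part of any library: `parseLsn`, `lsnDiffBytes`, and `classifyLag` are hypothetical helper names, and the thresholds are the 5 s / 1 min values from this note. It relies only on the fact that an LSN is printed as two hexadecimal numbers (`high/low`, 32 bits each), which is what `pg_wal_lsn_diff` subtracts on the server.

```typescript
// Thresholds taken from this note: > 5 s is a warning, > 60 s is critical.
const WARNING_SECONDS = 5;
const CRITICAL_SECONDS = 60;
const TWO_POW_32 = 4294967296;

type LagSeverity = "ok" | "warning" | "critical";

// An LSN such as "16/B374D848" is two hex numbers: high 32 bits / low 32 bits.
const LSN_PATTERN = /^([0-9A-Fa-f]{1,8})\/([0-9A-Fa-f]{1,8})$/;

// Hypothetical helper: split an LSN string into its numeric halves.
function parseLsn(lsn: string): { hi: number; lo: number } {
  const match = LSN_PATTERN.exec(lsn);
  if (match === null) {
    throw new Error("invalid LSN: " + lsn);
  }
  const [, hiHex = "0", loHex = "0"] = match;
  return { hi: parseInt(hiHex, 16), lo: parseInt(loHex, 16) };
}

// Byte distance between two LSNs, the same quantity pg_wal_lsn_diff returns.
// Subtracting the halves separately keeps the result exact as long as the
// difference itself stays below 2^53 bytes.
function lsnDiffBytes(current: string, replayed: string): number {
  const a = parseLsn(current);
  const b = parseLsn(replayed);
  return (a.hi - b.hi) * TWO_POW_32 + (a.lo - b.lo);
}

// Map a time lag in seconds onto the thresholds above.
function classifyLag(lagSeconds: number): LagSeverity {
  if (lagSeconds > CRITICAL_SECONDS) return "critical";
  if (lagSeconds > WARNING_SECONDS) return "warning";
  return "ok";
}

// Usage: the standby has replayed up to 0/3000000, the primary is at 0/3000060.
console.log(lsnDiffBytes("0/3000060", "0/3000000"), classifyLag(12)); // prints: 96 warning
```

Computing the byte lag client-side is mostly useful when the two LSNs come from different servers (primary and standby) and cannot be subtracted in a single query.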

### Lag alarm

```yaml
- alert: ReplicationLagHigh
  expr: pg_replication_lag_seconds > 30
  for: 2m
- alert: ReplicationStopped
  expr: pg_replication_lag_seconds > 600
  for: 1m
  labels: { severity: critical }
```

### Failover (automatic)

```
1. The primary dies
2. The HA tool (Patroni / repmgr) detects it
3. The most advanced standby is promoted
4. The remaining standbys follow the new primary
5. The app discovers the new primary (DNS / VIP / pgbouncer)
```

### Patroni

```yaml
# patroni.yml
scope: postgres-cluster
namespace: /service/
restapi:
  listen: 0.0.0.0:8008
etcd:
  hosts: etcd1:2379, etcd2:2379, etcd3:2379
bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576  # 1MB
postgresql:
  listen: 0.0.0.0:5432
  data_dir: /var/lib/postgresql/data
  authentication:
    replication:
      username: replicator
      password: ...
```

→ etcd or Consul holds the leader lock that Patroni uses for leader election.

### App-side endpoint

```
Patroni REST API
  GET /master   → current primary (named /primary in newer Patroni releases)
  GET /replica  → standby
or HAProxy + Patroni health check
```

```ts
import { Pool } from 'pg';

// App connections, failover-friendly: separate writer and reader pools
const writer = new Pool({ connectionString: 'postgresql://primary:5432/...' });
const reader = new Pool({ connectionString: 'postgresql://replica:5432/...' });

// Or a single load-balancer endpoint
const pool = new Pool({ connectionString: 'postgresql://lb-haproxy:5000/...' });
```

### Synchronous replication (optional)

```
# postgresql.conf
synchronous_commit = on
synchronous_standby_names = 'ANY 1 (standby1, standby2)'
# a commit waits until at least 1 replica acknowledges it
```

→ Higher safety, higher commit latency.

### Logical replication (different schema / partial)

```sql
-- Primary (publisher); requires wal_level = logical
CREATE PUBLICATION app_pub FOR TABLE orders, users;

-- Subscriber
CREATE SUBSCRIPTION app_sub
  CONNECTION 'host=primary user=replicator dbname=app'
  PUBLICATION app_pub;
```

→ The subscriber's tables may differ from the publisher's (extra columns, different indexes), which makes this suitable for cross-version migration and for replicating selected tables only. DDL is not replicated, so the tables must already exist on the subscriber.

### Read-after-write (working around replica lag)

```ts
// Reads that follow a recent write by the same user go to the primary.
// `redis`, `primary`, and `replica` are assumed to be clients created elsewhere.
async function getOrders(userId: string) {
  const recentWrite = await redis.get(`recent:${userId}`);
  const db = recentWrite && Date.now() - Number(recentWrite) < 5000 ?
    primary : replica;
  return db.query('SELECT * FROM orders WHERE user_id = $1', [userId]);
}
```

### Backup from replica

```bash
# No impact on the primary
pg_basebackup -h replica -D backup/ -X stream
```

→ A large backup puts no load on the primary.

### Connection pool (PgBouncer / pgpool)

```
App → PgBouncer → Primary / Replica
- Connection multiplexing
- Routing: PgBouncer routes by database alias (app vs app_ro);
  statement-level read/write splitting needs pgpool-II or app logic
- Reconnect on failover: repoint the host (DNS / VIP / config reload)
  and clients keep the same endpoint
```

```ini
# pgbouncer.ini
[databases]
app    = host=primary port=5432 dbname=app
app_ro = host=replica port=5432 dbname=app

[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25
```

### RDS Multi-AZ vs Read Replica

```
Multi-AZ:     automatic failover to a standby in another AZ of the same region.
              The standby is not readable (Multi-AZ DB instance deployment).
Read Replica: readable; can be promoted manually, no automatic failover.
→ Production = Multi-AZ + Read Replica together.
```

### Cross-region replica

```
Primary (us-east-1)
 ├── Replica (us-east-1, sync)         <- HA
 ├── Replica (eu-west-1, async)        <- DR + reads close to users
 └── Replica (ap-northeast-1, async)   <- reads close to users
```

## 🤔 Decision criteria

| Situation | Recommendation |
|---|---|
| HA | Multi-AZ + automatic failover |
| Read scaling | Read replica |
| DR | Cross-region replica |
| Cross-version migration | Logical replication |
| Partial sync | Logical (publication) |
| Self-hosted | Patroni |

## ❌ Anti-patterns

- **No lag monitoring**: an hour of lag goes unnoticed.
- **Never dropping the slot of a retired standby**: WAL accumulates without bound.
- **Synchronous replication with a single standby**: if it dies, commits on the primary block and production stalls.
- **Hardcoding the primary IP in the app**: the app breaks on failover.
- **Treating a replica as a backup**: it is not one. A mistaken `DROP TABLE` replicates too, so take backups separately.
- **Ignoring read-after-write**: users cannot see their own writes.
- **No failover testing**: the first real incident is where it fails.

## 🤖 LLM usage hints

- Patroni + etcd + HAProxy = self-hosted HA.
- RDS Multi-AZ + Read Replica = managed.
- Lag alarm + slot management + failover drill.

## 🔗 Related documents

- [[DB_Read_Replica_Patterns]]
- [[DevOps_Disaster_Recovery]]
- [[DB_Change_Data_Capture]]