

Antigravity Agent 93ec7e9056 [G1-Sync] Manual knowledge update

2026-05-09 21:08:02 +09:00

6.2 KiB


id, title, category, status, source_trust_level, verification_status, created_at, updated_at, tags, tech_stack, applied_in, aliases

title: Replica Operations

Operating read replicas requires lag monitoring, automated failover, and WAL retention management. Patroni / repmgr (self-hosted) and RDS / Aurora (managed) automate much of this.

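📖 Core concepts

Streaming replication: the primary streams WAL to the standby.
Synchronous: each commit waits for a replica to confirm (safer, slower).
Asynchronous: the primary does not wait (the usual setup).
Hot standby: the standby serves read-only queries.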

💻 코드 패턴

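Primary configuration

# postgresql.conf
wal_level = replica      # or logical (required for logical replication)
max_wal_senders = 10
wal_keep_size = 1GB      # PostgreSQL 13+ (replaces wal_keep_segments); or use a replication slot
hot_standby = on         # read by the standby; default on since PostgreSQL 10

# pg_hba.conf (scram-sha-256 is the default password_encryption since PostgreSQL 14; md5 is deprecated)
host replication replicator <standby-ip>/32 scram-sha-256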
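Without a slot, wal_keep_size bounds how long a standby can stay disconnected: once the primary has written more than wal_keep_size of new WAL, the segments the standby needs may be recycled and it has to be re-seeded with pg_basebackup. A rough estimate, assuming a steady WAL generation rate (the 2 MiB/s figure is hypothetical; measure the real rate by sampling pg_current_wal_lsn()):

```javascript
// Rough upper bound on how long a standby may stay disconnected before
// wal_keep_size no longer guarantees the WAL it needs (no replication slot).
// Assumes a constant WAL generation rate, which real workloads rarely have.
const GiB = 1024 ** 3; // PostgreSQL size units are base-1024
const MiB = 1024 ** 2;

function maxDisconnectSeconds(walKeepBytes, walBytesPerSecond) {
  if (walBytesPerSecond <= 0) return Infinity; // idle primary: no pressure
  return walKeepBytes / walBytesPerSecond;
}

// Example: wal_keep_size = 1GB, primary writing ~2 MiB/s of WAL (hypothetical rate)
const seconds = maxDisconnectSeconds(1 * GiB, 2 * MiB);
console.log(`~${Math.round(seconds / 60)} min before re-seeding may be needed`);
```

Bursty workloads (bulk loads, VACUUM) can generate WAL far faster than the average, so treat the result as optimistic.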

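Replication slot (WAL retention)

SELECT pg_create_physical_replication_slot('standby1');

→ The primary retains WAL even while the standby is disconnected.

⚠️ If the standby never comes back, WAL accumulates without bound until the primary's disk fills. Drop unused slots (SELECT pg_drop_replication_slot('standby1');); on PostgreSQL 13+, max_slot_wal_keep_size caps how much WAL a slot may retain.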
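Finding slots to drop can be automated. A minimal sketch, assuming rows shaped like the output of SELECT slot_name, active, restart_lsn FROM pg_replication_slots plus the primary's pg_current_wal_lsn(); the 10 GiB threshold is a hypothetical policy. LSNs are parsed from their X/Y hex text into 64-bit byte positions, the same quantity pg_wal_lsn_diff subtracts:

```javascript
// Parse a PostgreSQL LSN ("16/B374D848") into a byte position (BigInt).
// The part before "/" is the high 32 bits, the part after is the low 32 bits.
function parseLsn(lsn) {
  const [hi, lo] = lsn.split("/");
  return (BigInt("0x" + hi) << 32n) + BigInt("0x" + lo);
}

// Return names of inactive slots whose retained WAL exceeds maxRetainedBytes.
// These are candidates for pg_drop_replication_slot(slot_name).
function staleSlots(slots, currentLsn, maxRetainedBytes) {
  const current = parseLsn(currentLsn);
  return slots
    .filter((s) => !s.active && s.restart_lsn !== null) // restart_lsn can be NULL
    .map((s) => ({ name: s.slot_name, retained: current - parseLsn(s.restart_lsn) }))
    .filter((s) => s.retained > maxRetainedBytes)
    .map((s) => s.name);
}

const slots = [
  { slot_name: "standby1", active: true, restart_lsn: "16/B3000000" },
  { slot_name: "old_standby", active: false, restart_lsn: "10/00000000" },
];
console.log(staleSlots(slots, "16/B374D848", 10n * 1024n ** 3n)); // → [ 'old_standby' ]
```

Review the list before dropping: a standby that is only briefly down will need a fresh pg_basebackup once its slot is gone.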

Standby setup

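# Take a base backup of the primary with pg_basebackup (target directory must be empty)
pg_basebackup -h primary -D /var/lib/postgresql/data \
  -U replicator -P -R -X stream -S standby1
# -R writes standby.signal and primary_conninfo (plus primary_slot_name with -S) automatically

→ When the standby starts, it connects to the primary and begins streaming.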

Lag monitoring

-- On the primary
SELECT
  application_name, client_addr, state, sync_state,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn)) AS sent_lag,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag,
  EXTRACT(EPOCH FROM (NOW() - reply_time)) AS reply_seconds_ago
FROM pg_stat_replication;

-- On the standby
SELECT
  pg_is_in_recovery(),
  pg_last_wal_replay_lsn(),
  NOW() - pg_last_xact_replay_timestamp() AS lag;

→ Typical thresholds: lag > 5 s = warning, > 1 min = critical.
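The thresholds above can be encoded in a small classifier for a custom health check. One catch: NOW() - pg_last_xact_replay_timestamp() only advances when a transaction is replayed, so it also grows on an idle primary. This sketch therefore treats a standby that has replayed everything it received (pg_last_wal_receive_lsn() equal to pg_last_wal_replay_lsn()) as caught up. The input field names are assumptions:

```javascript
// Classify standby replay lag: > 5 s = warning, > 60 s = critical.
// lagSeconds: NOW() - pg_last_xact_replay_timestamp() in seconds (null if nothing replayed yet)
// receiveLsn / replayLsn: pg_last_wal_receive_lsn() / pg_last_wal_replay_lsn() as text
function classifyLag({ lagSeconds, receiveLsn, replayLsn }) {
  if (receiveLsn !== null && receiveLsn === replayLsn) return "ok"; // nothing left to replay
  if (lagSeconds === null) return "unknown"; // no transaction replayed since startup
  if (lagSeconds > 60) return "critical";
  if (lagSeconds > 5) return "warning";
  return "ok";
}

console.log(classifyLag({ lagSeconds: 90, receiveLsn: "0/5000000", replayLsn: "0/4FFF000" })); // critical
```

This only reflects the standby's own view: a standby that has stopped receiving WAL also looks caught up, so pair it with the primary-side pg_stat_replication query.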

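Lag alerts (Prometheus rules)

groups:
  - name: postgres-replication
    rules:
      - alert: ReplicationLagHigh
        expr: pg_replication_lag_seconds > 30   # postgres_exporter metric; name varies by version
        for: 2m
        labels: { severity: warning }
      - alert: ReplicationStopped
        expr: pg_replication_lag_seconds > 600
        for: 1m
        labels: { severity: critical }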
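The for: clause separates a brief spike from a sustained problem: Prometheus keeps the alert pending while the expression is true and fires only once it has stayed true for the whole for: duration; any evaluation where it is false resets it. A simplified simulation of that state machine (sample times and values are hypothetical; real Prometheus evaluates on its own evaluation_interval):

```javascript
// Simplified model of a Prometheus alerting rule with a `for:` duration.
// samples: [{ t: seconds, value: lagSeconds }] in time order.
// Returns the state after each evaluation: "inactive" | "pending" | "firing".
function evaluateRule(samples, threshold, forSeconds) {
  let activeSince = null; // time the expression first became true
  return samples.map(({ t, value }) => {
    if (!(value > threshold)) {
      activeSince = null; // condition broke: reset, alert resolves
      return "inactive";
    }
    if (activeSince === null) activeSince = t;
    return t - activeSince >= forSeconds ? "firing" : "pending";
  });
}

// ReplicationLagHigh: lag > 30 for 2m, evaluated every 30 s with a constant 45 s lag
const states = evaluateRule(
  [0, 30, 60, 90, 120, 150].map((t) => ({ t, value: 45 })),
  30,
  120,
);
console.log(states); // pending x4, then firing at t=120 and t=150
```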

Failover (automatic)

1. The primary dies.
2. An automation tool (Patroni / repmgr) detects the failure.
3. The most up-to-date standby is promoted.
4. The remaining standbys re-point to the new primary.
5. The app discovers the new primary (DNS / VIP / PgBouncer).
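Step 3 is the critical decision. A simplified sketch of candidate selection in the spirit of Patroni's maximum_lag_on_failover: standbys lagging the old primary's last known position by more than the limit are excluded, and among the rest the one with the highest LSN wins. The input shape is hypothetical, and real tools also check health, tags such as nofailover, and DCS state:

```javascript
// Parse "X/Y" LSN text into a 64-bit byte position.
function parseLsn(lsn) {
  const [hi, lo] = lsn.split("/");
  return (BigInt("0x" + hi) << 32n) + BigInt("0x" + lo);
}

// standbys: [{ name, healthy, lsn }]; lastPrimaryLsn: last LSN known from the old primary.
// Returns the name of the standby to promote, or null if none is eligible.
function pickFailoverCandidate(standbys, lastPrimaryLsn, maxLagBytes = 1048576n) {
  const primaryPos = parseLsn(lastPrimaryLsn);
  let best = null;
  for (const s of standbys) {
    if (!s.healthy) continue;
    const pos = parseLsn(s.lsn);
    if (primaryPos - pos > maxLagBytes) continue; // too far behind: would lose too much data
    if (best === null || pos > best.pos) best = { name: s.name, pos };
  }
  return best ? best.name : null;
}
```

Returning null (no failover) is deliberate: with every standby too far behind, promoting one would silently discard committed data, so a human should decide.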

Patroni

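# patroni.yml (one file per node; name and addresses below are per-node examples)
scope: postgres-cluster
namespace: /service/
name: node1                        # unique per node

restapi:
  listen: 0.0.0.0:8008
  connect_address: 10.0.0.11:8008  # address other nodes / HAProxy use (example)

etcd3:                             # etcd v3 API; the etcd: section uses the deprecated v2 API
  hosts:
    - etcd1:2379
    - etcd2:2379
    - etcd3:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576  # 1 MiB: replicas lagging more are not promoted

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.0.0.11:5432  # example
  data_dir: /var/lib/postgresql/data
  authentication:
    superuser:
      username: postgres
      password: ...
    replication:
      username: replicator
      password: ...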

→ etcd / Consul (the DCS) holds the leader key; Patroni uses it for leader election.
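Leader election rests on a lease: the leader refreshes its leader key in the DCS every loop_wait seconds, and the key carries a TTL (ttl: 30). If the leader stops refreshing, the key expires and another node may acquire it. A toy in-memory model of that lease (no real etcd; LeaderLease and tryAcquire are hypothetical names):

```javascript
// Toy leader lease: one key with an owner and an expiry time (seconds).
// Acquire succeeds only if the key is free, expired, or already ours (refresh).
class LeaderLease {
  constructor(ttlSeconds) {
    this.ttl = ttlSeconds;
    this.owner = null;
    this.expiresAt = 0;
  }

  // Compare-and-set style acquire/refresh at time `now`.
  tryAcquire(node, now) {
    const free = this.owner === null || now >= this.expiresAt;
    if (!free && this.owner !== node) return false;
    this.owner = node;
    this.expiresAt = now + this.ttl;
    return true;
  }
}

const lease = new LeaderLease(30);
lease.tryAcquire("node1", 0);  // true: node1 becomes leader
lease.tryAcquire("node2", 10); // false: node1's lease is still valid
lease.tryAcquire("node1", 10); // true: node1 refreshes (loop_wait = 10)
// node1 dies and stops refreshing; the lease expires at t = 40
lease.tryAcquire("node2", 45); // true: node2 takes over
```

The real DCS makes the compare-and-set atomic across nodes, which is why an etcd / Consul quorum is required rather than a single store.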

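App-side endpoints

Patroni REST API (port 8008)
GET /primary  → 200 only on the current leader (older alias: /master)
GET /replica  → 200 only on a healthy running standby

Or HAProxy in front, using these endpoints as Patroni health checks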

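const { Pool } = require('pg');

// App: separate writer / reader pools. Failover-friendly only if these hostnames
// follow the current primary / replicas (DNS, VIP, or HAProxy ports)
const writer = new Pool({ connectionString: 'postgresql://primary:5432/...' });
const reader = new Pool({ connectionString: 'postgresql://replica:5432/...' });

// Or a single load-balancer endpoint
const pool = new Pool({ connectionString: 'postgresql://lb-haproxy:5000/...' });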
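Without HAProxy, a script or app can ask each Patroni node which one is the leader: GET /primary (older Patroni versions: /master) answers 200 only on the leader. A minimal sketch, assuming Node 18+ (global fetch and AbortSignal.timeout) and hypothetical node URLs; the fetch function is injectable so the logic can be tested without a cluster:

```javascript
// Ask each Patroni node's REST API whether it is the leader.
// GET /primary returns HTTP 200 only on the node holding the leader lock.
async function findPrimary(nodeUrls, fetchFn = fetch, timeoutMs = 2000) {
  for (const base of nodeUrls) {
    try {
      const res = await fetchFn(`${base}/primary`, {
        signal: AbortSignal.timeout(timeoutMs),
      });
      if (res.status === 200) return base;
    } catch {
      // node down or timed out: try the next one
    }
  }
  return null; // no leader found (e.g. failover in progress)
}

// Usage (hypothetical hosts):
// const leader = await findPrimary(["http://pg1:8008", "http://pg2:8008", "http://pg3:8008"]);
```

A null result during failover is expected; callers should retry with backoff rather than fall back to a stale address.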

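Synchronous replication (optional)

# postgresql.conf
synchronous_commit = on
synchronous_standby_names = 'ANY 1 (standby1, standby2)'
# Commits wait until at least one of the listed standbys confirms

→ Higher durability, higher commit latency (every commit waits for a standby round trip).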
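ANY 1 (standby1, standby2) is quorum commit: a commit returns once any one listed standby confirms, so losing a single standby does not block writes. FIRST n (...) is priority-based instead: it waits for the n leftmost listed standbys that are currently connected. A model of both rules (not PostgreSQL's code; the connected/acked inputs are hypothetical):

```javascript
// Model of synchronous_standby_names commit rules.
// listed: names in the setting, in order; connected: Set of streaming standbys;
// acked: Set of standbys that confirmed this commit's WAL.
function commitCanReturn(method, num, listed, connected, acked) {
  if (method === "ANY") {
    // quorum: any `num` of the listed standbys
    return listed.filter((n) => acked.has(n)).length >= num;
  }
  if (method === "FIRST") {
    // priority: the first `num` listed standbys that are connected must all ack
    const syncSet = listed.filter((n) => connected.has(n)).slice(0, num);
    return syncSet.length === num && syncSet.every((n) => acked.has(n));
  }
  throw new Error(`unknown method: ${method}`);
}

const listed = ["standby1", "standby2"];
// ANY 1: standby1 is down, standby2 acked → commit returns
console.log(commitCanReturn("ANY", 1, listed, new Set(["standby2"]), new Set(["standby2"]))); // true
```

When fewer standbys are available than the rule needs, commits block; this is the single-standby failure mode listed under anti-patterns.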

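Logical replication (different schema / subset of tables)

-- Publisher (primary, wal_level = logical)
CREATE PUBLICATION app_pub FOR TABLE orders, users;

-- Subscriber
CREATE SUBSCRIPTION app_sub
CONNECTION 'host=primary user=replicator dbname=app'
PUBLICATION app_pub;

→ The subscriber may differ in schema (e.g. extra columns, different indexes) and in major version, which enables cross-version migration and selective tables. DDL is not replicated: create the tables on the subscriber first.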

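Read-after-write (working around replica lag)

// Within 5 s of a user's own write, read from the primary.
// Assumes `primary` / `replica` are pg Pools and `redis` is a connected client;
// the write path must set `recent:${userId}` to Date.now() (not shown).
async function getOrders(userId) {
  const lastWrite = Number(await redis.get(`recent:${userId}`)); // missing key → 0
  const db = lastWrite && Date.now() - lastWrite < 5000 ? primary : replica;
  return db.query('SELECT * FROM orders WHERE user_id = $1', [userId]);
}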
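The routing decision can be isolated from Redis and pg for testing. A sketch with an in-memory Map standing in for Redis; ReadYourWritesRouter and the injected clock are hypothetical. The 5-second window should exceed the replica lag you normally observe:

```javascript
// Tracks each user's last write time and routes their reads accordingly.
// A Map stands in for Redis here; with several app instances use a shared
// store with a TTL, since the write and the later read may hit different processes.
class ReadYourWritesRouter {
  constructor(windowMs = 5000, now = () => Date.now()) {
    this.windowMs = windowMs;
    this.now = now;
    this.lastWrite = new Map(); // userId -> timestamp (ms); never pruned in this sketch
  }

  recordWrite(userId) {
    this.lastWrite.set(userId, this.now());
  }

  // "primary" if the user wrote within the window, else "replica".
  target(userId) {
    const t = this.lastWrite.get(userId);
    return t !== undefined && this.now() - t < this.windowMs ? "primary" : "replica";
  }
}

// Wiring (assumes pg Pools `primary` and `replica` exist):
// const db = router.target(userId) === "primary" ? primary : replica;
```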

Backup from replica

# Takes the backup load off the primary
pg_basebackup -h replica -D backup/ -X stream

→ Large backups don't add I/O load to the primary.

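Connection pool (PgBouncer / pgpool-II)

App → PgBouncer → Primary / Replica
- Connection multiplexing (many client connections share a few server connections)
- Routing by database name (app → primary, app_ro → replica); PgBouncer does not split reads and writes per statement, pgpool-II can
- Reconnects after failover, provided host is a DNS name or VIP that follows the primary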

# pgbouncer.ini
[databases]
app = host=primary port=5432 dbname=app
app_ro = host=replica port=5432 dbname=app

[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25
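With pool_mode = transaction, a server connection is assigned to a client only for one transaction, so up to max_client_conn clients share default_pool_size server connections (per database/user pair) and the rest queue. A toy model of that queueing (not PgBouncer's code; TransactionPool is a hypothetical name):

```javascript
// Toy transaction pool: at most `size` transactions run at once; others wait FIFO.
class TransactionPool {
  constructor(size) {
    this.size = size;
    this.inUse = 0;
    this.waiters = [];
  }

  async acquire() {
    if (this.inUse < this.size) {
      this.inUse++;
      return;
    }
    await new Promise((resolve) => this.waiters.push(resolve)); // handed over on release
  }

  release() {
    const next = this.waiters.shift();
    if (next) next(); // pass the server connection straight to the next client
    else this.inUse--;
  }

  // Run one "transaction", holding a server connection only while it executes.
  async run(tx) {
    await this.acquire();
    try {
      return await tx();
    } finally {
      this.release();
    }
  }
}
```

This is also why session state (SET, advisory locks, session-level prepared statements) is unsafe in transaction mode: consecutive transactions from one client may land on different server connections.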

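RDS Multi-AZ vs Read Replica

Multi-AZ: synchronous standby in a different AZ with automatic failover; the standby does not serve reads (Multi-AZ DB clusters, with readable standbys, are the exception).
Read Replica: asynchronous, serves reads; can be promoted, but promotion is manual.

→ In production, use Multi-AZ and read replicas together.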

Cross-region replica

Primary (us-east-1)
├── Replica (us-east-1, sync)          <- HA
├── Replica (eu-west-1, async)         <- DR + low-latency local reads
└── Replica (ap-northeast-1, async)    <- low-latency local reads
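Local replicas only help if the app routes reads by region. A minimal sketch, assuming a hypothetical endpoint map that mirrors the topology above (the .example hostnames are placeholders): prefer a same-region replica, otherwise fall back to a replica in the primary's region, then to the primary itself. Async cross-region replicas can lag noticeably more, so read-after-write paths should still go to the primary:

```javascript
// Hypothetical topology mirroring the diagram above.
const topology = {
  primary: { region: "us-east-1", host: "pg-primary.us-east-1.example" },
  replicas: [
    { region: "us-east-1", host: "pg-replica.us-east-1.example" },
    { region: "eu-west-1", host: "pg-replica.eu-west-1.example" },
    { region: "ap-northeast-1", host: "pg-replica.ap-northeast-1.example" },
  ],
};

// Choose the read host for an app instance running in `appRegion`.
function readerHost(appRegion, topo = topology) {
  const local = topo.replicas.find((r) => r.region === appRegion);
  if (local) return local.host;
  const home = topo.replicas.find((r) => r.region === topo.primary.region);
  return home ? home.host : topo.primary.host;
}

console.log(readerHost("eu-west-1")); // pg-replica.eu-west-1.example
```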

🤔 Decision criteria

Situation	Recommendation
HA	Multi-AZ + automatic failover
Read scaling	Read replica
DR	Cross-region replica
Cross-version migration	Logical replication
Partial sync	Logical (publication)
Self-hosted HA	Patroni

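❌ Anti-patterns

No lag monitoring: an hour of lag goes unnoticed.
Not dropping the slot of a retired standby: WAL accumulates without bound.
Synchronous replication with a single standby: if it dies, commits block and production stalls.
App hardcodes the primary's IP: after failover it keeps talking to the old node.
Treating a replica as a backup: it is not (a bad DELETE replicates immediately). Keep separate backups.
Ignoring read-after-write: users don't see their own changes.
Never testing failover: it fails during a real incident.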

🤖 Hints for LLM use

Patroni + etcd + HAProxy = self-hosted HA.
RDS Multi-AZ + Read Replica = managed.
Lag alerts + slot management + failover drills.
