"Always there + never forgets." Availability = the system can respond right away. Persistence (durability) = once data is committed, it is never lost. These are the two foundations of a distributed system and the currency of an SLA.
📖 Key concepts
Availability
The fraction of time during which a system delivers its intended service.
Measure: uptime / total time.
| Nines | Downtime / year |
| --- | --- |
| 99% | 3.65 days |
| 99.9% (3 nines) | 8.76 hours |
| 99.99% (4 nines) | 52.6 minutes |
| 99.999% (5 nines) | 5.26 minutes |
| 99.9999% (6 nines) | 31.5 seconds |
→ Each additional nine cuts the allowed downtime by 10×, and the cost of reaching it grows roughly exponentially.
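The downtime figures above follow directly from the definition: allowed downtime = total time × (1 − availability). A minimal sketch of that arithmetic, assuming a 365-day year (the function name `downtime_per_year` is illustrative, not from the original notes):

```python
from datetime import timedelta


def downtime_per_year(availability: float, days_per_year: float = 365.0) -> timedelta:
    """Return the downtime allowed per year at the given availability (0..1)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return timedelta(days=days_per_year * (1.0 - availability))


# Print the allowed yearly downtime for two to six nines.
for target in (0.99, 0.999, 0.9999, 0.99999, 0.999999):
    print(f"{target:.4%} -> {downtime_per_year(target)}")
```

Changing `days_per_year` to 30 gives the monthly figure that SLAs often quote instead.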
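Durability
The probability that data survives once it has been committed (one minus the probability of loss), usually quoted per year.
S3: designed for 11 nines (99.999999999%) of object durability.
Disk MTBF: on the order of 1 million hours (vendor-rated).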
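To relate an MTBF figure to a yearly loss probability, one common simplification is to assume a constant failure rate (an exponential lifetime model). The sketch below uses that assumption, plus independent failures and no repair or re-replication, so it is a rough upper-level illustration rather than how any storage service actually computes its durability. Function names are hypothetical.

```python
import math

HOURS_PER_YEAR = 8760


def annual_failure_rate(mtbf_hours: float) -> float:
    """Probability that one device fails within a year (constant failure rate assumed)."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def prob_all_copies_fail(mtbf_hours: float, copies: int) -> float:
    """Probability that every copy fails in the same year.

    Assumes independent failures and no repair, which real systems improve on
    by re-replicating as soon as a copy is lost.
    """
    return annual_failure_rate(mtbf_hours) ** copies


afr = annual_failure_rate(1_000_000)
print(f"single disk AFR: {afr:.3%}")
print(f"3 independent copies all lost: {prob_all_copies_fail(1_000_000, 3):.2e}")
```

With a 1-million-hour MTBF the single-disk annual failure rate comes out a little under 1%, which is why durability targets in the many-nines range depend on replication and fast repair rather than on any one device.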
RPO / RTO
RPO (Recovery Point Objective): the maximum age of data you can afford to lose, i.e. how far back the last recoverable point may be.
RTO (Recovery Time Objective): the maximum time allowed to restore service.
| RPO / RTO | Strategy |
| --- | --- |
| ~0 / ~0 | Sync replication, multi-region |
| minutes / minutes | Hot standby |
| hours / hours | Daily backup |
| days / days | Cold backup |
Designing for availability
Redundancy
N+1 / N+2 spare capacity, deployed active-passive or active-active.
Multi-AZ / multi-region.
Load balancer + health checks.
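Why redundancy helps can be shown with the standard composition rules: components in series multiply their availabilities, while N redundant replicas fail only if all of them fail. A minimal sketch, assuming independent failures (which shared power, network, or software bugs can violate in practice); the function names are illustrative:

```python
def serial_availability(*components: float) -> float:
    """Availability of a chain where every component must be up."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


def parallel_availability(component: float, replicas: int) -> float:
    """Availability of N redundant replicas where any one being up is enough."""
    return 1.0 - (1.0 - component) ** replicas


# Two 99% instances behind a load balancer behave like roughly 99.99%.
print(parallel_availability(0.99, 2))
# A 99.9% service that depends on a 99.9% database is below 99.9% overall.
print(serial_availability(0.999, 0.999))
```

The same arithmetic explains why a long dependency chain erodes availability even when each link looks healthy on its own.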
Fault tolerance
Graceful degradation.
Circuit breaker.
Bulkhead.
Retry with backoff.
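Two of the patterns listed above, retry with backoff and the circuit breaker, can be sketched in a few lines. This is a simplified illustration, not a production implementation: the names (`retry_with_backoff`, `CircuitBreaker`, `CircuitOpenError`), thresholds, and the injectable `sleep` and `clock` parameters are all assumptions chosen to keep the example testable.

```python
import random
import time


def retry_with_backoff(fn, attempts=5, base_delay=0.1, max_delay=5.0, sleep=time.sleep):
    """Call fn(); on failure wait up to base_delay * 2**n (capped, full jitter), then retry."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))  # jitter spreads out simultaneous retries


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, so one trial call is let through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result
```

The two are complementary: retries absorb brief, transient errors, while the breaker stops callers from piling retries onto a dependency that is already down.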
Auto-recovery
Self-healing (e.g. Kubernetes).
Auto-scaling.
Verify with chaos engineering.
Designing for persistence
ACID (RDBMS)
Atomicity: all-or-nothing.
Consistency: invariants are preserved.
Isolation: concurrent transactions do not interfere with each other.
Durability: a committed transaction stays committed.
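Atomicity and consistency are easiest to see in a failed transaction. The sketch below uses Python's built-in `sqlite3` with an in-memory database; the `accounts` table, the `CHECK` constraint, and the `transfer` helper are made up for illustration. The credit is applied before the debit on purpose, so that a failing debit shows the earlier credit being rolled back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount from src to dst; both updates commit together or not at all."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst))
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src))


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))


transfer(conn, "alice", "bob", 30)
try:
    transfer(conn, "alice", "bob", 500)  # would overdraw alice, so the CHECK fails
except sqlite3.IntegrityError:
    pass  # the credit to bob from this attempt has been rolled back

print(balances(conn))
```

After the second, failed transfer the balances are still those left by the first one (alice 70, bob 30): the invariant `balance >= 0` holds and no partial update is visible.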
Replication
Sync: commit only after N replicas acknowledge (costs latency).
Async: the leader commits first, then propagates (risks data loss on failover).
Quorum (Paxos / Raft): commit once a majority acknowledges.
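The quorum arithmetic is small enough to write down. The sketch below computes the majority size used by Paxos- and Raft-style protocols and how many node failures a cluster of N can tolerate. The `quorums_overlap` check (R + W > N) comes from Dynamo-style tunable quorums and is included only for comparison; all function names are illustrative.

```python
def majority(n: int) -> int:
    """Smallest number of nodes that forms a majority of n."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """How many nodes can be down while a majority is still reachable."""
    return n - majority(n)


def quorums_overlap(n: int, w: int, r: int) -> bool:
    """True if every read quorum intersects every write quorum (R + W > N)."""
    return r + w > n


for nodes in (3, 4, 5):
    print(nodes, majority(nodes), tolerated_failures(nodes))
```

Note that 4 nodes tolerate no more failures than 3, which is why consensus clusters are usually sized with an odd number of members.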
Backup
Full / incremental / differential.
3-2-1 rule: 3 copies, 2 different media, 1 offsite.
Test restores (critical, and often neglected).
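The 3-2-1 rule can be expressed as a simple check over an inventory of copies. This is only an illustrative sketch: the `BackupCopy` type, its fields, and the media labels are hypothetical, and whether the production copy counts toward the three depends on how a team interprets the rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    media: str      # e.g. "disk", "tape", "object-storage"
    offsite: bool   # stored away from the primary site


def satisfies_3_2_1(copies: list[BackupCopy]) -> bool:
    """At least 3 copies, on at least 2 kinds of media, with at least 1 offsite."""
    return (
        len(copies) >= 3
        and len({copy.media for copy in copies}) >= 2
        and any(copy.offsite for copy in copies)
    )


inventory = [
    BackupCopy(media="disk", offsite=False),
    BackupCopy(media="disk", offsite=False),
    BackupCopy(media="object-storage", offsite=True),
]
print(satisfies_3_2_1(inventory))
```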
Storage tier
Hot (S3 Standard): millisecond access.
Warm (Standard-IA): cheaper storage, but with a retrieval fee.
Cold (Glacier): retrieval takes hours.
Deep archive: retrieval within about 12 hours; cheapest.
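CAP / PACELC
CAP: Consistency, Availability, Partition tolerance. The usual slogan is "pick two"; since network partitions cannot be ruled out, the real choice during a partition is between consistency and availability.
PACELC: if Partitioned, choose Availability or Consistency; Else, choose Latency or Consistency.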
Synchronous replication (PostgreSQL)
-- On the primary
ALTER SYSTEM SET synchronous_standby_names = 'replica1, replica2';
ALTER SYSTEM SET synchronous_commit = on;
SELECT pg_reload_conf();
-- On each replica: start streaming replication.
-- A commit then returns only after a synchronous replica has acknowledged it.
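S3 lifecycle (storage tier)
Objects expire after 2555 days (about 7 years). The empty prefix filter applies the rule to every object.
{
  "Rules": [
    {
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER" },
        { "Days": 365, "StorageClass": "DEEP_ARCHIVE" }
      ],
      "Expiration": { "Days": 2555 }
    }
  ]
}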
SLO / Error budget
def error_budget(sli_target=0.999, period_days=30):
    """A 99.9% target leaves a 0.1% error budget, returned in minutes."""
    total_minutes = period_days * 24 * 60
    budget = total_minutes * (1 - sli_target)
    return budget  # minutes


def burn_rate(actual_errors, budget, elapsed_fraction):
    """Budget actually consumed, relative to what was expected at this point in the period."""
    expected = budget * elapsed_fraction
    return actual_errors / expected if expected > 0 else 0


# burn_rate > 1    → the budget is burning faster than planned.
# burn_rate > 14.4 → critical: 2% of a 30-day budget is gone in 1 hour.
Backup test restore
#!/bin/bash
# Weekly automated restore test

# Placeholder alert hook (hypothetical); replace with a real pager or chat integration.
alert() { echo "ALERT: $*" >&2; }

LATEST=$(aws s3 ls s3://backups/db/ | tail -1 | awk '{print $4}')
aws s3 cp "s3://backups/db/$LATEST" /tmp/
# Restore into the staging DB
pg_restore -d staging_test "/tmp/$LATEST"
# Verify with a sample query
psql staging_test -c "SELECT count(*) FROM users;" > /tmp/result
diff /tmp/result expected.txt || alert "Backup restore failed!"
→ A backup is only worth as much as its verified restore.
Multi-region failover (DNS)
# Route53 health check + failover routing (simplified sketch)
{
    'primary': {'region': 'us-east-1', 'health_check': 'http://primary/health'},
    'secondary': {'region': 'us-west-2', 'health_check': 'http://secondary/health'},
    'failover': 'PRIMARY_FAILS_TO_SECONDARY',
}
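Distributed lock (Redis; the single-instance building block of Redlock)
import redis
import uuid


def acquire_lock(client, key, ttl=10000):
    """Try to take the lock; ttl is in milliseconds. Returns a token, or None if already held."""
    token = str(uuid.uuid4())
    # SET key token NX PX ttl: set only if the key is absent, with an expiry.
    if client.set(key, token, nx=True, px=ttl):
        return token
    return None


def release_lock(client, key, token):
    """Delete the lock only if this token still owns it (atomic compare-and-delete in Lua)."""
    script = """
    if redis.call('get', KEYS[1]) == ARGV[1] then
        return redis.call('del', KEYS[1])
    end
    return 0
    """
    return client.eval(script, 1, key, token)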
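The functions above lock a single Redis node. Redlock, as usually described, extends this by attempting the same `SET NX PX` on several independent nodes and treating the lock as held only if a majority succeed within the lock's validity time. The following is a rough sketch of that idea, not a vetted implementation: `acquire_redlock`, `release_redlock`, and the `clock_drift` factor are illustrative names and values, `clients` is assumed to be a list of redis-py style clients, and real deployments would also use short per-node timeouts and retries.

```python
import time
import uuid

RELEASE_SCRIPT = """
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('del', KEYS[1])
end
return 0
"""


def release_redlock(clients, key, token):
    """Best-effort release on every node; only keys holding our token are deleted."""
    for client in clients:
        try:
            client.eval(RELEASE_SCRIPT, 1, key, token)
        except Exception:
            pass  # an unreachable node will drop the key when its TTL expires


def acquire_redlock(clients, key, ttl_ms=10000, clock_drift=0.01):
    """Try to take `key` on a majority of independent nodes.

    Returns (token, validity_ms) on success, or None on failure.
    """
    token = str(uuid.uuid4())
    start = time.monotonic()
    acquired = 0
    for client in clients:
        try:
            if client.set(key, token, nx=True, px=ttl_ms):
                acquired += 1
        except Exception:
            pass  # an unreachable node counts as a failed acquisition
    elapsed_ms = (time.monotonic() - start) * 1000
    # Remaining validity: TTL minus time spent acquiring and a drift allowance.
    validity_ms = ttl_ms - elapsed_ms - (ttl_ms * clock_drift + 2)
    if acquired >= len(clients) // 2 + 1 and validity_ms > 0:
        return token, validity_ms
    release_redlock(clients, key, token)  # undo any partial acquisition
    return None
```

The caller should finish its critical section within `validity_ms` and then call `release_redlock` with the returned token.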