---
id: wiki-2026-0508-spof
title: SPOF (Single Point of Failure)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Single Point of Failure, SPoF]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [spof, reliability, ha, distributed-systems, sre]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Go
  framework: Kubernetes
---

# SPOF (Single Point of Failure)

## One-line summary

> **"A single dependency whose failure takes down the entire system."** The most basic anti-pattern in reliability engineering; it is removed through redundancy, replication, and failover. Even in the 2020s cloud era, the Facebook BGP misconfiguration (2021), the Cloudflare control plane outage (2023), and repeated AWS us-east-1 incidents (2024) are dramatic demonstrations of region- and provider-level SPOFs.

## Core concepts

### Layers of SPOF

- **Hardware**: single PSU, single NIC, single rack, single AZ.
- **Network**: single ISP, single BGP route, single DNS provider.
- **Software**: leader without standby, single DB primary, single secret store.
- **Human**: bus factor of 1, where only one person knows the system.
- **Vendor**: single cloud, single CDN, single auth provider.

### Removal patterns

- **Redundancy**: N+1, N+2, 2N for power/cooling.
- **Replication**: multi-master (CRDT), consensus-based (Raft), multi-AZ DB.
- **Failover**: active-passive, active-active, anycast.
- **Bulkhead**: cell-based architecture, blast-radius limit.
- **Graceful degradation**: read-only mode, stale cache fallback.

### Applications

1. Multi-AZ / multi-region cloud architecture.
2. Database HA (Patroni, RDS Multi-AZ, Spanner).
3. Multi-CDN / multi-DNS strategy.
4. Cell-based isolation (AWS Lambda, Slack).
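
The effect of the redundancy pattern can be sketched with basic availability arithmetic. The Go snippet below is a minimal illustration, not a capacity model: the function names `Series`, `Parallel`, and `ApproxEqual` are hypothetical, the 0.99 per-component availability is an assumed figure, and `Parallel` assumes replicas fail independently, which correlated failures (shared AZ, shared control plane) violate in practice.

```go
package main

import (
	"fmt"
	"math"
)

// Series returns the availability of a chain in which every component
// must be up. Each component is a single point of failure for the chain.
func Series(avail ...float64) float64 {
	total := 1.0
	for _, a := range avail {
		total *= a
	}
	return total
}

// Parallel returns the availability of redundant replicas where any one
// surviving replica keeps the service up. Assumes independent failures.
func Parallel(avail ...float64) float64 {
	allDown := 1.0
	for _, a := range avail {
		allDown *= 1 - a
	}
	return 1 - allDown
}

// ApproxEqual compares two floats with a small absolute tolerance.
func ApproxEqual(a, b float64) bool {
	return math.Abs(a-b) < 1e-9
}

func main() {
	single := 0.99 // assumed per-component availability, for illustration only

	fmt.Printf("one component:            %.6f\n", single)
	fmt.Printf("three in series:          %.6f\n", Series(single, single, single))
	fmt.Printf("two replicas in parallel: %.6f\n", Parallel(single, single))
}
```

Under these assumptions, chaining three 99% components in series lowers availability to roughly 97.03%, while placing two 99% replicas in parallel raises it to 99.99%. The gap is the quantitative argument for replacing a lone component with redundant ones.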

## 💻 Patterns

### PostgreSQL HA with Patroni (etcd as DCS)

```yaml
scope: prod-cluster
namespace: /db/
name: pg-node1
restapi:
  listen: 0.0.0.0:8008
etcd3:
  hosts: etcd1:2379,etcd2:2379,etcd3:2379  # 3-node etcd quorum avoids a DCS SPOF
postgresql:
  listen: 0.0.0.0:5432
  data_dir: /var/lib/postgresql/data
  parameters:
    max_connections: 200
    synchronous_commit: "on"
    # Note: with synchronous_mode enabled, Patroni manages this value itself.
    synchronous_standby_names: "ANY 1 (*)"
```

### Multi-region failover (Route53 health check)

```python
import boto3

# ZONE, NLB_EAST, NLB_EAST_ZONE, and HC_EAST are placeholders for your own
# hosted zone ID, NLB DNS name, NLB hosted zone ID, and health check ID.
r53 = boto3.client("route53")
r53.change_resource_record_sets(
    HostedZoneId=ZONE,
    ChangeBatch={
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "api.example.com",
                "Type": "A",
                "SetIdentifier": "us-east-1",
                "Failover": "PRIMARY",  # a matching SECONDARY record covers the standby region
                "AliasTarget": {
                    "DNSName": NLB_EAST,
                    "HostedZoneId": NLB_EAST_ZONE,
                    "EvaluateTargetHealth": True,
                },
                "HealthCheckId": HC_EAST,
            },
        }]
    },
)
```

### Circuit breaker (Go, sony/gobreaker)

```go
import (
	"time"

	"github.com/sony/gobreaker"
)

// Trip after more than 5 consecutive failures; stay open for 30s, then
// allow up to 3 probe requests while half-open.
cb := gobreaker.NewCircuitBreaker(gobreaker.Settings{
	Name:        "payment-svc",
	MaxRequests: 3,
	Timeout:     30 * time.Second,
	ReadyToTrip: func(c gobreaker.Counts) bool {
		return c.ConsecutiveFailures > 5
	},
})

// paymentClient, ctx, and req come from the surrounding service code.
result, err := cb.Execute(func() (interface{}, error) {
	return paymentClient.Charge(ctx, req)
})
```

### K8s topology spread constraints (spread across zones)

```yaml
# Fragment: metadata, selector, and containers omitted.
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels: {app: api}
```

### Multi-DNS (NS1 + Route53 anycast)

```python
# Both providers serve the same zone, so resolution survives a
# single-provider outage (e.g. Dyn 2016). Hostnames are illustrative.
PROVIDERS = ["dns1.p01.nsone.net", "ns-2048.awsdns-64.com"]
# Register NS records for both providers at the registrar; resolvers
# retry another nameserver when one fails to answer.
```

### Chaos test for SPOF discovery

```yaml
# Chaos Mesh: fail one random pod every hour and observe the SLO.
# Fragment: metadata omitted. `scheduler` is Chaos Mesh 1.x syntax;
# 2.x uses a separate Schedule resource.
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
spec:
  action: pod-failure
  mode: one
  duration: "60s"
  selector:
    namespaces: [prod]
  scheduler:
    cron: "@every 1h"
```

### CRDT for leaderless replication (Yjs)

```javascript
import * as Y from 'yjs'
import { WebsocketProvider } from 'y-websocket'

const doc = new Y.Doc()
// Multiple providers on the same doc: no single broker SPOF
new WebsocketProvider('wss://ws1.app', 'room', doc)
new WebsocketProvider('wss://ws2.app', 'room', doc)
const map = doc.getMap('state')
```

## Decision criteria

| Situation | Approach |
|---|---|
| 99.9% SLA | Multi-AZ, single region |
| 99.99% SLA | Multi-region active-active |
| 99.999% SLA | Multi-cloud + multi-DNS + chaos engineering |
| Stateful (DB) | Patroni / RDS Multi-AZ / Spanner |
| Stateless | LB + auto-scale + zone spreading |

**Default**: Multi-AZ active-active + circuit breakers + a monthly chaos drill.

## 🔗 Graph

- Parent: [[High-Availability]]
- Applications: [[Disaster-Recovery]] · [[Multi-Region]] · [[Chaos-Engineering]]
- Adjacent: [[CAP-Theorem]] · [[SRE]]

## 🤖 LLM usage

**When to use**: architecture review for SPOF spotting, postmortem analysis, runbook generation, dependency graph summarization.

**When not to use**: real-time failover decisions. Use deterministic health checks and orchestrators instead.

## ❌ Anti-patterns

- **Hidden SPOF**: a shared dependency (DNS, secrets manager, internal CA) buried three layers deep.
- **Untested DR**: a passive standby that has never been failed over to, so bit-rot is discovered at the worst possible time.
- **Multi-AZ ≠ multi-region**: correlated failures across AZs (control plane, BGP) still happen within a single region.
- **Human SPOF**: a senior engineer leaves and no one else knows the deploy script.

## 🧪 Verification / Duplicates

- Verified (Google SRE Book, AWS Well-Architected Reliability Pillar).
- Trust level: A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: SPOF layers, HA patterns, chaos engineering |