id: wiki-2026-0508-spof
title: SPOF (Single Point of Failure)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: Single Point of Failure, SPoF
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: spof, reliability, ha, distributed-systems, sre
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: Go (language), Kubernetes (framework)

SPOF (Single Point of Failure)

One-line summary

"A single dependency whose failure takes down the entire system." It is the most basic anti-pattern in reliability engineering, and it is removed through redundancy, replication, and failover. Even in the 2020s cloud era, the BGP misconfiguration at Facebook (2021), the Cloudflare control plane outage (2023), and repeated AWS us-east-1 incidents (through 2024) are dramatic demonstrations of region-level and provider-level SPOFs.

Key points

Layers of SPOF

  • Hardware: single PSU, single NIC, single rack, single AZ.
  • Network: single ISP, single BGP route, single DNS provider.
  • Software: leader without standby, single DB primary, single secret store.
  • Human: bus-factor 1 — only one person knows the system.
  • Vendor: single cloud, single CDN, single auth provider.
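
The layers above can be checked mechanically once an inventory exists. The following is a minimal sketch, assuming a hand-maintained inventory; the `Component` type, the `find_spofs` helper, and every entry in `inventory` are hypothetical names introduced for illustration, not part of any tool.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    name: str
    layer: str        # "hardware", "network", "software", "human", or "vendor"
    instances: int    # number of independent copies
    failure_domains: set = field(default_factory=set)  # e.g. AZs, ISPs, people


def find_spofs(components):
    """Return (name, reason) pairs for components that look like a SPOF."""
    findings = []
    for c in components:
        if c.instances < 2:
            findings.append((c.name, "single instance"))
        elif len(c.failure_domains) < 2:
            findings.append((c.name, "all instances share one failure domain"))
    return findings


# Hypothetical inventory covering several layers.
inventory = [
    Component("db-primary", "software", 1, {"az-a"}),
    Component("api", "software", 3, {"az-a"}),
    Component("dns", "network", 2, {"provider-x", "provider-y"}),
    Component("deploy-script-owner", "human", 1, {"alice"}),
]

for name, reason in find_spofs(inventory):
    print(f"{name}: {reason}")
```

Note that three replicas in one AZ are still flagged: instance count alone says nothing about whether the copies fail together.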

Removal patterns

  • Redundancy: N+1, N+2, 2N for power/cooling.
  • Replication: multi-master (CRDT), consensus-based with an elected leader (Raft), multi-AZ DB.
  • Failover: active-passive, active-active, anycast.
  • Bulkhead: cell-based architecture, blast-radius limit.
  • Graceful degradation: read-only mode, stale cache fallback.
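
Why redundancy helps can be shown with basic availability arithmetic. This is a rough sketch that assumes component failures are independent, which correlated failures (shared power, shared control plane) violate in practice; the function names and the sample availabilities are illustrative only.

```python
from math import prod


def series(*availabilities):
    """Every component is required, so availabilities multiply."""
    return prod(availabilities)


def parallel(availability, n):
    """n independent redundant copies: the group fails only if all n fail."""
    return 1 - (1 - availability) ** n


single = 0.99  # illustrative availability of one copy
print(f"1 copy:   {parallel(single, 1):.6f}")
print(f"2 copies: {parallel(single, 2):.6f}")
# A chain of LB -> redundant app pair -> DB is limited by its weakest stage.
print(f"chain:    {series(0.999, parallel(single, 2), 0.999):.6f}")
```

Under these assumptions a second copy turns two nines into four, while each non-redundant stage in a chain pulls the total back down.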

Applications

  1. Multi-AZ / multi-region cloud architecture.
  2. Database HA (Patroni, RDS Multi-AZ, Spanner).
  3. Multi-CDN / multi-DNS strategy.
  4. Cell-based isolation (AWS Lambda, Slack).

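💻 Patterns

PostgreSQL HA with Patroni (etcd/Raft as DCS)

scope: prod-cluster
namespace: /db/
name: pg-node1
restapi:
  listen: 0.0.0.0:8008
  connect_address: pg-node1:8008      # hypothetical hostname
etcd3:                                # etcd (Raft) holds the leader lock
  hosts: etcd1:2379,etcd2:2379,etcd3:2379
bootstrap:
  dcs:
    synchronous_mode: true            # Patroni then manages synchronous_standby_names
    postgresql:
      parameters:
        max_connections: 200          # cluster-wide; set via DCS, not per node
        synchronous_commit: "on"
postgresql:
  listen: 0.0.0.0:5432
  connect_address: pg-node1:5432      # hypothetical hostname
  data_dir: /var/lib/postgresql/data
  # authentication (superuser and replication credentials) omitted here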

Multi-region failover (Route53 health check)

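import boto3

# Hypothetical placeholders; substitute real IDs and DNS names.
ZONE, HC_EAST = "ZEXAMPLEZONE", "example-health-check-id"
NLB_EAST, NLB_EAST_ZONE = "east-nlb.example.com", "ZEXAMPLEEAST"
NLB_WEST, NLB_WEST_ZONE = "west-nlb.example.com", "ZEXAMPLEWEST"

def failover_record(region, role, dns_name, alias_zone, health_check=None):
    """Build one failover alias record; role is "PRIMARY" or "SECONDARY"."""
    record = {
        "Name": "api.example.com",
        "Type": "A",
        "SetIdentifier": region,
        "Failover": role,
        "AliasTarget": {"DNSName": dns_name, "HostedZoneId": alias_zone,
                        "EvaluateTargetHealth": True},
    }
    if health_check:
        record["HealthCheckId"] = health_check
    return {"Action": "UPSERT", "ResourceRecordSet": record}

r53 = boto3.client("route53")
r53.change_resource_record_sets(HostedZoneId=ZONE, ChangeBatch={"Changes": [
    failover_record("us-east-1", "PRIMARY", NLB_EAST, NLB_EAST_ZONE, HC_EAST),
    # Without a SECONDARY record there is nothing to fail over to.
    # The us-west-2 secondary is an assumed second region.
    failover_record("us-west-2", "SECONDARY", NLB_WEST, NLB_WEST_ZONE),
]})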

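Circuit breaker (Go, sony/gobreaker)

package payment

import (
    "context"
    "time"
    "github.com/sony/gobreaker"
)

// Charger is a hypothetical interface standing in for the payment client.
type Charger interface {
    Charge(ctx context.Context, req any) (any, error)
}

var cb = gobreaker.NewCircuitBreaker(gobreaker.Settings{
    Name:        "payment-svc",
    MaxRequests: 3,                // probe requests allowed while half-open
    Timeout:     30 * time.Second, // how long to stay open before half-open
    ReadyToTrip: func(c gobreaker.Counts) bool {
        return c.ConsecutiveFailures > 5 // trips on the 6th consecutive failure
    },
})

// Charge returns gobreaker.ErrOpenState immediately while the breaker is open.
func Charge(ctx context.Context, client Charger, req any) (any, error) {
    return cb.Execute(func() (interface{}, error) {
        return client.Charge(ctx, req)
    })
}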

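K8s topology spread constraints (spread across zones)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels: {app: api}
  template:
    metadata:
      labels: {app: api}
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels: {app: api}
      containers:
        - name: api
          image: example/api:1.0   # hypothetical image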

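Multi-DNS (NS1 + Route53 anycast)

# Both providers serve the same zone, so resolution survives a single-provider
# outage (e.g. Dyn 2016). Hostnames below are illustrative.
PROVIDERS = ["dns1.p01.nsone.net", "ns-2048.awsdns-64.com"]
# Register NS records for both providers at the registrar; resolvers retry the
# other nameservers when one provider stops answering.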

Chaos test for SPOF discovery

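# Chaos Mesh 2.x: fail one random pod every hour, then observe the SLO.
# (1.x used an inline `scheduler.cron` field; 2.x uses a separate Schedule object.)
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: hourly-pod-failure   # hypothetical name
spec:
  schedule: "0 * * * *"      # hourly
  type: PodChaos
  historyLimit: 5
  concurrencyPolicy: Forbid
  podChaos:
    action: pod-failure
    mode: one
    duration: "60s"
    selector:
      namespaces: [prod]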

CRDT for leaderless replication (Yjs)

import * as Y from 'yjs'
import { WebsocketProvider } from 'y-websocket'

const doc = new Y.Doc()
// Multiple providers — no single broker SPOF
new WebsocketProvider('wss://ws1.app', 'room', doc)
new WebsocketProvider('wss://ws2.app', 'room', doc)
const map = doc.getMap('state')

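Decision criteria

| Situation     | Approach                                        |
|---------------|-------------------------------------------------|
| 99.9% SLA     | Multi-AZ, single region                         |
| 99.99% SLA    | Multi-region active-active                      |
| 99.999% SLA   | Multi-cloud + multi-DNS + chaos engineering     |
| Stateful (DB) | Patroni / RDS Multi-AZ / Spanner                |
| Stateless     | LB + auto-scale + anti-affinity                 |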

Default: multi-AZ active-active + circuit breakers + a monthly chaos drill.
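
To make the SLA tiers concrete, each target can be converted into a yearly downtime budget. This is a small arithmetic sketch; the function name and the 365-day year are assumptions made for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_budget_minutes(sla_percent, period_minutes=MINUTES_PER_YEAR):
    """Allowed downtime for an availability target over the given period."""
    return period_minutes * (1 - sla_percent / 100)


for sla in (99.9, 99.99, 99.999):
    print(f"{sla}% -> {downtime_budget_minutes(sla):.1f} min/year")
```

Each extra nine cuts the budget by a factor of ten, which is why the higher tiers call for failover that needs no human in the loop.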

🔗 Graph

🤖 LLM usage

When to use: architecture review for SPOF spotting, postmortem analysis, runbook generation, dependency graph summarization. When not to use: real-time failover decisions; use deterministic health checks and orchestrators instead.
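
A deterministic health-check rule of the kind meant here can be very small. The sketch below is illustrative only: `HealthTracker`, `choose_target`, and the threshold of three consecutive failures are assumptions, not the behaviour of any particular orchestrator.

```python
class HealthTracker:
    """Marks a target unhealthy after `threshold` consecutive failed checks."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, check_passed):
        # A single success resets the counter; failures accumulate.
        if check_passed:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1

    @property
    def healthy(self):
        return self.consecutive_failures < self.threshold


def choose_target(primary, secondary):
    """Prefer the primary; fail over only when it is unhealthy."""
    if primary.healthy:
        return "primary"
    if secondary.healthy:
        return "secondary"
    return "none"  # both down: the caller should degrade gracefully


primary, secondary = HealthTracker(), HealthTracker()
for passed in (True, False, False, False):
    primary.record(passed)
print(choose_target(primary, secondary))
```

The same inputs always produce the same decision, which makes the rule testable in a chaos drill and auditable in a postmortem.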

Anti-patterns

  • Hidden SPOF: shared dependency (DNS, secrets manager, internal CA) buried 3 layers deep.
  • DR untested: a passive standby that has never been failed over to, so bit-rot is discovered at the worst possible time.
  • Multi-AZ ≠ multi-region: AZ-correlated failures (control plane, BGP) still happen.
  • Human SPOF: senior engineer leaves, no one knows the deploy script.
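
Hidden shared dependencies like the ones in the first bullet can be surfaced by walking a dependency graph. This is a minimal sketch over a hypothetical graph; the service names and both helper functions are invented for illustration, and a shared dependency is only a SPOF if it is not itself redundant.

```python
def transitive_deps(graph, node, seen=None):
    """Return every dependency reachable from `node` (depth-first)."""
    seen = set() if seen is None else seen
    for dep in graph.get(node, ()):
        if dep not in seen:
            seen.add(dep)
            transitive_deps(graph, dep, seen)
    return seen


def shared_by_all(graph, services):
    """Dependencies that every listed service reaches, directly or indirectly."""
    dep_sets = [transitive_deps(graph, s) for s in services]
    return set.intersection(*dep_sets) if dep_sets else set()


# Hypothetical graph: service -> direct dependencies.
graph = {
    "checkout": ["payments", "cart"],
    "payments": ["vault"],
    "cart": ["redis"],
    "search": ["index"],
    "index": ["config-svc"],
    "vault": ["internal-ca"],
    "config-svc": ["internal-ca"],
}

print(shared_by_all(graph, ["checkout", "search"]))
```

Here neither service lists the internal CA directly, yet both reach it three hops down, which is the pattern that direct-dependency reviews miss.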

🧪 Verification / duplicates

  • Verified (Google SRE Book, AWS Well-Architected Reliability Pillar).
  • Trust level: A.

🕓 Changelog

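| Date       | Change                                                          |
|------------|-----------------------------------------------------------------|
| 2026-05-08 | Phase 1                                                         |
| 2026-05-10 | Manual cleanup: SPOF layers, HA patterns, chaos engineering     |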