---
id: productivity-oncall-playbook
title: On-call Playbook - Rotation / Runbook / Escalation
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [productivity, oncall, sre, vibe-coding]
tech_stack: { language: "Process", applicable_to: ["Engineering"] }
applied_in: []
aliases: [oncall, on-call, runbook, escalation, paging, alert fatigue, rotation]
---

# On-call Playbook

> A 24/7 service means someone gets paged. The essentials are **rotation + runbook + escalation + alert quality**. Prevent burnout with clear rules and regular reviews.

## 📖 Core concepts

- Primary / Secondary: the first responder plus a backup.
- Runbook: which alert maps to which actions.
- Escalation: no acknowledgement within the ack window (5 min per step in the policy below) → the page goes to the next responder.
- Alert quality: actionable, with little noise.

## 💻 Code patterns

### Rotation structure
```
Weekly rotation (default):
- Mon 9am → Mon 9am (next week)
- Primary + Secondary
- Handoff meeting

Alternatives:
- Daily rotation (small teams)
- Follow-the-sun (global teams)
- Rotation every 4 weeks (spreads fatigue)
```

### Compensation
```
- Extra pay (1.5x hourly rate)
- Comp time (days off)
- Annual bonus
- Or "voluntary + free meals"
→ Write the policy down. No compensation = burnout.
```

### Runbook (per alert)
````markdown
# Alert: HighErrorRate-API
**Severity:** P2
**Description:** API error rate > 5% for 5 minutes

## Investigate
1. Check Grafana dashboard:
2. Check service logs (last 15 min; `--tail=-1` because a label selector defaults to 10 lines per pod):
   ```
   kubectl logs -l app=api -n prod --since=15m --tail=-1 | grep -i error
   ```
3. Check recent deploys:
   ```
   kubectl rollout history deployment/api -n prod
   ```
4. Check upstream services (DB, Redis):
   - DB:
   - Redis:

## Mitigate
1. If recent deploy (< 1 hour) → rollback:
   ```
   kubectl rollout undo deployment/api -n prod
   ```
2. If DB pool exhaustion → scale up replicas (only if the DB has spare connection capacity):
   ```
   kubectl scale deployment/api -n prod --replicas=10
   ```
3. If upstream down → enable maintenance mode:
   ```
   curl -X POST "$ADMIN/maintenance/enable"
   ```

## Escalate
- DB issue → @db-team
- Network issue → @platform
- Security → @security

## Resolve
- Wait for error rate < 1% for 10 min
- Update incident channel
- Schedule postmortem
````

### Alert quality (alerts that often misfire get disabled or fixed)
```
Good alerts:
✅ Clear user impact
✅ Actionable: it is obvious what to do
✅ Page rarely (few false positives)
✅ Severity stated

Bad alerts:
❌ Noise (pages every time CPU > 80% for 5 seconds)
❌ Vague ("something is wrong")
❌ Self-resolving (if it clears on its own, keep it as monitoring only)
```

### Defining page-worthy
```
Pageable (P1/P2):
- Large user impact (5%+ of users)
- Money loss
- Security
- SLO breach imminent

Non-pageable (P3/P4):
- Internal tool down
- Small batch job failure
- Disk usage at 50% (there is still time)

→ Non-pageable items go to a ticket or Slack only.
```

### Alert routing
```yaml
# PagerDuty / Opsgenie / Grafana OnCall
# Illustrative schema (not any one tool's actual format)
# Routes by service / by severity
services:
  api:
    primary: oncall-api-primary
    severity: [P1, P2]
  database:
    primary: oncall-db
    severity: [P1]
  ml:
    primary: oncall-ml
    business_hours_only: true
```

### 24/7 vs business hours
```
Not every alert needs 24/7 coverage.
- API: 24/7
- ML training: business hours
- Internal tools: business hours
→ Overnight pages are for truly critical issues only.
```

### Escalation policy
```
1. Primary (5 min ack)
   ↓ no ack
2. Secondary (5 min ack)
   ↓ no ack
3. Manager
   ↓ no ack
4. VP Eng
```
→ Escalation is automatic, so there is a backup even if someone sleeps through a page.

### Handoff (end of rotation)
```markdown
# On-call handoff: 2026-05-09

## Open incidents
- [ ] DB lag investigation (Grafana )

## Recent issues
- 14:00 API spike → resolved

## Pending
- Migration scheduled 2026-05-12

## Notes
- DB upgrade Tuesday, be careful
```
→ A 30 min meeting or a Slack post.

### Role of the secondary
```
- Primary does not respond → the page escalates automatically
- Primary is in deep work → secondary takes the first response
- Big incident → both work on it
```

### Work while on call
```
- No large or risky tasks (hard to focus)
- Code review and small tasks are fine
- Cut down on meetings
- Improve tools and runbooks (long-term value)
- Fix long-deferred bugs
```

### Notification channels
```
- Phone call (P1)
- SMS / push notification (P2)
- Slack (P3)
- Email (P4)
→ One channel per severity.
```

### Apps
```
- PagerDuty / Opsgenie: the standard choices
- Grafana OnCall: OSS, free
- Better Stack: modern
- FireHydrant / incident.io: incident management
```

### Metrics (on-call health)
```
- Pages per shift (target: < 3)
- After-hours pages (target: < 1 / week)
- Resolution time (P1 target: < 1h)
- Action item completion rate
- Burnout survey (quarterly)
```
→ Too many pages means the process or the system needs fixing.

### Paged during sleep
```
1. Phone on vibrate + loud ringtone
2. Paging app with emergency override
3. ACK, then open the laptop / connect the VPN
4. Join the Slack channel
5. Investigate + mitigate
6. Update the status page
→ Recovery: starting late the next day is fine.
```

### Tool preparation
```
- Laptop within reach
- VPN auto-connect
- Key dashboards bookmarked
- kubectl contexts saved
- Runbooks available offline (downloaded)
- Fast SSH / kubectl access
```

### Don't be a hero
```
Do not work a complex incident alone.
- Ping additional on-call engineers
- Ask for help in Slack
- Wake the manager (P1 / large impact)
- Keep a decision log (chronological)
→ Asking for help beats making the wrong call.
```

### Post-incident
```
1. Resolve the status page
2. Internal communication
3. Schedule the postmortem (within 1 week)
4. Create action items
5. Recovery: start late the next day
```

### Vacation / time-off
```
Agree on cover at least 1 week before the shift:
- Swap with someone else
- Change the schedule
→ Last-minute changes put a burden on others.
```

### Onboarding new on-call engineers
```
1. Shadow 2 rotations (with a senior)
2. Handle a small incident with a mentor
3. First solo rotation (paired with a senior secondary)
4. Independent
→ Build trust. Do not throw people in cold.
```

### Cross-team escalation
```
- DB team: DB issues
- Platform: K8s / infra
- Security: auth / leaks
- Product: user-facing decisions
→ Via the team's Slack channel or a direct page.
```

### "Quiet" rotation (good!)
```
0 pages = good.
"No alarms this week" = the system is stable.
→ Nothing to brag about, but worth celebrating.
```

### Burnout signs
```
- Trouble sleeping
- Working every weekend
- No family time
- The same alert firing repeatedly (broken process)
- Action items ignored
→ These are signals for the manager: change the rotation, hire, or improve the system.
```

### Improve on-call
```
- Quarterly retro (on-call team)
- Fix the top 5 noisiest alerts every quarter
- Update runbooks
- Auto-remediation (handle frequent alerts automatically)
```

### Auto-remediation
```ts
// Example: high CPU → scale up automatically.
// `k8s` and `slack` are hypothetical clients provided elsewhere, not a specific SDK.
declare const k8s: { scale(name: string, replicas: number): Promise<void> };
declare const slack: { notify(message: string): Promise<void> };

async function remediate(alert: string, service: string, currentReplicas: number): Promise<void> {
  if (alert === 'HighCPU' && service === 'api') {
    const target = currentReplicas * 2;
    await k8s.scale('api', target);
    await slack.notify(`Auto-scaled API to ${target}`);
  }
}
```
→ Automate repetitive alerts to reduce the on-call load.

### Page latency budget
```
SLO: P1 alert → 5 min ack, 30 min mitigate, 4 hour resolve.
Measure every quarter; if the targets are missed, fix the process / tooling.
```

## 🤔 Decision criteria

| Situation | Recommendation |
|---|---|
| Typical web service | Weekly rotation + runbook |
| Small team (fewer than 4) | External SaaS / 24/7 vendor |
| Large service | 24/7 + secondary + escalation |
| Internal tool | Business hours |
| Customer-facing | Strict SLO + paging |

## ❌ Anti-patterns

- **Hero culture (one person does everything)**: leads to burnout. Work as a team.
- **No compensation**: people burn out or leave.
- **Alert noise**: the cry-wolf effect, where real alerts get ignored.
- **No runbook**: guessing from scratch every time.
- **Junior alone**: shadow first.
- **Rotating too often (daily)**: context loss.
- **Rotating too rarely (once a month)**: responders get rusty and stall when paged.

## 🤖 LLM usage hints

- Write a runbook for every alert.
- Hold an alert-quality review every quarter.
- State compensation explicitly.
- Monitor for burnout.

## 🔗 Related documents

- [[Productivity_Postmortem]]
- [[DevOps_Observability_Stack]]
- [[Backend_Health_Check_Patterns]]
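
## 🧩 Appendix: escalation sketch

The escalation policy in this playbook (Primary → Secondary → Manager → VP Eng, with 5 min ack windows on the first two steps) can be expressed as a small pure function. What follows is a minimal sketch, not the API of PagerDuty, Opsgenie, or any other paging tool. The names `EscalationStep`, `chain`, and `currentTarget` are illustrative, and the 10 min window for the manager step is an assumption, since the policy does not state one.

```typescript
// One step in an escalation chain.
// ackTimeoutMin = null marks the final step (nobody left to escalate to).
interface EscalationStep {
  target: string;
  ackTimeoutMin: number | null;
}

// Chain taken from the "Escalation policy" section.
const chain: EscalationStep[] = [
  { target: "primary", ackTimeoutMin: 5 },
  { target: "secondary", ackTimeoutMin: 5 },
  { target: "manager", ackTimeoutMin: 10 }, // ASSUMPTION: not specified in the playbook
  { target: "vp-eng", ackTimeoutMin: null },
];

// Returns who should hold the page after `minutesUnacked` minutes
// without an acknowledgement.
function currentTarget(steps: EscalationStep[], minutesUnacked: number): string {
  let deadline = 0; // minute at which the current step's ack window ends
  let last: string | undefined;
  for (const step of steps) {
    last = step.target;
    if (step.ackTimeoutMin === null) return step.target;
    deadline += step.ackTimeoutMin;
    if (minutesUnacked < deadline) return step.target;
  }
  if (last === undefined) throw new Error("escalation chain is empty");
  return last; // every window expired: stay on the last step
}
```

Keeping the chain as data makes it easy to review in a quarterly retro, and the function can be unit-tested without touching a real paging system.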