| id | title | category | status | source_trust_level | verification_status | created_at | updated_at | tags | tech_stack | applied_in | aliases |
|---|---|---|---|---|---|---|---|---|---|---|---|
| productivity-oncall-playbook | On-call Playbook: Rotation / Runbook / Escalation | Coding | draft | B | conceptual | 2026-05-09 | 2026-05-09 | | | | |
On-call Playbook
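A 24/7 service means someone gets paged. A workable on-call setup combines rotation, runbooks, escalation, and alert quality. Clear rules and regular reviews prevent burnout.
📖 Core concepts
- Primary / Secondary: the first responder plus a backup.
- Runbook: which alert calls for which action.
- Escalation: no response within the agreed window (e.g. 30 min) → the page moves to the next responder.
- Alert quality: actionable, with little noise.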
💻 Code patterns
Rotation structure
Weekly rotation (default):
- Mon 9am → Mon 9am (next week)
- Primary + Secondary
- Handoff meeting
Alternatives:
- Daily rotation (small teams)
- Follow-the-sun (global teams)
- Rotation every 4 weeks (spreads out fatigue)
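The weekly Mon 9am → Mon 9am rotation can be sketched as a small lookup. This is only an illustration: the roster names, the `ROTATION_START` anchor, and the rule that next week's primary acts as this week's secondary are assumptions, not part of the playbook. Datetimes are naive, so a real schedule would also need a time zone.

```python
from datetime import datetime, timedelta

# Hypothetical roster; the names are placeholders.
ROSTER = ["alice", "bob", "carol", "dave"]

# Assumed anchor: a Monday 09:00 at which ROSTER[0] starts as primary.
ROTATION_START = datetime(2026, 1, 5, 9, 0)  # 2026-01-05 is a Monday


def on_call(now, roster=ROSTER, start=ROTATION_START):
    """Return (primary, secondary) for a weekly Mon 9am -> Mon 9am rotation."""
    if now < start:
        raise ValueError("now is before the rotation start")
    # Whole weeks since the anchor; the handoff happens exactly at Mon 09:00.
    weeks_elapsed = (now - start) // timedelta(weeks=1)
    primary = roster[weeks_elapsed % len(roster)]
    # Assumption: next week's primary serves as this week's secondary.
    secondary = roster[(weeks_elapsed + 1) % len(roster)]
    return primary, secondary
```

For example, `on_call(datetime(2026, 1, 12, 9, 0))` falls on the first handoff, so the second roster entry becomes primary.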
Compensation
- Overtime pay (1.5x hourly rate)
- Comp time (a day off)
- Annual bonus
- Or "voluntary + free meals"
→ Write the policy down. No compensation = burnout.
Runbook (per alert)
# Alert: HighErrorRate-API
**Severity:** P2
**Description:** API error rate > 5% for 5 minutes
## Investigate
1. Check Grafana dashboard: <link>
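2. Check service logs (last 15 min; `--tail=-1` lifts the 10-lines-per-pod default that applies when a label selector is used):
kubectl logs -l app=api -n prod --since=15m --tail=-1 | grep -i error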
3. Check recent deploys:
kubectl rollout history deployment/api -n prod
4. Check upstream services (DB, Redis):
- DB: <Grafana DB dashboard>
- Redis: <Grafana Redis>
## Mitigate
1. If recent deploy (< 1 hour) → rollback:
kubectl rollout undo deployment/api -n prod
2. If the per-pod DB connection pool is exhausted (and the database still has connection headroom) → scale up replicas:
kubectl scale deployment/api -n prod --replicas=10
3. If upstream down → enable maintenance mode:
curl -X POST $ADMIN/maintenance/enable
## Escalate
- DB issue → @db-team
- Network issue → @platform
- Security → @security
## Resolve
- Wait for error rate < 1% for 10 min
- Update incident channel
- Schedule postmortem
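The runbook's firing condition (error rate > 5% for 5 minutes) and its resolve condition (error rate < 1% for 10 min) are both "sustained threshold" checks. A minimal sketch, assuming one error-rate sample per minute with the most recent sample last; the function names are hypothetical:

```python
def sustained(rates, threshold, minutes, above):
    """True if the last `minutes` samples are all above (or all below) threshold."""
    if len(rates) < minutes:
        return False  # not enough data to decide either way
    window = rates[-minutes:]
    if above:
        return all(r > threshold for r in window)
    return all(r < threshold for r in window)


def should_fire(rates):
    # Alert condition from the runbook: error rate > 5% for 5 minutes.
    return sustained(rates, 0.05, 5, above=True)


def can_resolve(rates):
    # Resolve condition from the runbook: error rate < 1% for 10 minutes.
    return sustained(rates, 0.01, 10, above=False)
```

Requiring the whole window to satisfy the condition is what keeps a single noisy sample from paging or from closing the incident too early.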
Alert quality (fires falsely often = disable it or fix it)
Good alerts:
✅ Clear user impact
✅ Actionable: it is clear what to do
✅ Pages rarely (few false positives)
✅ Severity stated
Bad alerts:
❌ Noise (CPU > 80% for 5 seconds, every time)
❌ Vague ("something is wrong")
❌ Self-resolving (if it clears on its own, keep it as monitoring only, not a page)
Defining page-worthy
Pageable (P1/P2):
- Large user impact (5%+ of users)
- Money loss
- Security
- Imminent SLO breach
Non-pageable (P3/P4):
- Internal tool down
- Small batch job failure
- Disk usage at 50% (there is still time)
→ The latter go to a ticket or Slack only.
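The page-worthy criteria above can be written as a single predicate. This is a sketch, not a real alerting API: the `Alert` fields and the `destination` labels are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    users_affected_pct: float = 0.0  # share of users impacted, in percent
    money_loss: bool = False
    security: bool = False
    slo_breach_imminent: bool = False


def is_pageable(alert):
    """Pageable if any of the four criteria from the list holds."""
    return (
        alert.users_affected_pct >= 5.0
        or alert.money_loss
        or alert.security
        or alert.slo_breach_imminent
    )


def destination(alert):
    # Non-pageable alerts go to a ticket or Slack instead of a pager.
    return "page" if is_pageable(alert) else "ticket-or-slack"
```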
Alert routing
# PagerDuty / Opsgenie / Grafana Oncall
# By service / by severity
services:
  api:
    primary: oncall-api-primary
    severity: [P1, P2]
  database:
    primary: oncall-db
    severity: [P1]
  ml:
    primary: oncall-ml
    business_hours_only: true
24/7 vs business hours
Not every alert needs 24/7 coverage.
- API: 24/7
- ML training: business hours
- Internal tools: business hours
→ Only truly critical alerts should fire in the middle of the night.
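The routing config and the business-hours rule above can be combined into one lookup. A minimal sketch: the `ROUTES` dictionary mirrors the example config, while the Mon-Fri 09:00-18:00 definition of business hours and the `route` function itself are assumptions, not the API of PagerDuty, Opsgenie, or Grafana OnCall.

```python
from datetime import datetime

ROUTES = {
    "api": {"primary": "oncall-api-primary", "severity": {"P1", "P2"}},
    "database": {"primary": "oncall-db", "severity": {"P1"}},
    "ml": {"primary": "oncall-ml", "business_hours_only": True},
}


def in_business_hours(now):
    # Assumption: business hours are Mon-Fri, 09:00-18:00 local time.
    return now.weekday() < 5 and 9 <= now.hour < 18


def route(service, severity, now):
    """Return the schedule to page, or None if this alert should not page now."""
    cfg = ROUTES.get(service)
    if cfg is None:
        return None  # unknown service: nothing to page
    allowed = cfg.get("severity")
    if allowed is not None and severity not in allowed:
        return None  # severity is not pageable for this service
    if cfg.get("business_hours_only") and not in_business_hours(now):
        return None  # hold until business hours
    return cfg["primary"]
```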
Escalation policy
1. Primary (5 min ack)
↓ no ack
2. Secondary (5 min ack)
↓ no ack
3. Manager
↓ no ack
4. VP Eng
→ Escalate automatically, so a backup is reached even if someone sleeps through the page.
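The chain above can be sketched as a function from "minutes without an ack" to the tier currently being paged. The 5 min windows for Primary and Secondary come from the policy; the 5 min window for Manager is an assumption, since the policy does not state one, and the tier names are placeholders.

```python
ESCALATION_CHAIN = [
    ("primary", 5),
    ("secondary", 5),
    ("manager", 5),    # assumption: no ack window is given for this tier
    ("vp-eng", None),  # last tier, nothing to escalate to
]


def current_tier(minutes_without_ack, chain=ESCALATION_CHAIN):
    """Return the tier that should be paged after this long without an ack."""
    elapsed = 0
    for name, window in chain:
        if window is None or minutes_without_ack < elapsed + window:
            return name
        elapsed += window
    return chain[-1][0]
```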
Handoff (end of rotation)
# On-call handoff: 2026-05-09
## Open incidents
- [ ] DB lag investigation (Grafana <link>)
## Recent issues
- 14:00 API spike → resolved
## Pending
- Migration scheduled 2026-05-12
## Notes
- DB upgrade Tuesday — be careful
→ A 30 min meeting or a Slack post.
Role of the secondary
- Primary does not respond → the page escalates automatically to the secondary
- Primary is in deep work → the secondary takes first response
- Big incident → both work on it
Work during on-call
- No large or risky tasks (focus is hard to keep)
- Code review and small tasks are fine
- Cut down on meetings
- Improve tools and runbooks (long-term value)
- Fix long-postponed bugs
Notification channels
- Phone call (P1)
- SMS / Push notification (P2)
- Slack (P3)
- Email (P4)
→ One channel per severity.
Apps
- PagerDuty / Opsgenie: the standard choices
- Grafana Oncall: OSS, free
- Better Stack: modern
- FireHydrant / incident.io: incident management
Metrics (oncall health)
- Pages per shift (target: < 3)
- After-hours pages (target: < 1 / week)
- Resolution time (P1: < 1h)
- Action item completion rate
- Burnout survey (quarterly)
→ Too many pages means the process or the system needs fixing.
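The first three metrics can be computed from a shift's page timestamps and P1 resolution times. A rough sketch, assuming a one-week shift and treating anything outside Mon-Fri 09:00-18:00 as after hours; both assumptions and all names are illustrative.

```python
from datetime import datetime


def is_after_hours(ts):
    # Assumption: Mon-Fri 09:00-18:00 counts as working hours.
    return ts.weekday() >= 5 or not (9 <= ts.hour < 18)


def shift_health(page_times, p1_resolution_minutes):
    """Compare one weekly shift against the targets listed above."""
    pages = len(page_times)
    after_hours = sum(1 for t in page_times if is_after_hours(t))
    return {
        "pages": pages,
        "after_hours_pages": after_hours,
        "pages_ok": pages < 3,              # target: < 3 per shift
        "after_hours_ok": after_hours < 1,  # target: < 1 per week
        "p1_resolution_ok": all(m < 60 for m in p1_resolution_minutes),
    }
```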
Page during sleep
1. Phone on vibrate / loud ringtone
2. Paging app + emergency override
3. ACK, then open the laptop and connect to VPN
4. Join the Slack channel
5. Investigate + mitigate
6. Update the status page
→ Recovery: starting late the next day is fine.
Tool preparation
- Laptop within reach
- VPN auto-connect
- Key dashboards bookmarked
- kubectl contexts saved
- Runbooks available offline (downloaded)
- Fast SSH / kubectl access
Don't be a hero
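Do not handle a complex incident alone.
- Ping additional on-call engineers
- Ask for help in Slack
- Wake up the manager (P1 / large impact)
- Keep a decision log (chronological)
→ Asking for help beats making a wrong decision.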
Post-incident
1. Status page resolve
2. Internal communication
3. Schedule the postmortem (within 1 week)
4. Create action items
5. Recovery: start late the next day
Vacation / time-off
Agree on cover at least 1 week before the on-call shift:
- Swap with someone else
- Change the schedule
→ Last-minute changes put a burden on others.
Onboarding new oncall
1. Shadow 2 rotations (with a senior)
2. Handle a small incident with a mentor
3. First solo rotation (paired with a senior secondary)
4. Independent
→ Build trust. Do not throw people in at the deep end.
Cross-team escalation
- DB team: DB issues
- Platform: K8s / infra
- Security: auth / leaks
- Product: user-facing decisions
→ Use the Slack channel or page them directly.
"Quiet" rotation (good!)
0 pages = good.
"No alarms this week" = the system is stable.
→ Not something to brag about, but something to celebrate.
Burnout signs
- Cannot sleep
- Working every weekend
- No family time
- The same alert keeps firing (the process is broken)
- Action items ignored
→ These are signals for the manager: change the rotation, hire, or improve the system.
Improve oncall
- Quarterly retro (oncall team)
- Fix the top 5 noisiest alerts every quarter
- Update runbooks
- Auto-remediation (handle frequent alerts automatically)
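Finding the noisiest alerts for the quarterly review is a simple count over the page log. A minimal sketch, assuming the log is available as a list of alert names; how that list is exported from the paging tool is left open.

```python
from collections import Counter


def noisiest_alerts(page_log, top_n=5):
    """Return the top_n (alert_name, page_count) pairs, most frequent first."""
    return Counter(page_log).most_common(top_n)
```

The result is the shortlist of alerts to fix, tune, or automate in the next quarter.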
Auto-remediation
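// Example: high CPU → scale up automatically
// (`k8s` and `slack` are placeholder clients; `currentReplicas` is the current replica count)
if (alert === 'HighCPU' && service === 'api') {
  const targetReplicas = currentReplicas * 2;
  await k8s.scale('api', targetReplicas);
  await slack.notify(`Auto-scaled API to ${targetReplicas}`);
}
→ Automate repetitive alerts. The on-call burden goes down.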
Page latency budget
SLO: P1 alert → 5 min ack, 30 min mitigate, 4 hour resolve.
Measure every quarter. If the targets are missed, fix the process or the tooling.
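Checking one incident against the P1 budget above can look like this. A sketch under assumptions: timings are minutes elapsed since the page, and the phase keys `ack`, `mitigate`, and `resolve` are names chosen here for illustration.

```python
# P1 budget from the SLO above, in minutes since the page fired.
P1_BUDGET_MIN = {"ack": 5, "mitigate": 30, "resolve": 240}


def budget_breaches(timings_min, budget=P1_BUDGET_MIN):
    """Return the phases whose elapsed minutes exceed the budget."""
    return [phase for phase, limit in budget.items() if timings_min[phase] > limit]
```

Running this over every P1 in a quarter gives the list of incidents to discuss in the review.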
🤔 Decision criteria
| Situation | Recommendation |
|---|---|
| General web service | Weekly rotation + runbook |
| Small team (fewer than 4) | External SaaS / 24/7 vendor |
| Large service | 24/7 + secondary + escalation |
| Internal tool | Business hours |
| Customer-facing | Strict SLO + paging |
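❌ Anti-patterns
- Hero culture (one person does everything): burnout. Make it team work.
- No compensation: people burn out or leave.
- Alert noise: the cry-wolf effect, where real alerts get ignored.
- No runbook: guessing from scratch every time.
- Junior alone: shadow first.
- Rotating too often (every day): context loss.
- Rotating too rarely (once a month): skills get rusty between shifts.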
🤖 LLM usage hints
- Write an explicit runbook for every alert.
- Hold an alert-quality review every quarter.
- State compensation explicitly.
- Monitor for burnout.