---
id: productivity-oncall-playbook
title: On-call Playbook — Rotation / Runbook / Escalation
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [productivity, oncall, sre, vibe-coding]
tech_stack: { language: "Process", applicable_to: ["Engineering"] }
applied_in: []
aliases: [oncall, on-call, runbook, escalation, paging, alert fatigue, rotation]
---
# On-call Playbook
> A 24/7 service means someone gets paged. The essentials: **rotation + runbook + escalation + alert quality**. Prevent burnout with clear rules and regular reviews.
## 📖 Core concepts
- Primary / Secondary: first responder plus a backup.
- Runbook: which alert maps to which actions.
- Escalation: no response within 30 min → page the next person.
- Alert quality: actionable, with little noise.
## 💻 Code patterns
### Rotation structure
```
Weekly rotation (default):
- Mon 9am → Mon 9am (next week)
- Primary + Secondary
- Handoff meeting
Alternatives:
- Daily rotation (small teams)
- Follow-the-sun (global teams)
- 4-week rotation cycle (spreads fatigue)
```
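The weekly rotation above can be computed rather than maintained by hand. The sketch below is one possible approach, not a description of any paging tool: the roster names and the anchor date are hypothetical, and using next week's primary as this week's secondary is an assumed convention.

```typescript
// Hypothetical roster; replace with your own team.
const ROSTER: string[] = ["alice", "bob", "carol", "dave"];
// Assumed anchor: the first handoff, Monday 2026-05-04 09:00 UTC.
const ANCHOR_MS = Date.UTC(2026, 4, 4, 9, 0, 0);
const WEEK_MS = 7 * 24 * 60 * 60 * 1000;

interface Shift {
  primary: string;
  secondary: string;
}

function shiftAt(when: Date, roster: string[] = ROSTER): Shift {
  if (roster.length < 2) {
    throw new Error("need at least two people for primary + secondary");
  }
  // Whole weeks since the anchor; a new shift starts every Monday 09:00 UTC.
  const week = Math.floor((when.getTime() - ANCHOR_MS) / WEEK_MS);
  // Double modulo keeps the index non-negative for dates before the anchor.
  const i = ((week % roster.length) + roster.length) % roster.length;
  // Assumed convention: the secondary is whoever is primary next week.
  return { primary: roster[i], secondary: roster[(i + 1) % roster.length] };
}
```

Because the secondary is next week's primary, each person sees the open issues one week before taking over, which shortens the handoff.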
### Compensation
```
- Overtime pay (1.5x hourly rate)
- Comp time (a day off)
- Annual bonus
- Or "voluntary + free meals"
→ Write the policy down. No compensation = burnout.
```
### Runbook (per alert)
````markdown
# Alert: HighErrorRate-API
**Severity:** P2
**Description:** API error rate > 5% for 5 minutes
## Investigate
1. Check Grafana dashboard:
2. Check service logs (last 15 min):
   ```
   kubectl logs -l app=api -n prod --since=15m | grep -i error
   ```
3. Check recent deploys:
   ```
   kubectl rollout history deployment/api -n prod
   ```
4. Check upstream services (DB, Redis):
   - DB:
   - Redis:
## Mitigate
1. If recent deploy (< 1 hour) → rollback:
   ```
   kubectl rollout undo deployment/api -n prod
   ```
2. If DB pool exhaustion → scale up replicas:
   ```
   kubectl scale deployment/api --replicas=10 -n prod
   ```
3. If upstream down → enable maintenance mode:
   ```
   curl -X POST $ADMIN/maintenance/enable
   ```
## Escalate
- DB issue → @db-team
- Network issue → @platform
- Security → @security
## Resolve
- Wait for error rate < 1% for 10 min
- Update incident channel
- Schedule postmortem
````
### Alert quality (fires wrongly often = disable it or fix it)
```
Good alert:
✅ Clear user impact
✅ Actionable: it is clear what to do
✅ Pages rarely (few false positives)
✅ Severity stated
Bad alert:
❌ Noise (fires every time CPU > 80% for 5 seconds)
❌ Vague ("something is wrong")
❌ Self-resolving (if it fixes itself, monitoring only, no page)
```
```
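The checklist above can be turned into a small lint over alert definitions. This is a rough sketch: the `AlertDef` fields and the 50% false-positive threshold are assumptions made for illustration, not the schema of any monitoring product.

```typescript
// Hypothetical alert definition; field names are assumptions.
interface AlertDef {
  name: string;
  severity?: "P1" | "P2" | "P3" | "P4";
  userImpact?: string; // who is affected and how
  runbookUrl?: string; // what to do when it fires
  fires: number; // times fired in the review window
  falsePositives: number; // fires that needed no action
  autoResolved: number; // fires that cleared without intervention
}

function qualityIssues(a: AlertDef): string[] {
  const issues: string[] = [];
  if (!a.severity) issues.push("severity not stated");
  if (!a.userImpact) issues.push("user impact unclear");
  if (!a.runbookUrl) issues.push("not actionable: no runbook");
  // Assumed threshold: more than half of the fires were false positives.
  if (a.fires > 0 && a.falsePositives / a.fires > 0.5) {
    issues.push("noisy: mostly false positives");
  }
  if (a.fires > 0 && a.autoResolved === a.fires) {
    issues.push("always self-resolves: monitor only, do not page");
  }
  return issues;
}
```

An empty result means the alert passes the checklist; anything else is a candidate for the next alert review.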
### Page-worthy 정의
```
Pageable (P1/P2):
- Large user impact (5%+ of users)
- Money loss
- Security
- SLO breach imminent
Non-pageable (P3/P4):
- Internal tool down
- Small batch job failure
- Disk usage at 50% (plenty of time)
→ The latter go to a ticket / Slack only.
```
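The page-worthy rule above fits in a single predicate. A minimal sketch, assuming a hypothetical `AlertSignal` shape; only the 5% user-impact threshold comes from the list above.

```typescript
// Hypothetical description of what an alert is telling us.
interface AlertSignal {
  affectedUserRatio: number; // 0..1, share of users affected
  moneyLoss: boolean;
  security: boolean;
  sloBreachImminent: boolean;
}

type Destination = "page" | "ticket";

function isPageWorthy(s: AlertSignal): boolean {
  return (
    s.affectedUserRatio >= 0.05 || // 5%+ of users
    s.moneyLoss ||
    s.security ||
    s.sloBreachImminent
  );
}

// Everything that is not page-worthy goes to a ticket / Slack instead.
function destination(s: AlertSignal): Destination {
  return isPageWorthy(s) ? "page" : "ticket";
}
```

For example, an internal tool outage that touches no users and no revenue resolves to `"ticket"`, while any security signal resolves to `"page"` regardless of user impact.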
### Alert routing
```yaml
# PagerDuty / Opsgenie / Grafana Oncall
# By service / by severity (illustrative layout, not a specific tool's schema)
services:
  api:
    primary: oncall-api-primary
    severity: [P1, P2]
  database:
    primary: oncall-db
    severity: [P1]
  ml:
    primary: oncall-ml
    business_hours_only: true
```
### 24/7 vs business hours
```
Not every alert needs 24/7 coverage.
- API: 24/7
- ML training: business hours
- Internal tools: business hours
→ Overnight alerts = truly critical only.
```
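Routing by service and severity, combined with the business-hours rule, might look like the sketch below. The route table mirrors the routing example in this note; the `ml` severities and the definition of business hours (Mon to Fri, 09:00 to 18:00 UTC) are assumptions.

```typescript
type Severity = "P1" | "P2" | "P3" | "P4";

interface Route {
  target: string;
  severities: Severity[];
  businessHoursOnly?: boolean;
}

const ROUTES: Record<string, Route> = {
  api: { target: "oncall-api-primary", severities: ["P1", "P2"] },
  database: { target: "oncall-db", severities: ["P1"] },
  // Assumed severities; the routing example lists none for ml.
  ml: { target: "oncall-ml", severities: ["P1", "P2"], businessHoursOnly: true },
};

// Assumed definition: Monday to Friday, 09:00 to 18:00 UTC.
function isBusinessHours(now: Date): boolean {
  const day = now.getUTCDay(); // 0 = Sunday, 6 = Saturday
  const hour = now.getUTCHours();
  return day >= 1 && day <= 5 && hour >= 9 && hour < 18;
}

// Returns the rotation to page, or null when the alert should not page
// (unknown service, severity not routed, or outside business hours).
function pageTarget(service: string, severity: Severity, now: Date): string | null {
  const route = ROUTES[service];
  if (!route || !route.severities.includes(severity)) return null;
  if (route.businessHoursOnly && !isBusinessHours(now)) return null;
  return route.target;
}
```

A `null` result is the cue to open a ticket or post to Slack instead of paging anyone.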
### Escalation policy
```
1. Primary (5 min ack)
↓ no ack
2. Secondary (5 min ack)
↓ no ack
3. Manager
↓ no ack
4. VP Eng
```
→ Escalate automatically. Even if the primary sleeps through the page, the backup gets it.
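The chain above can be modelled as a list of steps with acknowledgement timeouts. A minimal sketch: the 5 min timeouts for primary and secondary come from the policy above, while the manager's timeout is an assumption because the policy does not state one.

```typescript
interface EscalationStep {
  who: string;
  ackTimeoutMin: number;
}

const POLICY: EscalationStep[] = [
  { who: "primary", ackTimeoutMin: 5 },
  { who: "secondary", ackTimeoutMin: 5 },
  { who: "manager", ackTimeoutMin: 5 }, // assumed; not stated in the policy
  { who: "vp-eng", ackTimeoutMin: Infinity }, // last step, nowhere further to go
];

// Who should currently be paged, given how long the alert has gone unacknowledged.
function currentResponder(minutesUnacked: number, policy: EscalationStep[] = POLICY): string {
  if (policy.length === 0) throw new Error("escalation policy is empty");
  let deadline = 0;
  for (const step of policy) {
    deadline += step.ackTimeoutMin;
    if (minutesUnacked < deadline) return step.who;
  }
  // Past every deadline: stay on the final step.
  return policy[policy.length - 1].who;
}
```

With these timeouts, an alert that nobody acknowledges reaches the manager at 10 minutes and the VP of Engineering at 15.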
### Handoff (end of rotation)
```markdown
# On-call handoff: 2026-05-09
## Open incidents
- [ ] DB lag investigation (Grafana )
## Recent issues
- 14:00 API spike → resolved
## Pending
- Migration scheduled 2026-05-12
## Notes
- DB upgrade Tuesday — be careful
```
→ A 30 min meeting or a Slack post.
### Role of the secondary
```
- Primary does not respond → automatic escalation to the secondary
- Primary is in deep work → the secondary takes first response
- Big incident → both work on it
```
### Work during on-call
```
- No large or risky tasks (hard to focus)
- Code review / small tasks are OK
- Cut down on meetings
- Improve tools / runbooks (long-term value)
- Bug fixes (the long-postponed ones)
```
### Notification channels
```
- Phone call (P1)
- SMS / Push notification (P2)
- Slack (P3)
- Email (P4)
→ One channel per severity.
```
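The severity-to-channel mapping above is small enough to keep as a typed lookup table. A sketch with hypothetical channel identifiers; typing the table as `Record<Severity, Channel>` makes the compiler reject a severity that has no channel.

```typescript
type Severity = "P1" | "P2" | "P3" | "P4";
// Hypothetical channel identifiers.
type Channel = "phone_call" | "sms_push" | "slack" | "email";

const CHANNEL_BY_SEVERITY: Record<Severity, Channel> = {
  P1: "phone_call",
  P2: "sms_push",
  P3: "slack",
  P4: "email",
};

function channelFor(severity: Severity): Channel {
  return CHANNEL_BY_SEVERITY[severity];
}
```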
### Apps
```
- PagerDuty / Opsgenie: the de facto standards
- Grafana Oncall: OSS, free
- Better Stack: modern
- FireHydrant / incident.io: incident management
```
### Metrics (oncall health)
```
- Pages per shift (target: < 3)
- After-hours pages (target: < 1 / week)
- Resolution time (P1: < 1h)
- Action item completion rate
- Burnout survey (quarterly)
```
→ Too many pages = fix the process or the system.
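The first two metrics can be computed directly from page timestamps. A minimal sketch for a one-week shift: the targets are the ones listed above, and "after hours" is assumed to mean anything outside Mon to Fri, 09:00 to 18:00 UTC.

```typescript
const MAX_PAGES_PER_SHIFT = 3; // target: fewer than 3
const MAX_AFTER_HOURS_PER_WEEK = 1; // target: fewer than 1

// Assumed definition of after hours: outside Mon to Fri, 09:00 to 18:00 UTC.
function isAfterHours(at: Date): boolean {
  const day = at.getUTCDay(); // 0 = Sunday, 6 = Saturday
  const hour = at.getUTCHours();
  const duringWork = day >= 1 && day <= 5 && hour >= 9 && hour < 18;
  return !duringWork;
}

interface ShiftHealth {
  pages: number;
  afterHoursPages: number;
  withinTargets: boolean;
}

// Summarises one weekly shift from the timestamps of its pages.
function shiftHealth(pageTimes: Date[]): ShiftHealth {
  const pages = pageTimes.length;
  const afterHoursPages = pageTimes.filter(isAfterHours).length;
  return {
    pages,
    afterHoursPages,
    withinTargets: pages < MAX_PAGES_PER_SHIFT && afterHoursPages < MAX_AFTER_HOURS_PER_WEEK,
  };
}
```

Tracking `withinTargets` per shift over a quarter gives a simple trend line for the retro.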
### Page during sleep
```
1. Phone on vibrate + loud ringtone
2. Paging app with emergency override enabled
3. ACK, then open the laptop and connect to VPN
4. Join the Slack channel
5. Investigate + mitigate
6. Update the status page
→ Recovery: starting late the next day is OK.
```
### Tool preparation
```
- Laptop within reach
- VPN auto-connect
- Key dashboards bookmarked
- kubectl context saved
- Runbooks available offline (downloaded)
- Fast SSH / kubectl access
```
### Don't be a hero
```
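Complex incident: do not handle it alone.
- Ping additional on-call engineers
- Ask for help in Slack
- Wake the manager (P1 / large impact)
- Keep a decision log (chronological)
→ Asking for help beats making the wrong call.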
```
### Post-incident
```
1. Status page resolve
2. Internal communication
3. Schedule the postmortem (within 1 week)
4. Create action items
5. Recovery: start late the next day
```
### Vacation / time-off
```
Agree on cover at least 1 week before the on-call shift:
- Swap with someone else
- Change the schedule
→ Last-minute changes put a burden on others.
```
### Onboarding new oncall
```
1. Shadow 2 rotations (with senior)
2. Handle a small incident with a mentor
3. First solo rotation (paired with senior secondary)
4. Independent
→ Build trust. Do not throw people in at the deep end.
```
### Cross-team escalation
```
- DB team — DB issue
- Platform — K8s / infra
- Security — auth / leak
- Product — user-facing 결정
→ Via the team's Slack channel or a direct page.
```
### "Quiet" rotation (good!)
```
0 pages = good.
"No alarms this week" = the system is stable.
→ Nothing to brag about, but worth celebrating.
```
### Burnout signs
```
- Trouble sleeping
- Working every weekend
- No family time
- The same alert firing repeatedly (broken process)
- Action items ignored
→ Signals for the manager: change the rotation, hire, improve the system.
```
### Improve oncall
```
- Quarterly retro (oncall team)
- Fix the top 5 noisiest alerts every quarter
- Keep runbooks up to date
- Auto-remediation (handle frequent alerts automatically)
```
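Picking the noisiest alerts for the quarterly fix list is a simple frequency count. A sketch, assuming the input is the list of alert names that fired during the quarter, exported from whatever paging tool is in use.

```typescript
// Returns the n most frequent alert names with their counts, most frequent first.
// Ties are broken alphabetically so the result is stable.
function topNoisyAlerts(firedAlertNames: string[], n: number = 5): Array<[string, number]> {
  const counts = new Map<string, number>();
  for (const name of firedAlertNames) {
    counts.set(name, (counts.get(name) ?? 0) + 1);
  }
  return Array.from(counts.entries())
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
    .slice(0, n);
}
```

The output doubles as the agenda for the alert review: each entry is either fixed, tuned, or demoted from a page to a ticket.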
### Auto-remediation
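```ts
// Example: high CPU on the API → scale up automatically.
// `k8s` and `slack` are placeholders for your own client wrappers.
if (alert === 'HighCPU' && service === 'api') {
  const target = currentReplicas * 2;
  await k8s.scale('api', target);
  await slack.notify(`Auto-scaled API to ${target} replicas`);
}
```
→ Automate repetitive alerts. On-call load ↓.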
### Page latency budget
```
SLO: P1 alert → 5 min ack, 30 min mitigate, 4 hour resolve.
Measure every quarter; if the budget is missed, fix the process or tooling.
```
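The budget above can be checked per incident once the timestamps are recorded. A minimal sketch: the `IncidentTimeline` shape is hypothetical, and measuring every stage from the moment of the page is an assumption about how the budget is meant to be read.

```typescript
// Hypothetical record of one P1 incident.
interface IncidentTimeline {
  pagedAt: Date;
  ackedAt: Date;
  mitigatedAt: Date;
  resolvedAt: Date;
}

// Budget in minutes, each measured from the page: 5 min ack, 30 min mitigate, 4 h resolve.
const BUDGET_MIN = { ack: 5, mitigate: 30, resolve: 240 } as const;

function minutesBetween(from: Date, to: Date): number {
  return (to.getTime() - from.getTime()) / 60000;
}

// Names of the stages that exceeded their budget; empty when the incident met the SLO.
function budgetBreaches(t: IncidentTimeline): string[] {
  const breaches: string[] = [];
  if (minutesBetween(t.pagedAt, t.ackedAt) > BUDGET_MIN.ack) breaches.push("ack");
  if (minutesBetween(t.pagedAt, t.mitigatedAt) > BUDGET_MIN.mitigate) breaches.push("mitigate");
  if (minutesBetween(t.pagedAt, t.resolvedAt) > BUDGET_MIN.resolve) breaches.push("resolve");
  return breaches;
}
```

Aggregating the breached stages across a quarter shows whether the weak point is paging (ack), runbooks (mitigate), or follow-through (resolve).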
## 🤔 Decision criteria
| Scenario | Recommendation |
|---|---|
| Typical web service | Weekly rotation + runbook |
| Small team (fewer than 4) | External SaaS / 24/7 vendor |
| Large service | 24/7 + secondary + escalation |
| Internal tool | Business hours |
| Customer-facing | Strict SLO + paging |
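## ❌ Anti-patterns
- **Hero culture (one person does everything)**: leads to burnout. Work as a team.
- **No compensation**: people burn out or leave.
- **Alert noise**: the boy who cried wolf; real alerts get ignored.
- **No runbook**: guessing from scratch every time.
- **Junior alone on call**: shadow first.
- **Rotating too often (daily)**: context loss.
- **Rotating too rarely (once a month)**: skills get rusty between shifts.
## 🤖 Hints for LLM use
- State a runbook explicitly for every alert.
- Hold an alert-quality review every quarter.
- State compensation explicitly.
- Monitor burnout.
## 🔗 Related documents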
- [[Productivity_Postmortem]]
- [[DevOps_Observability_Stack]]
- [[Backend_Health_Check_Patterns]]