2nd/10_Wiki/Topics/Coding/Productivity_Oncall_Playbook.md
2026-05-09 21:08:02 +09:00


id: productivity-oncall-playbook
title: On-call Playbook - Rotation / Runbook / Escalation
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: productivity, oncall, sre, vibe-coding
tech_stack: language = Process, applicable_to = Engineering
applied_in: (none)
aliases: oncall, on-call, runbook, escalation, paging, alert fatigue, rotation

On-call Playbook

A 24/7 service means someone gets paged. The building blocks are rotation, runbooks, escalation, and alert quality. Preventing burnout takes clear rules and regular reviews.

📖 Core Concepts

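  • Primary / Secondary: the first responder plus a backup.
  • Runbook: for each alert, what to do.
  • Escalation: no response within the time limit (e.g. 30 min) → page the next person in the chain.
  • Alert quality: actionable, with little noise.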

💻 Code Patterns

Rotation structure

Weekly rotation (default):
- Mon 9am → Mon 9am (next week)
- Primary + Secondary
- Handoff meeting

Alternatives:
- Daily rotation (small teams)
- Follow-the-sun (global teams)
- 4-week rotation cycle (spreads fatigue)
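
A minimal sketch of the weekly primary + secondary assignment described above. The roster names and the rule that the secondary is the next person in the list are assumptions for illustration; the playbook does not prescribe either.

```javascript
// Hypothetical roster; the names are placeholders.
const roster = ["alice", "bob", "carol", "dave"];

// Returns the primary and secondary for a zero-based week index.
// The secondary is the next person in the roster, so each engineer
// serves as backup the week before becoming primary.
function weeklyShift(roster, weekIndex) {
  if (roster.length < 2) {
    throw new Error("need at least two people for primary + secondary");
  }
  const n = roster.length;
  const i = ((weekIndex % n) + n) % n; // also safe for negative indexes
  return { primary: roster[i], secondary: roster[(i + 1) % n] };
}

const week0 = weeklyShift(roster, 0);
console.log(week0.primary, week0.secondary); // alice bob
```

With a roster of four, each person is primary once every four weeks, which matches the 4-week cycle listed as an alternative.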

Compensation

- Overtime pay (1.5x hourly rate)
- Comp time (a day off)
- Annual bonus
- Or "voluntary + free meals"

→ Write the policy down. No compensation leads to burnout.

Runbook (per alert)

# Alert: HighErrorRate-API

**Severity:** P2
**Description:** API error rate > 5% for 5 minutes

## Investigate
1. Check Grafana dashboard: <link>
2. Check service logs (last 15 min):

kubectl logs -l app=api -n prod --since=15m --tail=-1 | grep -i error

3. Check recent deploys:

kubectl rollout history deployment/api -n prod

4. Check upstream services (DB, Redis):
- DB: <Grafana DB dashboard>
- Redis: <Grafana Redis>

## Mitigate
1. If recent deploy (< 1 hour) → rollback:

kubectl rollout undo deployment/api -n prod

2. If DB pool exhaustion → scale up replicas:

kubectl scale deployment/api -n prod --replicas=10

3. If upstream down → enable maintenance mode:

curl -X POST $ADMIN/maintenance/enable


## Escalate
- DB issue → @db-team
- Network issue → @platform
- Security → @security

## Resolve
- Wait for error rate < 1% for 10 min
- Update incident channel
- Schedule postmortem

Alert quality (an alert that often fires wrongly = turn it off or fix it)

Good alerts:
✅ User impact is clear
✅ Actionable: what to do is clear
✅ Rarely pages (few false positives)
✅ Severity is stated

Bad alerts:
❌ Noise (fires every time CPU > 80% for 5 seconds)
❌ Vague ("something is wrong")
❌ Self-resolving (if it clears on its own, keep it as monitoring only)

Defining page-worthy

Pageable (P1/P2):
- Large user impact (5%+ of users)
- Money loss
- Security
- SLO breach imminent

Non-pageable (P3/P4):
- Internal tool down
- Small batch job failure
- Disk usage at 50% (there is still time)

→ The latter get a ticket or a Slack message only.
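
The pageable criteria above can be written down as a single predicate, which makes the rule reviewable. This is only a sketch: the alert field names (`affectedUserFraction`, `moneyLoss`, `security`, `sloBreachImminent`) are hypothetical and would need to map onto whatever labels your alerts actually carry.

```javascript
// Returns true when an alert meets any of the pageable criteria.
function isPageable(alert) {
  return (
    alert.affectedUserFraction >= 0.05 || // 5%+ of users
    alert.moneyLoss === true ||
    alert.security === true ||
    alert.sloBreachImminent === true
  );
}

// Pageable alerts page a human; everything else becomes a ticket.
function destination(alert) {
  return isPageable(alert) ? "page" : "ticket";
}

console.log(destination({ affectedUserFraction: 0.12 })); // page
console.log(destination({ name: "internal-tool-down" })); // ticket
```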

Alert routing

# PagerDuty / Opsgenie / Grafana Oncall
# Route by service and by severity.
# Illustrative schema only; each tool has its own configuration format.

services:
  api:
    primary: oncall-api-primary
    severity: [P1, P2]
  database:
    primary: oncall-db
    severity: [P1]
  ml:
    primary: oncall-ml
    business_hours_only: true

24/7 vs business hours

Not every alert needs 24/7 coverage.
- API: 24/7
- ML training: business hours
- Internal tools: business hours

→ Only truly critical alerts should page in the middle of the night.
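
A small sketch of how the routing table and the business-hours rule could combine. The `routes` object mirrors the routing example above, but its field names and the Mon-Fri 09:00-18:00 local-time definition of business hours are assumptions, not something the playbook fixes.

```javascript
// Hypothetical in-code version of the routing table.
const routes = {
  api: { target: "oncall-api-primary", severities: ["P1", "P2"] },
  database: { target: "oncall-db", severities: ["P1"] },
  ml: { target: "oncall-ml", businessHoursOnly: true },
};

// Assumed definition: Mon-Fri, 09:00-18:00 in the Date's local time.
function isBusinessHours(date) {
  const day = date.getDay(); // 0 = Sunday, 6 = Saturday
  const hour = date.getHours();
  return day >= 1 && day <= 5 && hour >= 9 && hour < 18;
}

// Returns the rotation to page, or null when the alert should not page now.
function pageTarget(service, severity, now) {
  const route = routes[service];
  if (!route) return null;
  if (route.severities && !route.severities.includes(severity)) return null;
  if (route.businessHoursOnly && !isBusinessHours(now)) return null;
  return route.target;
}
```

Alerts that return `null` here would still be recorded; they just go to a ticket or channel instead of waking someone up.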

Escalation policy

1. Primary (5 min ack)
   ↓ no ack
2. Secondary (5 min ack)
   ↓ no ack
3. Manager
   ↓ no ack
4. VP Eng

→ Escalation is automatic, so there is a backup even if the primary sleeps through the page.
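
The chain above can be modeled as a list of steps with acknowledgement windows. In this sketch the 5 minute window for the manager is an assumption (the policy only gives windows for primary and secondary), and the role names are placeholders.

```javascript
const chain = [
  { who: "primary", ackMinutes: 5 },
  { who: "secondary", ackMinutes: 5 },
  { who: "manager", ackMinutes: 5 }, // assumed; the policy gives no window here
  { who: "vp-eng", ackMinutes: Infinity }, // last stop, never escalates further
];

// Who should currently hold the page, given minutes elapsed without an ack.
function currentResponder(chain, minutesUnacked) {
  let remaining = minutesUnacked;
  for (const step of chain) {
    if (remaining < step.ackMinutes) return step.who;
    remaining -= step.ackMinutes;
  }
  return chain[chain.length - 1].who;
}

console.log(currentResponder(chain, 7)); // secondary
```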

Handoff (end of rotation)

# On-call handoff: 2026-05-09

## Open incidents
- [ ] DB lag investigation (Grafana <link>)

## Recent issues
- 14:00 API spike → resolved

## Pending
- Migration scheduled 2026-05-12

## Notes
- DB upgrade Tuesday — be careful

→ A 30 min meeting or a Slack post.

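Role of the secondary

- Primary does not respond → the page escalates automatically
- Primary is in deep work → the secondary takes first response
- Big incident → both work on it

Work while on call

- Avoid large or risky tasks (hard to focus)
- Code review and small tasks are fine
- Cut down on meetings
- Improve tools and runbooks (long-term value)
- Fix long-postponed bugs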

Notification channels

- Phone call (P1)
- SMS / Push notification (P2)
- Slack (P3)
- Email (P4)

→ One channel per severity.

Apps

- PagerDuty / Opsgenie: the standard choices
- Grafana Oncall: OSS, free
- Better Stack: modern
- FireHydrant / incident.io: incident management

Metrics (oncall health)

- Pages per shift (target: < 3)
- After-hours pages (target: < 1 / week)
- Resolution time (P1: < 1h)
- Action item completion rate
- Burnout survey (quarterly)

→ Too many pages means the process or the system needs fixing.
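
The first two metrics are easy to compute from a page log. The sketch below assumes a weekly shift, and defines "after hours" as weekends or outside 09:00-18:00 local time; both are assumptions to adjust to your own policy.

```javascript
// pages: [{ at: Date }], all from a single (assumed weekly) shift.
function shiftHealth(pages) {
  const afterHours = pages.filter(({ at }) => {
    const day = at.getDay(); // 0 = Sunday, 6 = Saturday
    const hour = at.getHours();
    return day === 0 || day === 6 || hour < 9 || hour >= 18;
  }).length;
  return {
    pages: pages.length,
    afterHours,
    // Targets from the list above: < 3 pages per shift, < 1 after-hours page per week.
    withinTargets: pages.length < 3 && afterHours < 1,
  };
}
```

Tracking the result per shift over a quarter gives the retro a trend to discuss rather than anecdotes.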

Page during sleep

1. Phone on vibrate plus a loud ringer
2. Paging app with emergency override
3. ACK, then get on the laptop / VPN
4. Join the incident Slack channel
5. Investigate + mitigate
6. Update the status page

→ Recovery: starting late the next day is fine.

Tool readiness

- Laptop within reach
- VPN auto-connect
- Key dashboards bookmarked
- kubectl contexts saved
- Runbooks available offline (downloaded)
- Fast SSH / kubectl access

Don't be a hero

Do not handle a complex incident alone.
- Ping additional on-call engineers
- Ask for help in Slack
- Wake the manager (P1 / large impact)
- Keep a chronological decision log

→ Asking for help beats making the wrong call.

Post-incident

1. Resolve the status page
2. Internal communication
3. Schedule the postmortem (within 1 week)
4. Create action items
5. Recovery: start late the next day

Vacation / time-off

Agree on cover at least one week before the shift:
- Swap with someone else
- Change the schedule

→ Last-minute changes put a burden on others.

Onboarding new oncall

1. Shadow 2 rotations (with a senior)
2. Handle a small incident with a mentor
3. First solo rotation (paired with a senior secondary)
4. Independent

→ Build trust. Do not throw people in cold.

Cross-team escalation

- DB team: DB issues
- Platform: K8s / infra
- Security: auth / leaks
- Product: user-facing decisions

→ Via the team's Slack channel or by paging them directly.

"Quiet" rotation (good!)

0 pages = good.
"No alarms this week" means the system is stable.

→ Not something to brag about, but something to celebrate.

Burnout signs

- Not sleeping
- Working every weekend
- No family time
- The same alert recurring (broken process)
- Action items ignored

→ These are signals for the manager to act: change the rotation, hire, or fix the system.

Improve oncall

- Quarterly retro (on-call team)
- Fix the top 5 noisiest alerts every quarter
- Keep runbooks updated
- Auto-remediation (handle frequent alerts automatically)
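
Finding the top 5 noisiest alerts is a simple frequency count over the quarter's page log. A minimal sketch, assuming the log can be exported as a flat list of alert names (the export format is an assumption, since each paging tool differs):

```javascript
// pageLog: array of alert names, one entry per page fired.
// Returns the n most frequent alerts with their counts, most frequent first.
function topNoisy(pageLog, n = 5) {
  const counts = new Map();
  for (const name of pageLog) {
    counts.set(name, (counts.get(name) || 0) + 1);
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, n)
    .map(([name, count]) => ({ name, count }));
}

const log = ["HighCPU", "HighErrorRate-API", "HighCPU", "DiskUsage", "HighCPU"];
console.log(topNoisy(log, 1)); // the single noisiest alert: HighCPU with 3 pages
```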

Auto-remediation

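// Example: high CPU → scale up automatically.
// `k8s` and `slack` stand in for whatever clients you use.
if (alert === 'HighCPU' && service === 'api') {
  const target = currentReplicas * 2;
  await k8s.scale('api', target);
  // Backticks are required for ${} interpolation.
  await slack.notify(`Auto-scaled API to ${target}`);
}

→ Automating repetitive alerts reduces the on-call burden.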

Page latency budget

SLO: P1 alert → 5 min ack, 30 min mitigate, 4 hour resolve.

Measure every quarter; when the targets are missed, fix the process or the tooling.
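
One way to measure the budget above is to check each closed P1 incident against it. The incident shape (`paged`, `acked`, `mitigated`, `resolved` as epoch milliseconds) is hypothetical, and all three stages are measured from the time of the page, which is one reading of the SLO.

```javascript
const MIN = 60 * 1000;

// Budget from the SLO above: 5 min ack, 30 min mitigate, 4 hour resolve.
const P1_BUDGET = { acked: 5 * MIN, mitigated: 30 * MIN, resolved: 4 * 60 * MIN };

// Returns the stages that exceeded their budget.
// Intended for closed incidents: a missing timestamp counts as a breach.
function budgetBreaches(incident, budget = P1_BUDGET) {
  return Object.keys(budget).filter((stage) => {
    const at = incident[stage];
    return at === undefined || at - incident.paged > budget[stage];
  });
}

const incident = { paged: 0, acked: 3 * MIN, mitigated: 40 * MIN, resolved: 120 * MIN };
console.log(budgetBreaches(incident)); // only the mitigate stage is over budget
```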

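🤔 Decision Criteria

| Situation | Recommendation |
|---|---|
| General web service | Weekly rotation + runbook |
| Small team (fewer than 4) | External SaaS / 24/7 vendor |
| Large service | 24/7 + secondary + escalation |
| Internal tool | Business hours |
| Customer-facing | Strict SLO + paging |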

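Anti-patterns

  • Hero culture (one person does everything): leads to burnout. Work as a team.
  • No compensation: people burn out or leave.
  • Alert noise: crying wolf, so real alerts get ignored.
  • No runbook: guessing from scratch every time.
  • Junior alone: have them shadow first.
  • Rotating too often (daily): context loss.
  • Rotating too rarely (once a month): people get rusty between shifts.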

🤖 Hints for LLM Use

  • Write a runbook for every alert.
  • Hold an alert-quality review every quarter.
  • State compensation explicitly.
  • Monitor for burnout.

🔗 Related Documents