---
id: devops-argo-rollouts
title: Argo Rollouts — Canary / Blue-Green deploy
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [devops, deployment, vibe-coding]
tech_stack: { language: "YAML", applicable_to: ["DevOps"] }
applied_in: []
aliases: [Argo Rollouts, canary, blue-green, progressive delivery, Flagger, AnalysisRun]
---
# Argo Rollouts
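> A plain Kubernetes Deployment only offers rolling updates (or Recreate), with no fine-grained control over traffic. **Argo Rollouts adds canary, blue-green, and experiment strategies**, plus automatic rollback based on metrics.
## 📖 Key concepts
- Canary: shift traffic in stages, e.g. 1% → 10% → 100%.
- Blue-green: run two versions side by side and switch all traffic at once.
- Analysis: promote or abort based on Prometheus / Datadog metrics.
- Service mesh + Argo Rollouts = precise traffic shifting.
## 💻 Code patterns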
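### Rollout (replaces a Deployment)
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 5
  selector:              # required, as in a Deployment
    matchLabels:
      app: my-app
  strategy:
    canary:
      steps:
        - setWeight: 20
        - pause: { duration: 5m }
        - setWeight: 50
        - pause: { duration: 10m }
        - setWeight: 100
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: app
          image: myapp:v2
```
→ The Argo Rollouts controller (not kubectl) advances through the steps automatically and rolls back if the rollout is aborted.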
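Without a traffic router, each `setWeight` can only be approximated by pod counts. The following is a minimal Python sketch of that arithmetic for the `replicas: 5` example, assuming the canary count is rounded up. `canary_replicas` is a hypothetical helper, not part of Argo Rollouts, and real pod counts also depend on `maxSurge` / `maxUnavailable`.

```python
def canary_replicas(replicas: int, weight: int) -> int:
    """Approximate canary pod count for a setWeight percentage (rounded up)."""
    if not 0 <= weight <= 100:
        raise ValueError("weight must be between 0 and 100")
    # Integer ceiling of replicas * weight / 100.
    return -(-replicas * weight // 100)


# Pod counts for the steps 20 -> 50 -> 100 with replicas: 5.
plan = [(w, canary_replicas(5, w)) for w in (20, 50, 100)]
```

With only five pods, the 50% step lands on three pods (60% of capacity), so small replica counts give coarse steps. Exact percentages need a traffic router.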
### Manual promote
```bash
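# Requires the Argo Rollouts kubectl plugin.
kubectl argo rollouts get rollout my-app   # show step-by-step progress (add --watch to follow)
kubectl argo rollouts promote my-app       # continue past the current pause
kubectl argo rollouts abort my-app         # stop the update and return traffic to stable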
```
### Pause + manual
```yaml
strategy:
  canary:
    steps:
      - setWeight: 10
      - pause: {}   # indefinite: waits until a manual promote
      - setWeight: 100
```
→ For the first production deploy, require manual approval.
### Blue-green
```yaml
strategy:
  blueGreen:
    activeService: my-app-active
    previewService: my-app-preview
    autoPromotionEnabled: false
```
```
1. Only the new ReplicaSet is created (preview).
2. The preview Service points at the new version.
3. Test / verify.
4. Promote = the active Service switches to the new version.
5. The old version sits idle until it is scaled down (rollback still possible).
```
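The five steps above can be modelled as two pointers that move independently. This is a toy Python sketch, not the controller's implementation; `BlueGreenState` and its methods are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class BlueGreenState:
    active: str   # version the active Service selects
    preview: str  # version the preview Service selects

    def deploy(self, version: str) -> None:
        # Steps 1-2: only the preview Service moves to the new version.
        self.preview = version

    def promote(self) -> str:
        # Step 4: the active Service switches to the previewed version.
        # The old version is returned so it can stay idle for rollback (step 5).
        previous, self.active = self.active, self.preview
        return previous


state = BlueGreenState(active="v1", preview="v1")
state.deploy("v2")
after_deploy = (state.active, state.preview)
idle_version = state.promote()
after_promote = (state.active, state.preview)
```

Production traffic never sees `v2` until `promote()` runs, which is what `autoPromotionEnabled: false` enforces.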
### Analysis (Prometheus)
```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 1m
      count: 5   # assumed value; with an interval but no count the run never finishes
      # Prometheus returns a vector, so index 0 holds the value.
      successCondition: result[0] >= 0.95
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.example.com
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",status=~"2.."}[5m]))
            /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))
```
```yaml
spec:
  strategy:
    canary:
      steps:
        - setWeight: 20
        - pause: { duration: 1m }
        - analysis:
            templates:
              - templateName: success-rate
            args:
              - name: service-name
                value: my-app
        - setWeight: 50
```
→ If the success rate falls below 95% on more than `failureLimit` measurements, the rollout aborts (rolls back).
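How `successCondition` and `failureLimit` interact is easy to misread, so here is a simplified Python model of one analysis run. It assumes a run fails once the number of failed measurements exceeds `failureLimit`. `run_analysis` and the sample values are hypothetical.

```python
def run_analysis(measurements, threshold=0.95, failure_limit=3):
    """Return 'Failed' once failures exceed failure_limit, else 'Successful'."""
    failures = 0
    for value in measurements:
        if value < threshold:          # successCondition: result[0] >= 0.95
            failures += 1
            if failures > failure_limit:
                return "Failed"        # the rollout would abort here
    return "Successful"


one_blip = run_analysis([0.99, 0.90, 0.97, 0.99, 0.98])
sustained_drop = run_analysis([0.90, 0.90, 0.90, 0.90, 0.99])
```

A single bad minute is tolerated, while a sustained drop aborts the rollout. Lowering `failureLimit` makes the gate stricter.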
### Traffic management (Istio)
```yaml
strategy:
  canary:
    canaryService: my-app-canary
    stableService: my-app-stable
    trafficRouting:
      istio:
        virtualServices:
          - name: my-app-vsvc
        destinationRule:
          name: my-app-destrule
          canarySubsetName: canary
          stableSubsetName: stable
    steps:
      - setWeight: 5
      - pause: { duration: 10m }
      - setWeight: 25
```
→ Istio performs the weighted routing.
### Header-based routing
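```yaml
# Requires a traffic router; the route name must also be listed
# under trafficRouting.managedRoutes.
steps:
  - setHeaderRoute:
      name: beta-route
      match:
        - headerName: X-Canary
          headerValue:
            exact: "true"
  - pause: {}   # only beta users reach v2
  - setWeight: 50
```
→ Only requests carrying the `X-Canary: true` header reach the canary.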
### NGINX / ALB ingress
```yaml
# canaryService and stableService must also be set on the canary strategy.
trafficRouting:
  nginx:
    stableIngress: my-app-stable-ingress
    annotationPrefix: nginx.ingress.kubernetes.io
```
→ Works without a service mesh.
### Experiment (long-running A/B)
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Experiment
metadata:
  name: my-experiment
spec:
  duration: 1h
  templates:
    - name: baseline
      replicas: 1
      template: ...
    - name: canary
      replicas: 1
      template: ...
  analyses:
    - name: success-rate
      templateName: success-rate
```
→ Runs for one hour and compares metrics.
### Flagger (alternative)
```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: my-app
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  analysis:
    interval: 1m
    threshold: 5
    iterations: 10
    metrics:
      - name: request-success-rate
        thresholdRange: { min: 99 }
```
→ Fits well with Flux / Helm. Similar in scope to Argo Rollouts.
### Rollback
```bash
kubectl argo rollouts undo my-app
# or revert the image in the spec to the previous version
```
→ The previous ReplicaSet becomes active again.
### Auto-rollback (metric)
```yaml
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - analysis: { templates: [{ templateName: error-rate }] }
        # a failed error-rate analysis triggers an automatic rollback
```
→ Safe even with nobody watching.
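The control flow behind an analysis-gated canary can be sketched in a few lines of Python. This is a simplified model, not the controller's code: `run_canary` and the `analysis_passes` callback are hypothetical, and pauses are omitted.

```python
def run_canary(steps, analysis_passes):
    """Walk the canary steps and return (outcome, canary weight history)."""
    history = []
    for step in steps:
        if "setWeight" in step:
            history.append(step["setWeight"])
        elif "analysis" in step and not analysis_passes(step["analysis"]):
            history.append(0)  # abort: all traffic goes back to stable
            return "aborted", history
    history.append(100)  # all steps done: the new version becomes stable
    return "promoted", history


steps = [{"setWeight": 10}, {"analysis": "error-rate"}, {"setWeight": 50}]
healthy = run_canary(steps, lambda template: True)
unhealthy = run_canary(steps, lambda template: False)
```

In the failing case the canary never gets past 10% of traffic, which is the point of placing the analysis step before the larger weight.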
### Multiple analysis
```yaml
analysis:
  templates:
    - templateName: success-rate
    - templateName: latency-p99
    - templateName: error-rate
```
→ Promote only when all of them pass.
### Failure / inconclusive conditions
```yaml
metrics:
  - name: success-rate
    successCondition: result[0] >= 0.95
    failureCondition: result[0] < 0.9
    failureLimit: 3
    inconclusiveLimit: 5   # neither condition met (0.9 to 0.95) = inconclusive
```
→ Failure and inconclusive outcomes are stated explicitly.
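With both conditions set, each measurement falls into one of three buckets. The Python sketch below mirrors the thresholds in the YAML above; `classify` and `summarize` are hypothetical helpers, and the limits are assumed to trip when a count exceeds them.

```python
from collections import Counter


def classify(value: float) -> str:
    """Bucket one measurement using the two conditions."""
    if value >= 0.95:      # successCondition
        return "success"
    if value < 0.9:        # failureCondition
        return "failure"
    return "inconclusive"  # neither condition matched


def summarize(values, failure_limit=3, inconclusive_limit=5) -> str:
    """Overall result for a run, checking the failure limit first."""
    counts = Counter(classify(v) for v in values)
    if counts["failure"] > failure_limit:
        return "Failed"
    if counts["inconclusive"] > inconclusive_limit:
        return "Inconclusive"
    return "Successful"


buckets = [classify(v) for v in (0.97, 0.92, 0.85)]
```

The gap between 0.9 and 0.95 is deliberate: a result there is neither trusted nor treated as a failure, so a human can decide.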
### Web hook (external system)
```yaml
metrics:
  - name: web-test
    provider:
      web:
        url: https://my-api.example.com/health
        jsonPath: '{$.status}'
        method: GET
    successCondition: result == "healthy"
```
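What the web metric checks can be shown offline in Python: extract the `status` field from the JSON body and compare it to `"healthy"`. `evaluate_web_metric` is a hypothetical helper and the response bodies are made up; no HTTP request is sent.

```python
import json


def evaluate_web_metric(body: str) -> bool:
    """Mimic jsonPath '{$.status}' followed by result == "healthy"."""
    result = json.loads(body).get("status")  # None when the field is missing
    return result == "healthy"


passing = evaluate_web_metric('{"status": "healthy"}')
failing = evaluate_web_metric('{"status": "degraded"}')
```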
### Notification (Slack)
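```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  annotations:
    # Subscriptions are annotations, not a spec field; "my-channel" is a placeholder.
    notifications.argoproj.io/subscribe.on-rollout-aborted.slack: my-channel
    notifications.argoproj.io/subscribe.on-rollout-completed.slack: my-channel
spec:
  ...
```
→ Sends a Slack notification when the rollout completes or aborts.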
### GitOps (ArgoCD + Argo Rollouts)
```
1. Push the new image tag to git.
2. ArgoCD sync = the Rollout spec is updated.
3. The Rollout starts the canary.
4. Metrics pass = promote.
5. Metrics fail = automatic rollback (at the K8s level; git is not reverted).
```
→ Every deploy becomes progressive.
### Cost / overhead
```
- Every canary runs extra replicas (about 50% extra during the rollout)
- Metric queries add load and cost to the cluster
- Engineering time
→ Worth it when every deploy carries significant risk.
```
### Real-world
- **Intuit** (the company behind Argo)
- **Adobe**: large-scale Argo user
- **GitHub**: a similar internal system
- **Spotify**: Flagger
- **Every SaaS**: some form of progressive delivery
### When NOT?
```
- Small internal tools: a rolling deploy is enough.
- Stateful workloads: blue-green is hard (DB).
- Cron / batch jobs: a canary is meaningless.
→ Critical-path APIs and web services are the sweet spot.
```
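### Pitfalls with stateful workloads
```
DB schema changes:
- v1 and v2 run at the same time = the schema must be compatible with both.
- Backward-compatible migrations are mandatory.
→ "expand-contract":
1. Add the new column (v1 keeps working).
2. v2 uses the new column.
3. Retire v1.
4. Drop the old column.
```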
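The expand-contract sequence can be tried end to end with an in-memory SQLite database. This is a sketch under assumptions: the `users` table, the `name` → `display_name` rename, and the `v1_insert` / `v2_insert` helpers are all hypothetical, and the contract step rebuilds the table so that it works on older SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")


def v1_insert(name):
    # v1 only knows the old column.
    conn.execute("INSERT INTO users (name) VALUES (?)", (name,))


def v2_insert(display_name):
    # v2 writes both columns while v1 is still running.
    conn.execute(
        "INSERT INTO users (name, display_name) VALUES (?, ?)",
        (display_name, display_name),
    )


v1_insert("alice")

# 1. Expand: add the new column. v1 keeps working unchanged.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")
v1_insert("bob")

# 2. v2 starts using the new column; backfill rows written by v1.
v2_insert("carol")
conn.execute("UPDATE users SET display_name = name WHERE display_name IS NULL")

# 3. v1 is retired (no more writers of the old column).
# 4. Contract: drop the old column by rebuilding the table.
conn.execute("CREATE TABLE users_new (id INTEGER PRIMARY KEY, display_name TEXT)")
conn.execute("INSERT INTO users_new SELECT id, display_name FROM users")
conn.execute("DROP TABLE users")
conn.execute("ALTER TABLE users_new RENAME TO users")

rows = conn.execute("SELECT display_name FROM users ORDER BY id").fetchall()
columns = [row[1] for row in conn.execute("PRAGMA table_info(users)")]
```

At no point does a running version see a schema it cannot handle, which is what makes the canary window safe.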
### Header-based testing
```
The QA team adds a header → only their requests hit the canary.
"X-Canary: true" → served by v2 only.
→ A true canary at 0% of production traffic.
```
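The routing decision described above combines a header match with the current weight. The Python sketch below models it, assuming the header rule is checked first. `route` is a hypothetical function, and `roll` stands in for the router's random choice so that the example stays deterministic.

```python
def route(headers: dict, canary_weight: int, roll: float) -> str:
    """Pick a backend; roll is a number in [0, 100)."""
    if headers.get("X-Canary") == "true":
        return "canary"  # header match wins regardless of weight
    return "canary" if roll < canary_weight else "stable"


qa_request = route({"X-Canary": "true"}, canary_weight=0, roll=50.0)
normal_request = route({}, canary_weight=0, roll=0.0)
```

With the weight at 0, QA traffic reaches v2 while every ordinary request stays on the stable version.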
### LaunchDarkly + Argo
```
Feature flags (LD) + progressive rollout (Argo).
- Argo: % of traffic on the new version.
- LD: % of users who see the new feature.
→ Use both as separate layers.
```
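Because the two layers multiply, the share of users who actually see a new feature can be much smaller than either percentage. A small Python sketch, assuming the two selections are independent; `effective_exposure` is a hypothetical helper.

```python
def effective_exposure(traffic_pct: float, flag_pct: float) -> float:
    """Percent of all users on the new version AND with the flag enabled."""
    return traffic_pct * flag_pct / 100


# 20% of traffic on the new version, flag enabled for 50% of users.
exposure = effective_exposure(20, 50)
```

Here only 10% of users see the feature, which is worth remembering when sizing the metrics window for an analysis.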
## 🤔 Decision criteria
| Situation | Recommendation |
|---|---|
| High traffic | Canary + analysis |
| Critical | Blue-green |
| Beta / A/B | Experiment |
| GitOps | ArgoCD + Rollouts |
| Flux | Flagger |
| Service mesh in place | Istio + Argo |
| Small system | Helm rolling |
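## ❌ Anti-patterns
- **Auto-promote with no analysis**: risky.
- **First deploy straight to 100%**: add a pause and promote manually.
- **Breaking DB schema change + canary**: data gets corrupted.
- **Metric query too narrow**: false signals.
- **Manual promote only**: nothing ships unless someone is present.
- **No rollback test**: the rollback fails when it is actually needed.
- **No resource limits**: a canary can take down the cluster.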
## 🤖 LLM usage hints
- Canary + metric analysis is the modern progressive-delivery pattern.
- Blue-green is hard for stateful workloads.
- ArgoCD + Argo Rollouts = GitOps + progressive delivery.
- Flagger is the alternative.
## 🔗 Related documents
- [[DevOps_Deployment_Strategies]]
- [[DevOps_ArgoCD_GitOps]]
- [[DevOps_Service_Mesh_Deep]]