[G1-Sync] Manual knowledge update
@@ -0,0 +1,332 @@
---
id: mlops-model-monitoring
title: ML Monitoring - drift / quality / SLO
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [mlops, monitoring, vibe-coding]
tech_stack: { language: "Python", applicable_to: ["AI", "Backend"] }
applied_in: []
aliases: [ML monitoring, drift detection, data drift, concept drift, model decay, Evidently]
---

# ML Monitoring

> Models decay over time. Watch for **data drift, concept drift, prediction drift, and performance drops**. Common tools: Evidently / Arize / Fiddler / WhyLabs.

## 📖 Core concepts

- Data drift: the input (feature) distribution changes.
- Concept drift: the relationship between inputs and outputs changes.
- Prediction drift: the distribution of model outputs changes.
- Performance: measured against ground truth, which often arrives with a delay.

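To make the difference between these drift types concrete, here is a minimal sketch using a frozen toy model and hand-written data. The model, the thresholds, and all sample values are assumptions made up for illustration; nothing here comes from a real system.

```python
from statistics import mean

def model(x: float) -> int:
    """Frozen toy model: predicts 1 when the feature exceeds 0.5."""
    return int(x > 0.5)

def accuracy(xs, ys):
    return mean(int(model(x) == y) for x, y in zip(xs, ys))

def positive_rate(xs):
    """Share of positive predictions, a simple view of the output distribution."""
    return mean(model(x) for x in xs)

# Reference period: labels follow the rule the model learned.
ref_x = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]
ref_y = [0, 0, 0, 1, 1, 1]

# Data drift: inputs move to higher values, the labeling rule is unchanged.
data_drift_x = [0.6, 0.7, 0.8, 0.9, 0.95, 0.99]
data_drift_y = [1, 1, 1, 1, 1, 1]

# Concept drift: same inputs as the reference, but the true threshold moved to 0.35.
concept_drift_x = ref_x
concept_drift_y = [0, 0, 1, 1, 1, 1]

for name, xs, ys in [
    ("reference", ref_x, ref_y),
    ("data drift", data_drift_x, data_drift_y),
    ("concept drift", concept_drift_x, concept_drift_y),
]:
    print(f"{name}: input mean={mean(xs):.2f}, "
          f"positive rate={positive_rate(xs):.2f}, accuracy={accuracy(xs, ys):.2f}")
```

In this toy setup, data drift moves the input mean and the prediction distribution while accuracy holds. Concept drift leaves both untouched and only shows up once ground truth is available, which is why input monitoring alone is not enough.
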
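## 💻 Code patterns

### KS test (data drift)

```python
from scipy.stats import ks_2samp

# train_data / recent_data: DataFrames for the reference and production windows
ref = train_data['feature_x']
prod = recent_data['feature_x']

stat, pval = ks_2samp(ref, prod)
if pval < 0.05:
    # alert(): your own notification hook
    alert(f'feature_x drift! KS={stat:.3f}, p={pval:.3f}')
```

→ A small p-value means the two samples are unlikely to come from the same distribution, which is treated as drift. On large samples even tiny shifts become significant, so check the KS statistic (effect size) as well as the p-value.
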
### PSI (Population Stability Index)

```python
import numpy as np

def psi(reference, current, bins=10):
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)

    # Bin edges come from the reference range; clip current values into it
    # so out-of-range production values land in the outermost bins.
    edges = np.linspace(reference.min(), reference.max(), bins + 1)
    ref_frac = np.histogram(reference, edges)[0] / len(reference)
    cur_frac = np.histogram(np.clip(current, edges[0], edges[-1]), edges)[0] / len(current)

    # Avoid log(0) and division by zero for empty bins
    ref_frac = np.where(ref_frac == 0, 0.0001, ref_frac)
    cur_frac = np.where(cur_frac == 0, 0.0001, cur_frac)

    return np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac))

# Rule of thumb: < 0.1 stable, 0.1-0.2 moderate shift, > 0.2 significant shift
```

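### Evidently (open source)

```python
# Import paths follow the older Evidently API; newer releases reorganized
# the package, so check the version you have installed.
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, RegressionPreset

# train_data / recent_data: DataFrames with the same columns.
# RegressionPreset also needs target and prediction columns.
report = Report(metrics=[DataDriftPreset(), RegressionPreset()])
report.run(reference_data=train_data, current_data=recent_data)
report.save_html('drift_report.html')
```

→ Produces an HTML drift dashboard; the same results can feed drift checks and alerts.
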
### Arize / WhyLabs (managed)

```python
# Sketch of the record shape only; the exact client class and parameter
# names depend on the Arize SDK version, so follow the vendor docs.
import arize
client = arize.Client(api_key=...)

client.log(
    model_id='churn',
    model_version='v3.1',
    prediction_id=pred_id,
    features=feat,
    prediction=pred,
    actual=actual,  # arrives later; joined to the prediction via prediction_id
)
```

### Concept drift detection

```python
from sklearn.metrics import accuracy_score

# Concept drift shows up as performance falling over time.
# Track accuracy over a rolling window.
def rolling_accuracy(predictions, actuals, window=1000, step=100):
    return [
        accuracy_score(actuals[i:i + window], predictions[i:i + window])
        # + 1 so the last full window is included
        for i in range(0, len(predictions) - window + 1, step)
    ]

# Plot the series: a sustained downward trend suggests concept drift.
```

### Prediction drift

```python
# Track the output distribution.
# recent_predictions / train_predictions: arrays of model scores
prod_mean = recent_predictions.mean()
ref_mean = train_predictions.mean()
ref_std = train_predictions.std()

# Alert when the production mean moves more than 2 reference standard deviations
if abs(prod_mean - ref_mean) > 2 * ref_std:
    alert('prediction drift')
```

### Latency / availability SLO

```python
from prometheus_client import Histogram

# Prometheus metrics
inference_latency = Histogram(
    'inference_latency_seconds',
    'Inference latency',
    ['model'],
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0],
)

# Times the block and records the duration under the 'churn' label
with inference_latency.labels(model='churn').time():
    pred = model.predict(features)
```

→ Supports SLOs such as p99 latency < 100 ms; the 0.1 s bucket boundary lines up with that target.

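A p99 target like the one above can also be checked offline from raw latency samples, for example in a batch report. The sketch below assumes the samples are already collected in a plain list of seconds; `percentile` and `meets_latency_slo` are hypothetical helper names, and in Prometheus itself the percentile would normally be estimated from the histogram buckets instead.

```python
import math

def percentile(samples, q):
    """Nearest-rank percentile: smallest value with at least q% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

def meets_latency_slo(latencies_s, q=99, threshold_s=0.100):
    """True when the q-th percentile latency is below the threshold (seconds)."""
    return percentile(latencies_s, q) < threshold_s

# Hypothetical window of 100 requests: 98 fast ones and two slow outliers
latencies = [0.020] * 98 + [0.250, 0.300]
print(percentile(latencies, 99), meets_latency_slo(latencies))
```

With two slow requests out of 100, the 99th-ranked sample is already one of the outliers, so the SLO is violated; a single outlier in the same window would still pass.
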
### Ground truth lag

```
Click prediction: label available about 1 second later
7-day churn: ground truth arrives after 7 days
Loan default: 30+ days

→ Real-time performance metrics are not possible. Use proxy metrics.
```

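One way to handle the lag is to score only predictions that are old enough to have ground truth, and to report how many of them actually received a label. This is a minimal sketch under assumed data shapes (two in-memory dicts keyed by prediction ID); `matured_accuracy` is a hypothetical helper, and in practice the join would usually happen in a database or warehouse.

```python
from datetime import datetime, timedelta

def matured_accuracy(predictions, labels, now, label_delay):
    """Accuracy over predictions old enough to have ground truth.

    predictions: {prediction_id: (predicted_label, predicted_at)}
    labels:      {prediction_id: actual_label}, filled in as outcomes arrive
    Returns (accuracy or None, coverage), where coverage is the share of
    matured predictions that actually received a label.
    """
    matured = [pid for pid, (_, ts) in predictions.items() if now - ts >= label_delay]
    labeled = [pid for pid in matured if pid in labels]
    if not labeled:
        return None, 0.0
    correct = sum(predictions[pid][0] == labels[pid] for pid in labeled)
    return correct / len(labeled), len(labeled) / len(matured)

now = datetime(2026, 5, 9)
predictions = {
    'a': (1, now - timedelta(days=10)),
    'b': (0, now - timedelta(days=8)),
    'c': (1, now - timedelta(days=8)),   # matured, but the label never arrived
    'd': (1, now - timedelta(days=1)),   # too recent to judge
}
labels = {'a': 1, 'b': 1}
acc, coverage = matured_accuracy(predictions, labels, now, timedelta(days=7))
print(acc, coverage)
```

Low coverage is itself a signal: missing labels should be investigated rather than silently treated as correct predictions.
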
### Proxy metric

```
Click model:
- Direct: actual click rate
- Proxy: dwell time, scroll depth

LLM:
- Direct: human evaluation
- Proxy: thumbs up / down, regeneration rate
```

### Outlier detection

```python
from sklearn.ensemble import IsolationForest

iforest = IsolationForest(random_state=0).fit(train_features)

# On every inference
anomaly_score = iforest.decision_function([features])[0]
# decision_function: negative = outlier, positive = inlier.
# Scores stay within roughly -0.5..0.5, so a cutoff of -0.5 would never fire.
if anomaly_score < 0:
    log.warning('outlier input', features=features)  # assumes a structlog-style logger
```

→ Warn when an input looks unlike the training data.

### Feedback loop

```python
# User correction (app, db: your web framework app and storage layer)
@app.post('/feedback')
def feedback(prediction_id: str, correct: bool):
    # Store the user's verdict as the ground-truth signal for this prediction
    db.update(prediction_id, actual=correct)

# Retrain trigger (run periodically, e.g. from a scheduled job)
if recent_corrections.error_rate > 0.1:
    trigger_retrain()
```

### Online evaluation (LLM)

```python
# Tools: Helicone / LangSmith / Promptfoo
@trace  # placeholder for the tracing decorator your tool provides
def llm_call(prompt):
    return llm.complete(prompt)

# Automatic: latency, cost, errors
# Manual: user thumbs up / down
```

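### Shadow deployment

```python
import asyncio

# Production traffic goes to both models: old (served) and new (shadow)
def run_shadow(features):
    # log_shadow: your own synchronous logging helper
    log_shadow(features, new_model.predict(features))

@app.post('/predict')
async def predict(features):
    pred_old = old_model.predict(features)

    # Shadow: run in a worker thread so the response is not delayed
    asyncio.create_task(asyncio.to_thread(run_shadow, features))

    return pred_old
```

→ The new model's output is never served, only logged, so the two models can be compared on identical traffic.
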
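### A/B test

```python
import hashlib

def bucket_for(user_id, percent_b=10):
    # Built-in hash() is salted per process for strings, so use a stable hash
    # to keep each user in the same bucket across workers and restarts
    digest = hashlib.sha256(str(user_id).encode()).hexdigest()
    return 'B' if int(digest, 16) % 100 < percent_b else 'A'

def predict(features, user_id):
    bucket = bucket_for(user_id)  # about 10% of users go to B
    model = new_model if bucket == 'B' else old_model
    pred = model.predict(features)

    log({'bucket': bucket, 'pred': pred})
    return pred
```

→ Compare outcomes (CTR, conversion) per bucket.
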
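Comparing outcomes per bucket usually needs a significance check, because bucket B is small. One common choice for binary outcomes such as conversion is a two-proportion z-test; the sketch below is standard-library only, the function name is hypothetical, and the counts are made-up numbers for illustration.

```python
import math

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for H0: both buckets share one rate."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # All successes or all failures in both buckets: no evidence of a difference
        return 0.0, 1.0
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical counts: 9,000 users in A with 450 conversions, 1,000 in B with 70
z, p = two_proportion_z_test(450, 9000, 70, 1000)
print(f"z={z:.2f}, p={p:.4f}")
```

A positive z means B converts better than A. The normal approximation assumes reasonably large counts in both buckets, so very small experiments need a different test.
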
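### Cost

```python
# LLM
import openai
from prometheus_client import Counter

# Example metric name; adjust to your naming scheme
prom_cost = Counter('llm_cost_usd_total', 'Cumulative LLM spend in USD', ['model'])

r = openai.chat.completions.create(...)
# Placeholder flat rate per token; real pricing usually differs for
# prompt and completion tokens, so substitute your provider's rates
cost = r.usage.total_tokens * 0.00001

prom_cost.labels(model='gpt-4').inc(cost)
```

→ Track cost per request and alert when spend approaches the budget.
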
### Prompt change tracking

```python
# LangSmith / Helicone
from langsmith import traceable

@traceable
def chat(message: str, prompt_version: str = 'v3'):
    # prompt_version is a traced input, so each trace records which prompt was used
    prompt = PROMPTS[prompt_version]
    return llm.complete(prompt + message)
```

→ Record the prompt version with each trace so prompt variants can be A/B tested against outcomes.

### Bias monitoring

```python
from sklearn.metrics import accuracy_score

# Subgroup performance
for group in ['gender', 'race', 'age_bucket']:
    for value in df[group].unique():
        subset = df[df[group] == value]
        acc = accuracy_score(subset['y'], subset['pred'])
        log({'group': group, 'value': value, 'acc': acc})

# Alert when accuracy differs by more than 5 percentage points between subgroups
```

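The "Diff > 5%" rule in the block above can be turned into an explicit check. This is a minimal sketch that works on plain dict rows instead of a DataFrame; the function name, the row layout, and the sample rows are assumptions for illustration.

```python
def subgroup_accuracy_gaps(rows, group_cols, max_gap=0.05):
    """Return (column, gap, per-value accuracy) for every column whose
    best and worst subgroup accuracies differ by more than max_gap.

    rows: list of dicts holding 'y', 'pred', and the grouping columns.
    """
    alerts = []
    for col in group_cols:
        hits, totals = {}, {}
        for row in rows:
            key = row[col]
            totals[key] = totals.get(key, 0) + 1
            hits[key] = hits.get(key, 0) + int(row['y'] == row['pred'])
        acc = {k: hits[k] / totals[k] for k in totals}
        gap = max(acc.values()) - min(acc.values())
        if gap > max_gap:
            alerts.append((col, gap, acc))
    return alerts

# Made-up rows: accuracy is equal across gender but not across age buckets
rows = [
    {'gender': 'f', 'age_bucket': 'young', 'y': 1, 'pred': 1},
    {'gender': 'm', 'age_bucket': 'young', 'y': 0, 'pred': 0},
    {'gender': 'f', 'age_bucket': 'young', 'y': 1, 'pred': 1},
    {'gender': 'm', 'age_bucket': 'young', 'y': 0, 'pred': 0},
    {'gender': 'f', 'age_bucket': 'old', 'y': 1, 'pred': 1},
    {'gender': 'm', 'age_bucket': 'old', 'y': 1, 'pred': 1},
    {'gender': 'f', 'age_bucket': 'old', 'y': 1, 'pred': 0},
    {'gender': 'm', 'age_bucket': 'old', 'y': 0, 'pred': 1},
]
alerts = subgroup_accuracy_gaps(rows, ['gender', 'age_bucket'])
print(alerts)
```

Subgroups with very few rows produce noisy accuracy estimates, so a real check would likely add a minimum sample size before alerting.
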
### Model card update

```markdown
## Monitoring (live)

- Last update: 2026-05-09
- Drift: stable (PSI 0.05)
- Latency p99: 78 ms
- Error rate: 0.2%
- Accuracy (last 7d): 0.86 (↓0.01 from baseline)
```

### Retrain trigger

```
Trigger:
- Drift > threshold
- Performance drop > 5%
- Every N days
- New data volume > X

→ Automated retraining pipeline (Airflow / Vertex / SageMaker).
```

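The four triggers above can be combined into one policy check that a scheduled job evaluates before kicking off the pipeline. The sketch below is an assumption-heavy illustration: `RetrainPolicy`, `retrain_reasons`, and every default value (including N = 30 days and X = 100,000 rows) are hypothetical, and the performance drop is treated as an absolute accuracy difference.

```python
from dataclasses import dataclass

@dataclass
class RetrainPolicy:
    # All defaults are hypothetical placeholders, not recommendations
    psi_threshold: float = 0.2        # drift > threshold
    max_accuracy_drop: float = 0.05   # performance drop > 5% (absolute)
    max_age_days: int = 30            # every N days
    min_new_rows: int = 100_000       # new data volume > X

def retrain_reasons(policy, psi, baseline_acc, current_acc, days_since_train, new_rows):
    """Return the list of triggers that fired; an empty list means no retrain."""
    reasons = []
    if psi > policy.psi_threshold:
        reasons.append('drift')
    if baseline_acc - current_acc > policy.max_accuracy_drop:
        reasons.append('performance')
    if days_since_train >= policy.max_age_days:
        reasons.append('schedule')
    if new_rows > policy.min_new_rows:
        reasons.append('new_data')
    return reasons

policy = RetrainPolicy()
healthy = retrain_reasons(policy, psi=0.05, baseline_acc=0.87, current_acc=0.86,
                          days_since_train=10, new_rows=5_000)
degraded = retrain_reasons(policy, psi=0.31, baseline_acc=0.87, current_acc=0.79,
                           days_since_train=45, new_rows=5_000)
print(healthy, degraded)
```

Returning the reasons rather than a bare boolean keeps the trigger auditable: the pipeline run can record why it started.
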
### LLM eval suite

```python
# Promptfoo / LangSmith
tests = [
    {'input': 'What is 2+2?', 'expected': '4'},
    {'input': 'Capital of France?', 'expected': 'Paris'},
]

def match(actual: str, expected: str) -> bool:
    # Simple containment check; swap in a stricter or model-graded matcher as needed
    return expected.lower() in actual.lower()

for t in tests:
    actual = llm.complete(t['input'])
    pass_ = match(actual, t['expected'])
    log({'test': t, 'pass': pass_})
```

→ Run as a regression suite on every deploy.

### Production debugging

```
When a bad prediction is found:
1. Input log: is any feature an outlier?
2. Model version: was there a recent change?
3. Data pipeline: did the upstream data change?
4. Trace the incident with 5W1H (who, what, when, where, why, how)
```

### Privacy

```
Logs may contain PII.
- Hash / mask before logging
- Retention policy (delete after 30 days)
- GDPR / user deletion requests
```

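The "hash / mask before logging" step could look like the sketch below: identifiers are replaced by a keyed hash (so the same user still maps to the same token), and email addresses inside free text are masked. The field names, the `[EMAIL]` placeholder, the helper names, and the simple email pattern are all assumptions; real PII detection usually needs broader coverage than one regex.

```python
import hashlib
import hmac
import re

EMAIL_RE = re.compile(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}')

def pseudonymize(value: str, key: bytes) -> str:
    """Keyed hash: stable per value, not reversible without the key."""
    return hmac.new(key, value.encode('utf-8'), hashlib.sha256).hexdigest()[:16]

def scrub_record(record, key, id_fields=('user_id',), text_fields=('prompt',)):
    """Return a copy of the record that is safer to write to logs."""
    clean = dict(record)
    for field in id_fields:
        if field in clean:
            clean[field] = pseudonymize(str(clean[field]), key)
    for field in text_fields:
        if field in clean:
            clean[field] = EMAIL_RE.sub('[EMAIL]', clean[field])
    return clean

key = b'load-this-from-a-secret-store'  # hypothetical; never hard-code real keys
record = {'user_id': 'u123', 'prompt': 'mail me at a.b@example.com please', 'pred': 0.7}
print(scrub_record(record, key))
```

A keyed hash is used instead of a plain hash so that someone with the logs but without the key cannot confirm a guessed user ID by hashing it themselves.
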
## 🤔 Decision criteria

| Task | Recommendation |
|---|---|
| Drift detection | PSI / KS test / Evidently |
| Latency / cost | Prometheus + Grafana |
| Delayed performance signal | Proxy metric |
| Compare a new model | Shadow / A/B |
| Bias | Subgroup analysis |
| LLM | Helicone / LangSmith |
| Auto retrain | Pipeline trigger |

## ❌ Anti-patterns

- **No monitoring**: the model decays silently.
- **Offline metrics only**: you never see how production differs.
- **Assuming all is well because no ground truth has arrived**: a wrong assumption.
- **No drift threshold**: alerts are either noisy or missed.
- **No subgroup analysis**: bias stays hidden.
- **No cost tracking**: spend blows up.
- **Manual retraining**: retraining comes too late.

## 🤖 LLM usage hints

- PSI and the KS test are the standard drift metrics.
- Shadow deployment and A/B testing are the safe ways to roll out a model.
- Proxy metrics are the answer to ground-truth lag.
- Evidently, Arize, and WhyLabs make up the tooling ecosystem.

## 🔗 Related documents

- [[MLOps_Model_Registry]]
- [[AI_LLM_Eval_Patterns]]
- [[Observability_RED_USE_Metrics]]
|
||||