[G1-Sync] Manual knowledge update

2026-05-09 22:47:42 +09:00
parent 93ec7e9056
commit 21ac3ed255
56 changed files with 22043 additions and 43 deletions
@@ -0,0 +1,332 @@
+---
+id: mlops-model-monitoring
+title: ML Monitoring — drift / quality / SLO
+category: Coding
+status: draft
+source_trust_level: B
+verification_status: conceptual
+created_at: 2026-05-09
+updated_at: 2026-05-09
+tags: [mlops, monitoring, vibe-coding]
+tech_stack: { language: "Python", applicable_to: ["AI", "Backend"] }
+applied_in: []
+aliases: [ML monitoring, drift detection, data drift, concept drift, model decay, Evidently]
+---
+
+# ML Monitoring
+
+> Models decay over time. Watch for **data drift, concept drift, prediction drift, and performance drops**. Common tools: Evidently, Arize, Fiddler, WhyLabs.
+
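+## 📖 Key concepts
+- Data drift: the input distribution changes.
+- Concept drift: the relationship between inputs and outputs changes.
+- Prediction drift: the output distribution changes.
+- Performance: measured against ground truth, which often arrives with a delay.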
+
+## 💻 Code patterns
+
+### KS test (data drift)
+```python
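+from scipy.stats import ks_2samp
+
+# train_data / recent_data: DataFrames holding the training and recent production samples
+ref = train_data['feature_x']
+prod = recent_data['feature_x']
+
+stat, pval = ks_2samp(ref, prod)
+if pval < 0.05:
+    # alert() stands in for your notification hook
+    alert(f'feature_x drift! KS={stat:.3f}, p={pval:.3f}')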
+```
+
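+→ A small p-value means the two samples are unlikely to share one distribution, which is treated as drift. With very large samples even negligible shifts become significant, so read the KS statistic as an effect size too.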
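To make the test concrete, here is a minimal standard-library sketch that computes the two-sample KS statistic by hand and compares it with the large-sample critical value at alpha = 0.05. It is a simplified stand-in for `ks_2samp`, not a replacement; `ks_statistic`, `ks_drifted`, and the synthetic samples are hypothetical names and data.

```python
import bisect
import math
import random

def ks_statistic(ref, cur):
    """Largest gap between the two empirical CDFs."""
    ref_sorted, cur_sorted = sorted(ref), sorted(cur)
    d = 0.0
    for x in ref_sorted + cur_sorted:
        # ECDF at x = fraction of samples <= x
        f_ref = bisect.bisect_right(ref_sorted, x) / len(ref_sorted)
        f_cur = bisect.bisect_right(cur_sorted, x) / len(cur_sorted)
        d = max(d, abs(f_ref - f_cur))
    return d

def ks_drifted(ref, cur, c_alpha=1.358):
    """Flag drift when D exceeds the asymptotic critical value.

    c_alpha = 1.358 corresponds to alpha = 0.05 for large samples.
    """
    n, m = len(ref), len(cur)
    critical = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(ref, cur) > critical

rng = random.Random(0)
reference = [rng.gauss(0.0, 1.0) for _ in range(2000)]
shifted = [rng.gauss(1.0, 1.0) for _ in range(2000)]  # mean moved by one sigma
drift_detected = ks_drifted(reference, shifted)
```

The statistic itself is worth logging next to the alert, since it stays interpretable when the sample size makes every p-value tiny.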
+
+### PSI (Population Stability Index)
+```python
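+import numpy as np
+
+def psi(reference, current, bins=10):
+    reference = np.asarray(reference, dtype=float)
+    # Bin edges come from the reference range; clip current values into that
+    # range so out-of-range points land in the outer bins instead of being dropped.
+    edges = np.linspace(reference.min(), reference.max(), bins + 1)
+    current = np.clip(np.asarray(current, dtype=float), edges[0], edges[-1])
+    ref_hist = np.histogram(reference, edges)[0] / len(reference)
+    cur_hist = np.histogram(current, edges)[0] / len(current)
+
+    # Avoid log(0) and division by zero for empty bins
+    ref_hist = np.where(ref_hist == 0, 0.0001, ref_hist)
+    cur_hist = np.where(cur_hist == 0, 0.0001, cur_hist)
+
+    return np.sum((cur_hist - ref_hist) * np.log(cur_hist / ref_hist))
+
+# Rule of thumb: < 0.1 stable, 0.1-0.2 moderate shift, > 0.2 significant shift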
+```
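The same formula applies to categorical features if each category is treated as a bin. The sketch below is one way to do that with the standard library only; `categorical_psi` and the traffic-source data are hypothetical, and the `eps` floor mirrors the 0.0001 guard used above.

```python
import math
from collections import Counter

def categorical_psi(reference, current, eps=1e-4):
    """PSI over category shares; eps replaces empty shares to avoid log(0)."""
    ref_counts, cur_counts = Counter(reference), Counter(current)
    total = 0.0
    for category in set(ref_counts) | set(cur_counts):
        ref_share = ref_counts[category] / len(reference) or eps
        cur_share = cur_counts[category] / len(current) or eps
        total += (cur_share - ref_share) * math.log(cur_share / ref_share)
    return total

# Hypothetical traffic mix at training time vs. in production
reference = ["web"] * 70 + ["ios"] * 20 + ["android"] * 10
current = ["web"] * 40 + ["ios"] * 30 + ["android"] * 30
score = categorical_psi(reference, current)
```

With these shares the score lands well above the 0.2 rule-of-thumb line, so this shift would count as significant.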
+
+### Evidently (open source)
+```python
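+# Import paths below follow Evidently's older Report API; newer releases
+# reorganized the package, so check the docs for your installed version.
+from evidently.report import Report
+from evidently.metric_preset import DataDriftPreset, RegressionPreset
+
+# ref / prod: pandas DataFrames with the same columns.
+# RegressionPreset also expects target and prediction columns.
+report = Report(metrics=[DataDriftPreset(), RegressionPreset()])
+report.run(reference_data=ref, current_data=prod)
+report.save_html('drift_report.html')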
+```
+
+→ Produces an HTML report that can feed dashboards, drift detection, and alerting.
+
+### Arize / WhyLabs (managed)
+```python
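+# Illustrative pseudocode: the real Arize SDK client, its constructor arguments,
+# and the log() parameters differ by version, so follow the vendor docs.
+import arize
+client = arize.Client(api_key=...)
+
+client.log(
+    model_id='churn',
+    model_version='v3.1',
+    prediction_id=pred_id,
+    features=feat,
+    prediction=pred,
+    actual=actual,  # ground truth, arrives later
+)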
+```
+
+### Concept drift detection
+```python
+from sklearn.metrics import accuracy_score
+
+# Performance degrades over time: track accuracy over a rolling window
+def rolling_accuracy(predictions, actuals, window=1000, step=100):
+    return [
+        accuracy_score(actuals[i:i+window], predictions[i:i+window])
+        for i in range(0, len(predictions) - window + 1, step)
+    ]
+
+# Plot the series: a sustained downward trend suggests concept drift
+```
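For streaming use, the same idea can run incrementally and raise a flag rather than rely on a plot. This is a minimal sketch assuming labels arrive one at a time; the `RollingAccuracy` class, the 0.90 baseline, and the 0.05 allowed drop are hypothetical choices.

```python
from collections import deque

class RollingAccuracy:
    """Tracks accuracy over the most recent `window` labeled predictions."""

    def __init__(self, window=1000, baseline=0.90, max_drop=0.05):
        self.hits = deque(maxlen=window)
        self.baseline = baseline
        self.max_drop = max_drop

    def update(self, prediction, actual):
        self.hits.append(prediction == actual)

    @property
    def accuracy(self):
        return sum(self.hits) / len(self.hits) if self.hits else None

    def degraded(self):
        # Only judge once the window is full, to avoid noisy early alerts
        if len(self.hits) < self.hits.maxlen:
            return False
        return self.accuracy < self.baseline - self.max_drop

monitor = RollingAccuracy(window=100, baseline=0.90, max_drop=0.05)
for _ in range(100):
    monitor.update(prediction=1, actual=1)      # every prediction correct
healthy = monitor.degraded()
for i in range(100):
    monitor.update(prediction=1, actual=i % 2)  # half of the predictions wrong
after_shift = monitor.degraded()
```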
+
+### Prediction drift
+```python
+# Track the output distribution
+prod_mean = recent_predictions.mean()
+ref_mean = train_predictions.mean()
+ref_std = train_predictions.std()
+
+# Alert when the production mean moves more than 2 reference standard deviations
+if abs(prod_mean - ref_mean) > 2 * ref_std:
+    alert('prediction drift')
+```
+
+### Latency / availability SLO
+```python
+from prometheus_client import Histogram
+
+# Prometheus metrics
+inference_latency = Histogram(
+    'inference_latency_seconds',
+    'Inference latency',
+    ['model'],
+    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0],
+)
+
+with inference_latency.labels(model='churn').time():
+    pred = model.predict(features)
+```
+
+→ Supports SLOs such as p99 latency < 100 ms.
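As a rough illustration of what a p99 target means, the sketch below checks the SLO directly on raw latency samples using a nearest-rank percentile. Prometheus instead estimates quantiles from histogram buckets, so treat this as an offline sanity check; `percentile`, `slo_met`, and the sample latencies are hypothetical.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def slo_met(latencies_s, pct=99, target_s=0.100):
    """True when the chosen percentile is under the latency target."""
    return percentile(latencies_s, pct) < target_s

# 1000 hypothetical request latencies in seconds: 985 fast, 15 slow
latencies = [0.020] * 985 + [0.250] * 15
p95 = percentile(latencies, 95)
p99 = percentile(latencies, 99)
ok = slo_met(latencies)
```

Here 1.5% of requests are slow, so p95 still looks healthy while p99 breaches the 100 ms target. That gap is why the SLO is stated on the tail rather than on the mean.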
+
+### Ground truth lag
+```
+Click prediction: ground truth within about 1 second
+Churn (7-day window): ground truth after 7 days
+Loan default: 30+ days
+
+→ Real-time performance metrics are not available. Use proxy metrics.
+```
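One way to handle the lag is to score only predictions old enough for their label to exist, and to count labels that are overdue. The sketch below assumes predictions and labels are keyed by a shared prediction ID; `matured_accuracy` and the sample records are hypothetical.

```python
from datetime import datetime, timedelta

def matured_accuracy(predictions, labels, now, label_delay):
    """Accuracy over predictions old enough that their label should have arrived.

    predictions: {prediction_id: (predicted_label, predicted_at)}
    labels:      {prediction_id: actual_label}
    Returns (accuracy or None, count of matured predictions still missing a label).
    """
    correct = scored = missing = 0
    for pred_id, (predicted, predicted_at) in predictions.items():
        if now - predicted_at < label_delay:
            continue  # too recent: ground truth cannot exist yet
        if pred_id not in labels:
            missing += 1
            continue
        scored += 1
        correct += predicted == labels[pred_id]
    return (correct / scored if scored else None), missing

now = datetime(2026, 5, 9)
predictions = {
    "a": ("churn", now - timedelta(days=10)),
    "b": ("stay", now - timedelta(days=9)),
    "c": ("churn", now - timedelta(days=8)),
    "d": ("stay", now - timedelta(days=2)),  # still inside the 7-day window
}
labels = {"a": "churn", "b": "churn"}
accuracy, missing = matured_accuracy(predictions, labels, now, timedelta(days=7))
```

A growing `missing` count is itself a signal: it usually points at a broken label pipeline rather than at the model.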
+
+### Proxy metric
+```
+Click model:
+- Direct: actual click rate
+- Proxy: dwell time, scroll depth
+
+LLM:
+- Direct: human eval
+- Proxy: thumbs up / down, regen rate
+```
+
+### Outlier detection
+```python
+from sklearn.ensemble import IsolationForest
+
+iforest = IsolationForest().fit(train_features)
+
+# On every inference request
+# decision_function returns one score per row; negative scores are outliers
+anomaly_score = iforest.decision_function([features])[0]
+if anomaly_score < 0:
+    log.warning('outlier input', features=features)
+```
+
+→ Warn when an input looks unlike the training data.
+
+### Feedback loop
+```python
+# User correction
+@app.post('/feedback')
+def feedback(prediction_id: str, correct: bool):
+    db.update(prediction_id, actual=correct)
+    
+    # Retrain trigger
+    if recent_corrections.error_rate > 0.1:
+        trigger_retrain()
+```
+
+### Online evaluation (LLM)
+```python
+# Helicone / LangSmith / Promptfoo
+@trace
+def llm_call(prompt):
+    return llm.complete(prompt)
+
+# Auto: latency, cost, error
+# Manual: user thumbs up/down
+```
+
+### Shadow deployment
+```python
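+import asyncio
+
+shadow_tasks = set()  # hold references so pending tasks are not garbage-collected
+
+async def run_shadow(features):
+    # Run the new model off the request path, in a worker thread
+    pred_new = await asyncio.to_thread(new_model.predict, features)
+    await log_shadow(features, pred_new)
+
+# Production traffic goes to both models; only the old model's output is served
+@app.post('/predict')
+async def predict(features):
+    pred_old = old_model.predict(features)
+    task = asyncio.create_task(run_shadow(features))
+    shadow_tasks.add(task)
+    task.add_done_callback(shadow_tasks.discard)
+    return pred_old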
+```
+
+→ The new model's predictions are never served, only logged, so you can compare them with the old model's offline.
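The comparison step could look like the sketch below, which summarizes how often the shadow model's score stays close to the serving model's. The function name `shadow_summary`, the 0.05 tolerance, and the log format of `(old, new)` pairs are assumptions for illustration.

```python
def shadow_summary(records, tolerance=0.05):
    """Summarize how often the shadow model agrees with the serving model.

    records: iterable of (old_prediction, new_prediction) numeric pairs.
    """
    records = list(records)
    if not records:
        raise ValueError("no shadow records to compare")
    diffs = [abs(new - old) for old, new in records]
    agree = sum(d <= tolerance for d in diffs)
    return {
        "n": len(records),
        "agreement_rate": agree / len(records),
        "max_abs_diff": max(diffs),
    }

# Hypothetical shadow log: (score served by old model, score from new model)
shadow_log = [(0.10, 0.12), (0.80, 0.79), (0.40, 0.70), (0.55, 0.56)]
summary = shadow_summary(shadow_log)
```

Large disagreements, such as the third pair here, are the rows worth inspecting by hand before promoting the new model.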
+
+### A/B test
+```python
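+import hashlib
+
+def predict(features, user_id):
+    # Built-in hash() is salted per process for strings, so use a stable digest
+    bucket_id = int(hashlib.sha256(str(user_id).encode()).hexdigest(), 16) % 100
+    if bucket_id < 10:  # 10% of users go to B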
+        pred = new_model.predict(features)
+        bucket = 'B'
+    else:
+        pred = old_model.predict(features)
+        bucket = 'A'
+    
+    log({'bucket': bucket, 'pred': pred})
+    return pred
+```
+
+→ Compare outcomes (CTR, conversion) per bucket.
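Before acting on a difference between buckets, it helps to check that it is larger than noise. A minimal sketch using a pooled two-proportion z-test follows; the function name and the conversion counts are hypothetical, and the test assumes independent users and reasonably large samples.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for H0: both buckets share one conversion rate."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF, computed via erf
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical counts: bucket A (90% of traffic) vs. bucket B (10%)
z, p_value = two_proportion_z_test(conv_a=900, n_a=9000, conv_b=130, n_b=1000)
```

Deciding the sample size and the significance level before the experiment starts avoids stopping early on a lucky streak.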
+
+### Cost
+```python
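+# LLM
+import openai
+r = openai.chat.completions.create(...)
+# Placeholder flat rate per token; real pricing usually differs for input and output tokens
+cost = r.usage.total_tokens * 0.00001
+
+# prom_cost: a Prometheus Counter labeled by model
+prom_cost.labels(model='gpt-4').inc(cost)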
+```
+
+→ Track cost per request and alert when spend approaches the budget.
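Because prompt and completion tokens are typically priced differently, a slightly more careful sketch separates them and adds a simple budget check. The `Pricing` and `BudgetTracker` classes and all prices below are hypothetical placeholders, not real provider rates.

```python
from dataclasses import dataclass

@dataclass
class Pricing:
    # Placeholder prices in USD per 1M tokens; replace with your provider's current rates
    input_per_million: float
    output_per_million: float

def request_cost(prompt_tokens, completion_tokens, pricing):
    """Cost of one request in USD."""
    return (prompt_tokens * pricing.input_per_million
            + completion_tokens * pricing.output_per_million) / 1_000_000

class BudgetTracker:
    def __init__(self, daily_budget_usd, alert_ratio=0.8):
        self.daily_budget_usd = daily_budget_usd
        self.alert_ratio = alert_ratio
        self.spent = 0.0

    def add(self, cost):
        """Record a request cost; return True once spend crosses the alert line."""
        self.spent += cost
        return self.spent >= self.alert_ratio * self.daily_budget_usd

pricing = Pricing(input_per_million=10.0, output_per_million=30.0)  # hypothetical
cost = request_cost(prompt_tokens=1_000, completion_tokens=500, pricing=pricing)
tracker = BudgetTracker(daily_budget_usd=0.03)
alert = tracker.add(cost)
```

The token counts would come from the response's `usage.prompt_tokens` and `usage.completion_tokens` fields.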
+
+### Tracking prompt changes
+```python
+# LangSmith / Helicone
+@traceable
+def chat(message: str, prompt_version: str = 'v3'):
+    prompt = PROMPTS[prompt_version]
+    return llm.complete(prompt + message)
+```
+
+→ Lets you A/B test prompt versions and tie each version to its outcomes.
+
+### Bias monitoring
+```python
+from sklearn.metrics import accuracy_score
+
+# Subgroup performance
+for group in ['gender', 'race', 'age_bucket']:
+    for value in df[group].unique():
+        subset = df[df[group] == value]
+        acc = accuracy_score(subset.y, subset.pred)
+        log({'group': group, 'value': value, 'acc': acc})
+
+# Alert when accuracy differs by more than 5% between subgroups
+```
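The alert rule in the final comment can be made explicit by computing the gap between the best and worst subgroup. This is a dependency-free sketch; `subgroup_accuracy`, `accuracy_gap`, and the sample rows are hypothetical, and the 5% line is read here as 0.05 in absolute accuracy.

```python
from collections import defaultdict

def subgroup_accuracy(rows, group_key):
    """rows: dicts with 'y', 'pred', and the grouping column."""
    hits, counts = defaultdict(int), defaultdict(int)
    for row in rows:
        counts[row[group_key]] += 1
        hits[row[group_key]] += row["y"] == row["pred"]
    return {value: hits[value] / counts[value] for value in counts}

def accuracy_gap(rows, group_key):
    """Difference between the best and worst subgroup accuracy."""
    acc = subgroup_accuracy(rows, group_key)
    return max(acc.values()) - min(acc.values())

rows = (
    [{"age_bucket": "18-30", "y": 1, "pred": 1}] * 9
    + [{"age_bucket": "18-30", "y": 1, "pred": 0}] * 1
    + [{"age_bucket": "60+", "y": 1, "pred": 1}] * 7
    + [{"age_bucket": "60+", "y": 1, "pred": 0}] * 3
)
by_group = subgroup_accuracy(rows, "age_bucket")
gap = accuracy_gap(rows, "age_bucket")
needs_alert = gap > 0.05
```

Small subgroups give noisy accuracy estimates, so a minimum sample size per group is a sensible extra guard before alerting.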
+
+### Model card update
+```markdown
+## Monitoring (live)
+
+- Last update: 2026-05-09
+- Drift: stable (PSI 0.05)
+- Latency p99: 78ms
+- Error rate: 0.2%
+- Accuracy (last 7d): 0.86 (↓0.01 from baseline)
+```
+
+### Retrain trigger
+```
+Trigger:
+- Drift > threshold
+- Performance drop > 5%
+- Every N days
+- Volume of new data > X
+
+→ Automated retraining pipeline (Airflow / Vertex / SageMaker).
+```
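The trigger list above can be sketched as a single decision function that also reports which rule fired, which keeps retraining runs auditable. All names and thresholds here (`MonitorState`, `retrain_reasons`, PSI 0.2, 30 days, 100,000 rows) are hypothetical, and the 5% drop is interpreted as relative to the baseline.

```python
from dataclasses import dataclass

@dataclass
class MonitorState:
    psi: float
    baseline_accuracy: float
    current_accuracy: float
    days_since_training: int
    new_labeled_rows: int

def retrain_reasons(state, psi_threshold=0.2, max_relative_drop=0.05,
                    max_age_days=30, min_new_rows=100_000):
    """Return the list of triggers that fired; an empty list means no retrain."""
    reasons = []
    if state.psi > psi_threshold:
        reasons.append("drift")
    drop = (state.baseline_accuracy - state.current_accuracy) / state.baseline_accuracy
    if drop > max_relative_drop:
        reasons.append("performance_drop")
    if state.days_since_training >= max_age_days:
        reasons.append("schedule")
    if state.new_labeled_rows > min_new_rows:
        reasons.append("new_data")
    return reasons

state = MonitorState(psi=0.05, baseline_accuracy=0.87, current_accuracy=0.86,
                     days_since_training=45, new_labeled_rows=20_000)
reasons = retrain_reasons(state)
```

A scheduler task would call this on each monitoring run and start the pipeline only when the list is non-empty.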
+
+### LLM eval suite
+```python
+# Promptfoo / LangSmith
+tests = [
+    {'input': 'What is 2+2?', 'expected': '4'},
+    {'input': 'Capital of France?', 'expected': 'Paris'},
+]
+
+for t in tests:
+    actual = llm.complete(t['input'])
+    pass_ = match(actual, t['expected'])
+    log({'test': t, 'pass': pass_})
+```
+
+→ Run as a regression suite on every deploy.
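The `match` helper is left undefined in the suite. One simple possibility, shown here as an assumption rather than what any particular tool does, is a normalized whole-word containment check; stricter or model-graded matching may suit open-ended answers better.

```python
import re

def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())

def match(actual, expected):
    """Pass if the expected answer appears as whole words in the model output."""
    pattern = r"\b" + re.escape(normalize(expected)) + r"\b"
    return re.search(pattern, normalize(actual)) is not None

def pass_rate(results):
    """Fraction of test cases that passed."""
    return sum(results) / len(results)

results = [
    match("The answer is 4.", "4"),
    match("Paris is the capital of France.", "paris"),
    match("It is 42.", "4"),  # '4' inside '42' must not count as a match
]
rate = pass_rate(results)
```

Tracking the pass rate per deploy turns the suite into a trend line, so a regression shows up as a drop rather than as a single failing case.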
+
+### Production debugging
+```
+When you find a bad prediction:
+1. Input log: was a feature an outlier?
+2. Model version: was there a recent change?
+3. Data pipeline: did the data change?
+4. Trace the 5W1H (who, what, when, where, why, how)
+```
+
+### Privacy
+```
+Logs may contain PII.
+- Hash / mask before logging
+- Retention policy (delete after 30 days)
+- GDPR / user deletion requests
+```
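A minimal sketch of the hash-and-mask step follows, assuming a record with a `user_id` and a free-text `prompt`. The field names, the `scrub` helper, and the email pattern are hypothetical, and a single regex will not catch every kind of PII.

```python
import hashlib
import hmac
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def pseudonymize(value, secret_key):
    """Keyed hash: stable for joins, not reversible without the key."""
    return hmac.new(secret_key, value.encode(), hashlib.sha256).hexdigest()[:16]

def scrub(record, secret_key, id_fields=("user_id",), text_fields=("prompt",)):
    """Return a copy of the record that is safer to log."""
    clean = dict(record)
    for field in id_fields:
        if field in clean:
            clean[field] = pseudonymize(str(clean[field]), secret_key)
    for field in text_fields:
        if field in clean:
            clean[field] = EMAIL_RE.sub("[EMAIL]", clean[field])
    return clean

key = b"example-key-load-from-a-secret-store"  # hypothetical; never hard-code in production
event = {"user_id": "u-123", "prompt": "Contact me at jane.doe@example.com", "pred": 0.7}
safe = scrub(event, key)
```

A keyed hash keeps the same user joinable across log lines, while rotating or deleting the key severs that link.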
+
+## 🤔 Decision criteria
+| Task | Recommendation |
+|---|---|
+| Drift 감지 | PSI / KS test / Evidently |
+| Latency / cost | Prometheus + Grafana |
+| Performance lag | Proxy metric |
+| Compare new model | Shadow / A/B |
+| Bias | Subgroup analysis |
+| LLM | Helicone / LangSmith |
+| Auto retrain | Pipeline trigger |
+
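+## ❌ Anti-patterns
+- **No monitoring**: silent decay.
+- **Offline metrics only**: you never see how production differs.
+- **Assuming all is well because no ground truth has arrived**: a wrong assumption.
+- **No drift threshold**: noisy alerts or missed drift.
+- **No subgroup analysis**: latent bias goes unnoticed.
+- **No cost tracking**: costs blow up.
+- **Manual retraining**: retraining comes too late.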
+
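+## 🤖 Hints for LLM use
+- PSI and KS are the standard drift metrics.
+- Shadow and A/B deployments are the safe way to roll out.
+- Proxy metrics are the answer to ground-truth lag.
+- Evidently, Arize, and WhyLabs make up the tooling ecosystem.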
+
+## 🔗 Related documents
+- [[MLOps_Model_Registry]]
+- [[AI_LLM_Eval_Patterns]]
+- [[Observability_RED_USE_Metrics]]