id: mlops-model-monitoring
title: ML Monitoring - drift / quality / SLO
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: mlops, monitoring, vibe-coding
tech_stack: language: Python; applicable_to: AI, Backend
applied_in:
aliases: ML monitoring, drift detection, data drift, concept drift, model decay, Evidently

ML Monitoring

Models decay over time. The main failure modes are data drift, concept drift, prediction drift, and performance drops. Common tools: Evidently, Arize, Fiddler, WhyLabs.

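📖 Core concepts

  • Data drift: the input distribution changes.
  • Concept drift: the relationship between inputs and outputs changes.
  • Prediction drift: the output distribution changes.
  • Performance: measured against ground truth, which arrives with a delay.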

💻 Code patterns

KS test (data drift)

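from scipy.stats import ks_2samp

# Same feature from the training set (reference) and from recent production traffic
ref = train_data['feature_x']
prod = recent_data['feature_x']

stat, pval = ks_2samp(ref, prod)
if pval < 0.05:
    # alert() is a placeholder for your notification hook
    alert(f'feature_x drift! KS={stat:.3f}, p={pval:.3f}')

→ A significant difference between the two distributions signals drift. With large samples even tiny shifts reach p < 0.05, so check the KS statistic as well as the p-value.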

PSI (Population Stability Index)

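import numpy as np

def psi(reference, current, bins=10):
    reference, current = np.asarray(reference), np.asarray(current)
    edges = np.linspace(reference.min(), reference.max(), bins + 1)
    # Clip so production values outside the reference range land in the edge bins
    current = np.clip(current, edges[0], edges[-1])
    ref_frac = np.histogram(reference, edges)[0] / len(reference)
    cur_frac = np.histogram(current, edges)[0] / len(current)

    # Avoid log(0) and division by zero for empty bins
    ref_frac = np.where(ref_frac == 0, 0.0001, ref_frac)
    cur_frac = np.where(cur_frac == 0, 0.0001, cur_frac)

    return np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac))

# Rule of thumb: < 0.1 = stable, 0.1-0.2 = moderate shift, > 0.2 = significant shift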
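To make the thresholds concrete, here is a small worked example. It assumes four bins with made-up bin fractions (illustrative numbers, not real data) and applies the same sum as `psi` above: each bin contributes (current - reference) * ln(current / reference).

```python
import math

# Hypothetical bin fractions; each list sums to 1.
ref_frac = [0.25, 0.25, 0.25, 0.25]   # reference (training) share per bin
cur_frac = [0.10, 0.20, 0.30, 0.40]   # current (production) share per bin

# Per-bin contribution: (current - reference) * ln(current / reference)
contributions = [(c - r) * math.log(c / r) for r, c in zip(ref_frac, cur_frac)]
psi_value = sum(contributions)

for r, c, term in zip(ref_frac, cur_frac, contributions):
    print(f"ref={r:.2f} cur={c:.2f} term={term:.4f}")
print(f"PSI = {psi_value:.3f}")
```

Every term is non-negative because the difference and the log ratio always share a sign. Here the total comes to roughly 0.23, which the rule of thumb would classify as a significant shift.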

Evidently (open source)

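# Import paths match older Evidently releases; newer releases reorganized the package
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, RegressionPreset

# ref / prod: pandas DataFrames with the same columns.
# RegressionPreset also needs target and prediction columns.
report = Report(metrics=[DataDriftPreset(), RegressionPreset()])
report.run(reference_data=ref, current_data=prod)
report.save_html('drift_report.html')

→ Produces an HTML report for drift detection; dashboards and alerts can be built on top of it.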

Arize / WhyLabs (managed)

# Schematic only: the import path and argument names in the real Arize SDK may differ
import arize
client = arize.Client(api_key=...)

client.log(
    model_id='churn',
    model_version='v3.1',
    prediction_id=pred_id,
    features=feat,
    prediction=pred,
    actual=actual,  # arrives later
)

Concept drift detection

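from sklearn.metrics import accuracy_score

# Concept drift shows up as performance degrading over time,
# so track accuracy over a rolling window.
def rolling_accuracy(predictions, actuals, window=1000, step=100):
    return [
        accuracy_score(actuals[i:i+window], predictions[i:i+window])
        # + 1 so the final full window is included
        for i in range(0, len(predictions) - window + 1, step)
    ]

# Plot the scores: a sustained downward trend suggests drift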
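One way to turn the plot into an automated check is to compare each window against a baseline. The sketch below is stdlib-only and runs on synthetic labels; the `baseline` and `max_drop` parameters are assumptions for illustration, not values from this note.

```python
def rolling_accuracy(predictions, actuals, window=1000, step=100):
    """Accuracy over sliding windows of `window` items, advancing by `step`."""
    scores = []
    for start in range(0, len(predictions) - window + 1, step):
        pairs = zip(predictions[start:start + window], actuals[start:start + window])
        scores.append(sum(p == a for p, a in pairs) / window)
    return scores


def first_degraded_window(scores, baseline, max_drop=0.05):
    """Index of the first window more than `max_drop` below baseline, or None."""
    for i, score in enumerate(scores):
        if baseline - score > max_drop:
            return i
    return None


# Synthetic stream: always right for 2000 items, then wrong on every other item.
actuals = [1] * 4000
predictions = [1] * 2000 + [1, 0] * 1000

scores = rolling_accuracy(predictions, actuals, window=1000, step=500)
degraded_at = first_degraded_window(scores, baseline=scores[0])
print(scores, degraded_at)
```

In practice the baseline would come from a validation set or an early production period rather than from the first window.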

Prediction drift

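# Track the output distribution
prod_mean = recent_predictions.mean()
prod_std = recent_predictions.std()
ref_mean = train_predictions.mean()
ref_std = train_predictions.std()

# Coarse heuristic: flag when the mean moves more than 2 reference standard deviations
if abs(prod_mean - ref_mean) > 2 * ref_std:
    alert(f'prediction drift: mean {ref_mean:.3f} -> {prod_mean:.3f}, std {ref_std:.3f} -> {prod_std:.3f}')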

Latency / availability SLO

from prometheus_client import Histogram

# Prometheus metrics
inference_latency = Histogram(
    'inference_latency_seconds',
    'Inference latency',
    ['model'],
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0],
)

# Records the duration of the block as one observation
with inference_latency.labels(model='churn').time():
    pred = model.predict(features)

→ Enables SLOs such as p99 latency < 100 ms.
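Prometheus estimates quantiles from histogram buckets at query time. For an offline sanity check, the same SLO can be evaluated directly on raw latency samples. A minimal sketch, assuming latencies in seconds and a hypothetical 100 ms target:

```python
import statistics

def p99(latencies):
    """99th percentile; quantiles(n=100) returns 99 cut points, the last is p99."""
    return statistics.quantiles(latencies, n=100, method="inclusive")[98]


def meets_slo(latencies, target_seconds=0.1):
    """True if the p99 latency is below the target."""
    return p99(latencies) < target_seconds


# Synthetic samples: one all-fast set, one with a 2% slow tail.
fast_only = [0.02] * 1000
with_tail = [0.02] * 980 + [0.5] * 20

print(meets_slo(fast_only))
print(meets_slo(with_tail))
```

A mean would hide the slow tail in the second sample, which is why latency SLOs are usually stated on a high percentile.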

Ground truth lag

Click prediction: ground truth within about 1 second
7-day churn: ground truth after 7 days
Loan default: 30+ days

→ Real-time performance metrics are not possible when labels lag. Use proxy metrics.
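Because labels arrive late, accuracy can only be computed on predictions whose ground truth has already come in. A minimal sketch of that join, assuming predictions and labels are keyed by a shared `prediction_id` (the record layout here is hypothetical):

```python
def matured_accuracy(predictions, labels):
    """Accuracy over predictions that already have a ground-truth label.

    predictions: dict of prediction_id -> predicted value
    labels:      dict of prediction_id -> actual value (only those that arrived)
    Returns (accuracy or None, coverage), where coverage is the labeled share.
    """
    matched = [(predictions[pid], actual)
               for pid, actual in labels.items() if pid in predictions]
    if not matched:
        return None, 0.0
    correct = sum(pred == actual for pred, actual in matched)
    return correct / len(matched), len(matched) / len(predictions)


predictions = {"p1": 1, "p2": 0, "p3": 1, "p4": 1}
labels = {"p1": 1, "p2": 1}   # p3 and p4 are still waiting for ground truth

accuracy, coverage = matured_accuracy(predictions, labels)
print(accuracy, coverage)
```

Reporting coverage next to accuracy makes it visible when the metric rests on a small, possibly unrepresentative, labeled slice.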

Proxy metric

Click model:
- Direct: actual click rate
- Proxy: dwell time, scroll depth

LLM:
- Direct: human eval
- Proxy: thumbs up / down, regeneration rate

Outlier detection

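from sklearn.ensemble import IsolationForest

iforest = IsolationForest().fit(train_features)

# On every inference
anomaly_score = iforest.decision_function([features])[0]
# decision_function is negative for outliers; lower means more anomalous
if anomaly_score < 0:
    log.warning('outlier input', features=features)  # structlog-style logger assumed

→ Warn when an input looks unlike the training data.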

Feedback loop

# User correction
@app.post('/feedback')
def feedback(prediction_id: str, correct: bool):
    db.update(prediction_id, actual=correct)

    # Retrain trigger: recent_corrections is assumed to track a rolling error rate
    if recent_corrections.error_rate > 0.1:
        trigger_retrain()
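The `recent_corrections.error_rate` used in the handler is left undefined. One possible shape for it is a fixed-size window over the latest feedback events. The class name, window size, and `min_samples` guard below are assumptions for illustration.

```python
from collections import deque

class RecentCorrections:
    """Tracks the error rate over the most recent `maxlen` feedback events."""

    def __init__(self, maxlen=500, min_samples=50):
        self._events = deque(maxlen=maxlen)   # True = prediction was wrong
        self._min_samples = min_samples

    def add(self, correct: bool) -> None:
        self._events.append(not correct)

    @property
    def error_rate(self) -> float:
        # Report 0.0 until enough feedback has arrived, to avoid noisy triggers.
        if len(self._events) < self._min_samples:
            return 0.0
        return sum(self._events) / len(self._events)


recent = RecentCorrections(maxlen=100, min_samples=10)
for i in range(100):
    recent.add(correct=(i % 5 != 0))   # every fifth prediction is wrong

should_retrain = recent.error_rate > 0.1
print(recent.error_rate, should_retrain)
```

The `min_samples` guard keeps a handful of early corrections from triggering a retrain on their own.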

Online evaluation (LLM)

# Helicone / LangSmith / Promptfoo
# @trace stands in for the tracing decorator of whichever tool you use
@trace
def llm_call(prompt):
    return llm.complete(prompt)

# Automatic: latency, cost, errors
# Manual: user thumbs up / down

Shadow deployment

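import asyncio

# Production traffic goes to both models: old and new
async def run_shadow(features):
    # Run the candidate model off the request path, then log its output
    pred_new = await asyncio.to_thread(new_model.predict, features)
    await log_shadow(features, pred_new)

@app.post('/predict')
async def predict(features):
    pred_old = old_model.predict(features)

    # Shadow: fire and forget, so the response never waits on the new model
    asyncio.create_task(run_shadow(features))

    return pred_old

→ The new model's output is never served, but it is logged so the two models can be compared.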
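Once shadow predictions are logged, a first comparison is how often the two models agree. A minimal sketch, assuming each log record holds both predictions under the hypothetical keys `old` and `new`:

```python
def agreement_rate(records):
    """Share of requests where the shadow model matched the serving model."""
    if not records:
        raise ValueError("no shadow records to compare")
    return sum(r["old"] == r["new"] for r in records) / len(records)


def disagreements(records):
    """Records worth manual review: the two models returned different outputs."""
    return [r for r in records if r["old"] != r["new"]]


shadow_log = [
    {"request_id": "a", "old": 1, "new": 1},
    {"request_id": "b", "old": 0, "new": 1},
    {"request_id": "c", "old": 0, "new": 0},
    {"request_id": "d", "old": 1, "new": 1},
]

rate = agreement_rate(shadow_log)
to_review = [r["request_id"] for r in disagreements(shadow_log)]
print(rate, to_review)
```

Agreement alone does not say which model is right; the disagreeing requests are the ones to label or inspect first.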

A/B test

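import hashlib

def predict(features, user_id):
    # Built-in hash() is salted per process for strings, so use a stable hash
    digest = hashlib.sha256(str(user_id).encode()).hexdigest()
    if int(digest, 16) % 100 < 10:  # 10% of users go to B
        pred = new_model.predict(features)
        bucket = 'B'
    else:
        pred = old_model.predict(features)
        bucket = 'A'

    log({'bucket': bucket, 'pred': pred})
    return pred

→ Compare outcomes (CTR, conversion) per bucket.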
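Comparing buckets usually means asking whether the difference in rates is larger than chance. One common option is a two-proportion z-test. The sketch below uses only the standard library, and the click counts are made up.

```python
import math

def two_proportion_z(success_a, total_a, success_b, total_b):
    """z statistic and two-sided p-value for the difference between two rates."""
    rate_a = success_a / total_a
    rate_b = success_b / total_b
    pooled = (success_a + success_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical outcome counts: bucket A (old model) vs bucket B (new model).
z, p_value = two_proportion_z(success_a=900, total_a=9000,
                              success_b=130, total_b=1000)
print(f"z={z:.2f}, p={p_value:.4f}")
```

With a 10% treatment bucket, the smaller group dominates the standard error, so the test needs more traffic than an even split would.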

Cost

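# LLM
import openai
from prometheus_client import Counter

# Metric name is illustrative
prom_cost = Counter('llm_cost_usd_total', 'Cumulative LLM spend in USD', ['model'])

r = openai.chat.completions.create(...)
cost = r.usage.total_tokens * 0.00001  # flat illustrative rate; real pricing differs by model

prom_cost.labels(model='gpt-4').inc(cost)

→ Track cost per request and alert when spend approaches the budget.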
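Providers typically bill input and output tokens at different rates, so a flat per-token multiplier is only an approximation. Below is a minimal sketch with separate rates; the prices and the budget are placeholders, not real pricing. In an OpenAI chat completion response the two counts are available as `usage.prompt_tokens` and `usage.completion_tokens`.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Pricing:
    """USD per 1,000 tokens. Values used below are placeholders, not real prices."""
    input_per_1k: float
    output_per_1k: float


def request_cost(prompt_tokens: int, completion_tokens: int, pricing: Pricing) -> float:
    input_cost = (prompt_tokens / 1000) * pricing.input_per_1k
    output_cost = (completion_tokens / 1000) * pricing.output_per_1k
    return input_cost + output_cost


class BudgetTracker:
    """Accumulates spend and reports when a budget is exceeded."""

    def __init__(self, budget_usd: float):
        self.budget_usd = budget_usd
        self.spent_usd = 0.0

    def add(self, cost_usd: float) -> bool:
        """Record a cost; return True if the budget is now exceeded."""
        self.spent_usd += cost_usd
        return self.spent_usd > self.budget_usd


pricing = Pricing(input_per_1k=0.01, output_per_1k=0.03)
cost = request_cost(prompt_tokens=1200, completion_tokens=400, pricing=pricing)
tracker = BudgetTracker(budget_usd=0.05)
over_budget = [tracker.add(cost) for _ in range(3)]
print(round(cost, 4), over_budget)
```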

Prompt change tracking

# LangSmith / Helicone
from langsmith import traceable

@traceable
def chat(message: str, prompt_version: str = 'v3'):
    prompt = PROMPTS[prompt_version]
    return llm.complete(prompt + message)

→ A/B test prompt versions and compare outcomes.

Bias monitoring

from sklearn.metrics import accuracy_score

# Subgroup performance
for group in ['gender', 'race', 'age_bucket']:
    for value in df[group].unique():
        subset = df[df[group] == value]
        acc = accuracy_score(subset.y, subset.pred)
        log({'group': group, 'value': value, 'acc': acc})

# Alert when accuracy differs by more than 5% between subgroups
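The 5% rule can be made explicit by comparing the best and worst subgroup. A stdlib-only sketch, assuming rows are dicts with the hypothetical keys `y` and `pred` plus one column per group attribute:

```python
from collections import defaultdict

def subgroup_accuracy(rows, group_key):
    """Accuracy per distinct value of `group_key`."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        value = row[group_key]
        total[value] += 1
        correct[value] += int(row["y"] == row["pred"])
    return {value: correct[value] / total[value] for value in total}


def accuracy_gap(rows, group_key):
    """Difference between the best and worst subgroup accuracy."""
    scores = subgroup_accuracy(rows, group_key)
    return max(scores.values()) - min(scores.values())


rows = [
    {"age_bucket": "18-30", "y": 1, "pred": 1},
    {"age_bucket": "18-30", "y": 0, "pred": 0},
    {"age_bucket": "18-30", "y": 1, "pred": 1},
    {"age_bucket": "18-30", "y": 0, "pred": 0},
    {"age_bucket": "60+", "y": 1, "pred": 0},
    {"age_bucket": "60+", "y": 0, "pred": 0},
    {"age_bucket": "60+", "y": 1, "pred": 1},
    {"age_bucket": "60+", "y": 1, "pred": 1},
]

gap = accuracy_gap(rows, "age_bucket")
needs_alert = gap > 0.05
print(subgroup_accuracy(rows, "age_bucket"), gap, needs_alert)
```

Small subgroups give noisy accuracy estimates, so a minimum sample size per subgroup is worth adding before alerting on the gap.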

Model card update

## Monitoring (live)

- Last update: 2026-05-09
- Drift: stable (PSI 0.05)
- Latency p99: 78ms
- Error rate: 0.2%
- Accuracy (last 7d): 0.86 (↓0.01 from baseline)

Retrain trigger

Trigger:
- Drift > threshold
- Performance drop > 5%
- Every N days
- New data volume > X

→ Automated retraining pipeline (Airflow / Vertex / SageMaker).
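The trigger list can be collapsed into a single decision function that also reports which conditions fired. This is a sketch under stated assumptions: every threshold and field name below is hypothetical, and the performance drop is read here as a relative drop from baseline.

```python
from dataclasses import dataclass

@dataclass
class MonitoringSnapshot:
    psi: float                  # drift score for the monitored feature(s)
    baseline_accuracy: float
    current_accuracy: float
    days_since_retrain: int
    new_labeled_rows: int


def retrain_reasons(s, psi_threshold=0.2, max_relative_drop=0.05,
                    max_age_days=30, min_new_rows=100_000):
    """Return the trigger conditions that currently hold (empty list = no retrain)."""
    reasons = []
    if s.psi > psi_threshold:
        reasons.append("drift")
    relative_drop = (s.baseline_accuracy - s.current_accuracy) / s.baseline_accuracy
    if relative_drop > max_relative_drop:
        reasons.append("performance_drop")
    if s.days_since_retrain >= max_age_days:
        reasons.append("schedule")
    if s.new_labeled_rows >= min_new_rows:
        reasons.append("new_data")
    return reasons


snapshot = MonitoringSnapshot(psi=0.05, baseline_accuracy=0.87, current_accuracy=0.80,
                              days_since_retrain=12, new_labeled_rows=20_000)
reasons = retrain_reasons(snapshot)
print(reasons)
```

Returning the reasons rather than a bare boolean makes each pipeline run traceable to the condition that started it.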

LLM eval suite

# Promptfoo / LangSmith
tests = [
    {'input': 'What is 2+2?', 'expected': '4'},
    {'input': 'Capital of France?', 'expected': 'Paris'},
]

for t in tests:
    actual = llm.complete(t['input'])
    # match() is your own comparison function, e.g. a normalized containment check
    pass_ = match(actual, t['expected'])
    log({'test': t, 'pass': pass_})

→ Regression suite: run it on every deploy.
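The loop relies on a `match` function that is not defined. Exact string equality is usually too strict for LLM output, so one simple option is a normalized containment check. In the sketch below, `fake_llm` is a stand-in for the real client so the example runs offline.

```python
import re

def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def match(actual: str, expected: str) -> bool:
    """True if the expected answer appears as whole words in the model output."""
    pattern = r"\b" + re.escape(normalize(expected)) + r"\b"
    return re.search(pattern, normalize(actual)) is not None


def fake_llm(prompt: str) -> str:
    # Stand-in for llm.complete(); returns canned answers for the demo.
    canned = {
        "What is 2+2?": "2 + 2 equals 4.",
        "Capital of France?": "The capital is Lyon.",
    }
    return canned[prompt]


tests = [
    {"input": "What is 2+2?", "expected": "4"},
    {"input": "Capital of France?", "expected": "Paris"},
]

results = [match(fake_llm(t["input"]), t["expected"]) for t in tests]
pass_rate = sum(results) / len(results)
print(results, pass_rate)
```

Containment checks suit short factual answers; open-ended outputs generally need rubric-based or model-graded evaluation instead.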

Production debugging

When a bad prediction is found:
1. Input log: is a feature an outlier?
2. Model version: any recent change?
3. Data pipeline: did the data change?
4. Trace with 5W1H (who, what, when, where, why, how)

Privacy

Logs may contain PII.
- Hash / mask before logging
- Retention policy (e.g. delete after 30 days)
- GDPR / user deletion requests
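A minimal sketch of hashing and masking before logging, assuming a record with hypothetical `user_id` and `email` fields. It uses a keyed hash (HMAC) rather than a bare hash, so identifiers cannot be recovered by hashing guesses without the key. The key shown is a placeholder and would come from a secret store.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-secret-from-your-vault"   # placeholder, not a real key

def pseudonymize(value: str) -> str:
    """Stable keyed hash: the same input maps to the same token, so joins still work."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def mask_email(email: str) -> str:
    """Keep the first character and the domain; hide the rest of the local part."""
    local, _, domain = email.partition("@")
    return f"{local[:1]}***@{domain}"


def scrub(record: dict) -> dict:
    """Return a copy of the record that is safer to write to logs."""
    clean = dict(record)
    clean["user_id"] = pseudonymize(record["user_id"])
    clean["email"] = mask_email(record["email"])
    return clean


record = {"user_id": "user-123", "email": "alice@example.com", "prediction": 0.82}
scrubbed = scrub(record)
print(scrubbed)
```

Pseudonymized identifiers still count as personal data under GDPR, so the retention policy and deletion requests apply to these logs as well.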

🤔 Decision criteria

| Task | Recommendation |
|---|---|
| Drift detection | PSI / KS test / Evidently |
| Latency / cost | Prometheus + Grafana |
| Performance lag | Proxy metric |
| Compare new model | Shadow / A/B |
| Bias | Subgroup analysis |
| LLM | Helicone / LangSmith |
| Auto retrain | Pipeline trigger |

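Anti-patterns

  • No monitoring: silent decay.
  • Offline metrics only: the gap to production goes unnoticed.
  • Assuming all is well because no ground truth has arrived: a mistake.
  • No drift threshold: alert noise or missed drift.
  • No subgroup analysis: latent bias.
  • No cost tracking: costs blow up.
  • Manual retraining: retrains arrive late.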

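🤖 Hints for LLM use

  • PSI and the KS test are the standard drift metrics.
  • Shadow deployment and A/B testing are the safe ways to deploy.
  • Proxy metrics are the answer to ground-truth lag.
  • Evidently, Arize, and WhyLabs make up the tooling ecosystem.

🔗 Related documents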