---
id: mlops-model-monitoring
title: ML Monitoring — drift / quality / SLO
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [mlops, monitoring, vibe-coding]
tech_stack: { language: "Python", applicable_to: ["AI", "Backend"] }
applied_in: []
aliases: [ML monitoring, drift detection, data drift, concept drift, model decay, Evidently]
---

# ML Monitoring

> Models decay over time. Watch for **data drift, concept drift, prediction drift, and performance drop**. Common tools: Evidently / Arize / Fiddler / WhyLabs.

## 📖 Core concepts

- Data drift: the input distribution changes.
- Concept drift: the relationship between inputs and outputs changes.
- Prediction drift: the output distribution changes.
- Performance: measured against ground truth, which arrives with a delay.

## 💻 Code patterns

### KS test (data drift)

```python
from scipy.stats import ks_2samp

# train_data / recent_data: DataFrames holding the reference and production samples
ref = train_data['feature_x']
prod = recent_data['feature_x']

stat, pval = ks_2samp(ref, prod)
if pval < 0.05:
    alert(f'feature_x drift! p={pval:.3f}')
```

→ A p-value below the threshold means the two samples are unlikely to share a distribution, which is treated as drift. With very large samples even tiny shifts become significant, so check the KS statistic as well.

### PSI (Population Stability Index)

```python
import numpy as np

def psi(reference, current, bins=10):
    # Bin edges come from the reference sample
    edges = np.linspace(reference.min(), reference.max(), bins + 1)
    # Clip so production values outside the reference range land in the edge bins
    current = np.clip(current, edges[0], edges[-1])
    ref_hist = np.histogram(reference, edges)[0] / len(reference)
    cur_hist = np.histogram(current, edges)[0] / len(current)

    # Avoid log(0) and division by zero
    ref_hist = np.where(ref_hist == 0, 0.0001, ref_hist)
    cur_hist = np.where(cur_hist == 0, 0.0001, cur_hist)

    return np.sum((cur_hist - ref_hist) * np.log(cur_hist / ref_hist))

# Rule of thumb: < 0.1 = stable, 0.1-0.2 = some shift, > 0.2 = significant shift
```

### Evidently (open source)

```python
# Import paths differ between Evidently releases; check the installed version
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, RegressionPreset

# RegressionPreset needs target and prediction columns in both frames
report = Report(metrics=[DataDriftPreset(), RegressionPreset()])
report.run(reference_data=ref, current_data=prod)
report.save_html('drift_report.html')
```

→ Dashboard, drift detection, alerting.

### Arize / WhyLabs (managed)

```python
# Schematic only: the real SDK's client constructor and log() arguments
# vary by version, so check the vendor documentation
import arize

client = arize.Client(api_key=...)
client.log(
    model_id='churn',
    model_version='v3.1',
    prediction_id=pred_id,
    features=feat,
    prediction=pred,
    actual=actual,  # arrives later
)
```

### Concept drift detection

```python
from sklearn.metrics import accuracy_score

# Performance degrading over time signals concept drift
# Rolling-window accuracy
def rolling_accuracy(predictions, actuals, window=1000, step=100):
    return [
        accuracy_score(actuals[i:i + window], predictions[i:i + window])
        # + 1 so the final full window is included
        for i in range(0, len(predictions) - window + 1, step)
    ]

# Plot the result: a downward trend suggests drift
```

### Prediction drift

```python
# Track the output distribution
prod_mean = recent_predictions.mean()
prod_std = recent_predictions.std()  # tracked for dashboards; not used in the check below
ref_mean = train_predictions.mean()
ref_std = train_predictions.std()

# Coarse heuristic: fires only when the mean moves by two reference standard deviations
if abs(prod_mean - ref_mean) > 2 * ref_std:
    alert('prediction drift')
```

### Latency / availability SLO

```python
from prometheus_client import Histogram

# Prometheus metrics
inference_latency = Histogram(
    'inference_latency_seconds',
    'Inference latency',
    ['model'],
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0],
)

with inference_latency.labels(model='churn').time():
    pred = model.predict(features)
```

→ Supports an SLO such as p99 latency < 100 ms (the 0.1 s bucket boundary makes that threshold measurable).

### Ground truth lag

```
Click prediction: ground truth available about 1 second later
Churn (7 days):   ground truth arrives after 7 days
Loan default:     30+ days

→ Real-time performance metrics are not possible for the slow cases. Use a proxy metric.
```

### Proxy metric

```
Click model:
- Direct: actual click rate
- Proxy: dwell time, scroll depth

LLM:
- Direct: human eval
- Proxy: thumbs up / down, regen rate
```

### Outlier detection

```python
from sklearn.ensemble import IsolationForest

iforest = IsolationForest().fit(train_features)

# On every inference
anomaly_score = iforest.decision_function([features])[0]
# decision_function is negative for outliers and positive for inliers
if anomaly_score < 0:
    log.warning('outlier input', features=features)
```

→ Inputs that look unlike the training data trigger a warning.

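Where fitting an IsolationForest is not practical (for example in a lightweight sidecar), a per-feature z-score check against training statistics can serve as a crude first filter. This is a different and simpler technique than the isolation forest above, and it assumes numeric, roughly bell-shaped features checked independently. The names `fit_feature_stats` and `outlier_features` and the threshold of 4 standard deviations are illustrative assumptions, not part of any library.

```python
import statistics

def fit_feature_stats(train_rows):
    """Compute (mean, population std) per feature from training rows (list of dicts)."""
    stats = {}
    for name in train_rows[0]:
        values = [row[name] for row in train_rows]
        stats[name] = (statistics.fmean(values), statistics.pstdev(values))
    return stats

def outlier_features(features, stats, z_threshold=4.0):
    """Return the names of features lying more than z_threshold stds from the training mean."""
    flagged = []
    for name, value in features.items():
        mean, std = stats[name]
        if std == 0:
            # Constant feature in training: any different value is suspicious
            if value != mean:
                flagged.append(name)
            continue
        if abs(value - mean) / std > z_threshold:
            flagged.append(name)
    return flagged

# Synthetic training data, for illustration only
train_rows = [{'age': 30 + i % 10, 'income': 50_000 + 1_000 * (i % 7)} for i in range(200)]
stats = fit_feature_stats(train_rows)

print(outlier_features({'age': 34, 'income': 52_000}, stats))   # → []
print(outlier_features({'age': 34, 'income': 900_000}, stats))  # → ['income']
```

Returning the flagged feature names rather than a single score makes the warning log easier to act on, at the cost of missing outliers that are only unusual in combination.
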
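### Feedback loop

```python
# User correction
@app.post('/feedback')
def feedback(prediction_id: str, correct: bool):
    db.update(prediction_id, actual=correct)

    # Retrain trigger
    if recent_corrections.error_rate > 0.1:
        trigger_retrain()
```

### Online evaluation (LLM)

```python
# Helicone / LangSmith / Promptfoo
# `trace` stands in for the tracing decorator of whichever tool is used
@trace
def llm_call(prompt):
    return llm.complete(prompt)

# Automatic: latency, cost, error
# Manual: user thumbs up/down
```

### Shadow deployment

```python
import asyncio

_shadow_tasks = set()  # hold references so pending tasks are not garbage-collected

def run_shadow(features):
    # The new model runs off the request path; its output is only logged
    log_shadow(features, new_model.predict(features))

# Production traffic goes to both models: the old one serves, the new one shadows
@app.post('/predict')
async def predict(features):
    task = asyncio.create_task(asyncio.to_thread(run_shadow, features))
    _shadow_tasks.add(task)
    task.add_done_callback(_shadow_tasks.discard)
    return await asyncio.to_thread(old_model.predict, features)
```

→ The new model's predictions are never served, only logged, so the two models can be compared on identical traffic.

### A/B test

```python
import hashlib

def predict(features, user_id):
    # Built-in hash() of a str varies between processes, so use a stable digest
    digest = hashlib.sha256(str(user_id).encode()).hexdigest()
    if int(digest, 16) % 100 < 10:  # 10% of users go to B
        pred = new_model.predict(features)
        bucket = 'B'
    else:
        pred = old_model.predict(features)
        bucket = 'A'

    log({'bucket': bucket, 'pred': pred})
    return pred
```

→ Compare outcomes (CTR, conversion) per bucket.

### Cost

```python
# LLM
import openai
from prometheus_client import Counter

# Illustrative counter definition; the metric name is an assumption
prom_cost = Counter('llm_cost_usd_total', 'Cumulative LLM spend in USD', ['model'])

r = openai.chat.completions.create(...)
# Placeholder flat rate; real pricing differs per model and for input vs output tokens
cost = r.usage.total_tokens * 0.00001
prom_cost.labels(model='gpt-4').inc(cost)
```

→ Track cost per request and alert on budget.

### Prompt change tracking

```python
# LangSmith / Helicone
from langsmith import traceable

@traceable
def chat(message: str, prompt_version: str = 'v3'):
    prompt = PROMPTS[prompt_version]
    return llm.complete(prompt + message)
```

→ A/B test prompt versions and compare outcomes.
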
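Comparing prompt versions on outcomes requires joining each call's version tag with feedback that arrives later. The following is a minimal in-memory sketch under the assumption that feedback is a thumbs up/down keyed by a call id. `PromptOutcomeTracker` and its methods are hypothetical names; a real setup would keep this state in the tracing tool or a database.

```python
from collections import defaultdict

class PromptOutcomeTracker:
    """Joins prompt versions with later thumbs up/down feedback (illustrative only)."""

    def __init__(self):
        self._version_by_call = {}
        self._votes = defaultdict(lambda: {'up': 0, 'down': 0})

    def record_call(self, call_id, prompt_version):
        # Remember which prompt version produced this call
        self._version_by_call[call_id] = prompt_version

    def record_feedback(self, call_id, thumbs_up):
        version = self._version_by_call.get(call_id)
        if version is None:
            return False  # feedback for an unknown call is ignored
        self._votes[version]['up' if thumbs_up else 'down'] += 1
        return True

    def approval_rate(self, prompt_version):
        # None means no feedback has been recorded for this version yet
        votes = self._votes.get(prompt_version)
        if not votes:
            return None
        return votes['up'] / (votes['up'] + votes['down'])

tracker = PromptOutcomeTracker()
tracker.record_call('c1', 'v2')
tracker.record_call('c2', 'v3')
tracker.record_call('c3', 'v3')
tracker.record_feedback('c1', thumbs_up=False)
tracker.record_feedback('c2', thumbs_up=True)
tracker.record_feedback('c3', thumbs_up=True)

print(tracker.approval_rate('v2'))  # → 0.0
print(tracker.approval_rate('v3'))  # → 1.0
```

With only a handful of votes per version the rates are noisy, so a comparison like this is best read alongside the vote counts.
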
### Bias monitoring

```python
from sklearn.metrics import accuracy_score

# Subgroup performance
for group in ['gender', 'race', 'age_bucket']:
    for value in df[group].unique():
        subset = df[df[group] == value]
        acc = accuracy_score(subset.y, subset.pred)
        log({'group': group, 'value': value, 'acc': acc})

# Alert when the accuracy gap between subgroups exceeds 5%
```

### Model card update

```markdown
## Monitoring (live)
- Last update: 2026-05-09
- Drift: stable (PSI 0.05)
- Latency p99: 78ms
- Error rate: 0.2%
- Accuracy (last 7d): 0.86 (↓0.01 from baseline)
```

### Retrain trigger

```
Trigger:
- Drift > threshold
- Performance drop > 5%
- Every N days
- New data volume > X

→ Automated retrain pipeline (Airflow / Vertex / SageMaker).
```

### LLM eval suite

```python
# Promptfoo / LangSmith
tests = [
    {'input': 'What is 2+2?', 'expected': '4'},
    {'input': 'Capital of France?', 'expected': 'Paris'},
]

for t in tests:
    actual = llm.complete(t['input'])
    # `match` is a placeholder for whatever comparison the suite uses
    pass_ = match(actual, t['expected'])
    log({'test': t, 'pass': pass_})
```

→ Regression suite, run on every deploy.

### Production debugging

```
When a bad prediction is found:
1. Input log: is a feature an outlier?
2. Model version: was there a recent change?
3. Data pipeline: did the data change?
4. 5W1H trace
```

### Privacy

```
Logs may contain PII.
- Hash / mask before logging
- Retention policy (delete after 30 days)
- GDPR / user deletion requests
```

## 🤔 Decision criteria

| Task | Recommendation |
|---|---|
| Drift detection | PSI / KS test / Evidently |
| Latency / cost | Prometheus + Grafana |
| Performance lag | Proxy metric |
| Compare new model | Shadow / A/B |
| Bias | Subgroup analysis |
| LLM | Helicone / LangSmith |
| Auto retrain | Pipeline trigger |

## ❌ Anti-patterns

- **No monitoring**: silent decay.
- **Offline metrics only**: the gap to production goes unnoticed.
- **Assuming all is well because ground truth has not arrived**: a wrong assumption.
- **No drift threshold**: alert noise or missed drift.
- **No subgroup analysis**: latent bias.
- **No cost tracking**: costs blow up.
- **Manual retraining**: retrains happen late.

## 🤖 LLM usage hints

- PSI / KS are the standard drift metrics.
- Shadow / A/B deployments are the safe way to roll out.
- Proxy metrics are the answer to ground truth lag.
- Evidently / Arize / WhyLabs make up the ecosystem.

## 🔗 Related documents

- [[MLOps_Model_Registry]]
- [[AI_LLM_Eval_Patterns]]
- [[Observability_RED_USE_Metrics]]