| id | title | category | status | canonical_id | aliases | duplicate_of | source_trust_level | confidence_score | verification_status | tags | raw_sources | last_reinforced | github_commit | tech_stack |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| wiki-2026-0508-concept-drift | Concept Drift | 10_Wiki/Topics | verified | self | | none | A | 0.93 | applied | | | 2026-05-10 | pending | |
Concept Drift
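One-line summary
"Yesterday's right answer is today's wrong answer." The data distribution a model saw at training time deviates over time. It is the silent killer of ML: nothing crashes, predictions just get worse. Monitoring and retraining are essential. The modern version: an LLM's knowledge cutoff is the same problem.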
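Key points
Four types
- Sudden / Abrupt: the concept changes all at once (COVID).
- Gradual: slow shift in which old and new concepts coexist for a while (inflation, language).
- Seasonal / Recurring: cyclic, an earlier concept comes back.
- Incremental: the concept moves through small intermediate steps during a transition period.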
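The four types above can be made concrete with a tiny synthetic generator. This is a minimal sketch, not a standard benchmark: the function name `concept_at` and the parameters `t0`, `width`, and `period` are illustrative assumptions, and the "concept" is reduced to a single mean that moves from 0.0 (old) to 1.0 (new).

```python
import random

def concept_at(t, kind, rng, t0=500, width=200, period=250):
    """Return the active concept mean (0.0 = old, 1.0 = new) at step t."""
    # Fraction of the transition completed, clipped to [0, 1].
    progress = min(1.0, max(0.0, (t - t0) / width))
    if kind == "sudden":
        # Hard switch at t0.
        return 1.0 if t >= t0 else 0.0
    if kind == "gradual":
        # Old and new concepts coexist; the new one is picked with rising probability.
        return 1.0 if rng.random() < progress else 0.0
    if kind == "incremental":
        # The concept itself slides through intermediate values.
        return progress
    if kind == "recurring":
        # Old and new concepts alternate every `period` steps.
        return float((t // period) % 2)
    raise ValueError(f"unknown drift kind: {kind}")

rng = random.Random(0)
series = {k: [concept_at(t, k, rng) for t in range(1000)]
          for k in ("sudden", "gradual", "incremental", "recurring")}
```

Feeding each series into a detector is a cheap way to see which drift types it reacts to and how late.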
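Distinctions
- Concept drift: P(y|X) changes. The same input now maps to a different output.
- Data drift / Covariate shift: P(X) changes. The input distribution moves.
- Label drift: P(y) changes. The target frequency moves.
- Prior probability shift: another name for label drift, usually stated as P(y) changing while P(X|y) stays fixed.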
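One way to keep these terms apart is to look at the two factorizations of the joint distribution at time t. The subscript notation below is an assumption added for illustration; the source only names the individual terms.

```latex
P_t(X, y) \;=\; P_t(y \mid X)\, P_t(X) \;=\; P_t(X \mid y)\, P_t(y)

\text{Concept drift:}\quad P_t(y \mid X) \neq P_{t+1}(y \mid X)
\text{Covariate shift:}\quad P_t(X) \neq P_{t+1}(X), \qquad P_t(y \mid X) = P_{t+1}(y \mid X)
\text{Label / prior shift:}\quad P_t(y) \neq P_{t+1}(y), \qquad P_t(X \mid y) = P_{t+1}(X \mid y)
```

In practice several factors usually move together, so the labels describe which factor dominates rather than mutually exclusive cases.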
Detection
Statistical
- KS test (Kolmogorov-Smirnov): univariate, continuous features.
- Chi-square: categorical features.
- PSI (Population Stability Index): the industry standard.
- KL / JS divergence: distance between distributions.
- Wasserstein distance.
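The divergence-based measures in this list are short enough to write out directly. The sketch below uses only the standard library and assumes the inputs are already binned into probability vectors over the same bins (for KL and JS) or are raw samples of equal size (for the 1-D Wasserstein distance); the function names are illustrative, not from any library.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats. Requires q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, at most ln(2), finite even with empty bins."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def wasserstein_1d(a, b):
    """Wasserstein-1 distance between two equal-sized 1-D samples."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal sample sizes")
    # With equal sizes, the optimal transport plan pairs the sorted values.
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

reference_bins = [0.25, 0.25, 0.25, 0.25]
current_bins = [0.10, 0.20, 0.30, 0.40]
print(js_divergence(reference_bins, current_bins))
print(wasserstein_1d([0, 1, 2], [1, 2, 3]))
```

JS is often the more convenient choice for monitoring because it stays finite when a bin is empty in one window, whereas KL does not.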
ML-based
- Domain classifier: how separable the training data is from current data.
- DDM (Drift Detection Method).
- EDDM (Early DDM).
- Page-Hinkley.
- ADWIN (Adaptive Windowing).
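DDM from the list above watches the model's online error rate and flags drift when it rises well above its best observed level. The class below is a simplified sketch under stated assumptions: the input is a stream of 0/1 prediction errors, the 2-sigma warning and 3-sigma drift levels follow the commonly cited defaults, and the name `DDM` plus the `min_samples` guard are illustrative choices rather than a library API.

```python
import math

class DDM:
    """Drift Detection Method sketch over a stream of 0/1 prediction errors."""

    def __init__(self, min_samples=30, warn_level=2.0, drift_level=3.0):
        self.min_samples = min_samples
        self.warn_level = warn_level
        self.drift_level = drift_level
        self.reset()

    def reset(self):
        self.n = 0
        self.p = 0.0            # running error rate
        self.p_min = math.inf   # error rate at the best point seen so far
        self.s_min = math.inf   # its standard deviation

    def update(self, error):
        """error: 1 for a misclassification, 0 otherwise. Returns a state string."""
        self.n += 1
        self.p += (error - self.p) / self.n
        s = math.sqrt(self.p * (1 - self.p) / self.n)
        if self.n < self.min_samples:
            return "stable"
        # Remember the lowest p + s observed since the last reset.
        if self.p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = self.p, s
        if self.p + s > self.p_min + self.drift_level * self.s_min:
            self.reset()
            return "drift"
        if self.p + s > self.p_min + self.warn_level * self.s_min:
            return "warning"
        return "stable"

ddm = DDM()
# 10% error rate for 200 steps, then 50% error rate.
states = [ddm.update(1 if i % 10 == 0 else 0) for i in range(200)]
states += [ddm.update(1 if i % 2 == 0 else 0) for i in range(200, 400)]
print(states.index("drift"))
```

The "warning" state is typically used to start buffering recent samples so that a replacement model can be trained as soon as "drift" fires.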
Performance-based
- Monitor accuracy / loss directly.
- Limited by label lag: ground-truth labels often arrive long after the prediction.
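Performance monitoring under label lag comes down to joining late labels with stored predictions. A minimal sketch, assuming each prediction carries a unique id and labels arrive later in any order; the class and method names are hypothetical.

```python
from collections import deque

class DelayedLabelMonitor:
    """Rolling accuracy over the most recent predictions whose labels have arrived."""

    def __init__(self, window=500):
        self.pending = {}                     # prediction_id -> predicted label
        self.outcomes = deque(maxlen=window)  # 1 = correct, 0 = wrong

    def log_prediction(self, prediction_id, y_pred):
        self.pending[prediction_id] = y_pred

    def log_label(self, prediction_id, y_true):
        """Join a late-arriving label with its stored prediction; unknown ids are ignored."""
        if prediction_id in self.pending:
            y_pred = self.pending.pop(prediction_id)
            self.outcomes.append(int(y_pred == y_true))

    def accuracy(self):
        """Rolling accuracy, or None until at least one label has arrived."""
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else None

    def unlabeled(self):
        """Number of predictions still waiting for ground truth."""
        return len(self.pending)

monitor = DelayedLabelMonitor(window=3)
for i, pred in enumerate([1, 0, 1, 1]):
    monitor.log_prediction(i, pred)
monitor.log_label(0, 1)
monitor.log_label(1, 1)
print(monitor.accuracy(), monitor.unlabeled())
```

A growing `unlabeled()` backlog is itself a signal: the accuracy figure describes older traffic than the model is currently serving, which is why input-side drift tests are run alongside it.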
Adaptation
- Periodic retrain: on a cron schedule.
- Trigger-based: retrain when drift is detected.
- Online learning: update on streaming data.
- Ensemble with weight update.
- Sliding window.
- Active learning: request labels for the most critical samples.
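"Ensemble with weight update" is the one adaptation strategy in this list without a pattern further down, so here is a minimal sketch in the spirit of the weighted-majority scheme. It assumes binary 0/1 votes from already-trained members; the class name and the penalty `beta` are illustrative.

```python
class WeightedEnsemble:
    """Binary-vote ensemble whose member weights shrink after each mistake."""

    def __init__(self, n_members, beta=0.5):
        self.weights = [1.0] * n_members
        self.beta = beta  # multiplicative penalty in (0, 1)

    def predict(self, votes):
        """votes: one 0/1 prediction per member. Returns the weighted majority."""
        weight_for_one = sum(w for w, v in zip(self.weights, votes) if v == 1)
        return 1 if weight_for_one >= sum(self.weights) / 2 else 0

    def update(self, votes, y_true):
        """Penalize members that voted wrong, then renormalize to sum to 1."""
        self.weights = [w * (self.beta if v != y_true else 1.0)
                        for w, v in zip(self.weights, votes)]
        total = sum(self.weights)
        self.weights = [w / total for w in self.weights]

ensemble = WeightedEnsemble(n_members=3)
# Before the drift, member 0 is wrong and members 1 and 2 are right.
for _ in range(5):
    ensemble.update([0, 1, 1], y_true=1)
# After the drift the roles flip; the weights follow within a few updates.
for _ in range(8):
    ensemble.update([1, 0, 0], y_true=1)
print(ensemble.predict([1, 0, 0]))
```

The design choice is that no member is ever retrained here; adaptation comes purely from shifting trust toward whichever member fits the current concept.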
Modern MLOps stack
- Evidently: drift reports.
- Alibi Detect (Seldon): detection algorithms.
- whylogs / WhyLabs: data profiling.
- NannyML: model performance estimation.
- Arize / Fiddler / Aporia: observability platforms.
Drift in LLMs
- Knowledge cutoff: the model does not know facts from after its training date.
- Skill drift: fine-tuning erodes base capabilities (Catastrophic-Forgetting).
- User distribution drift: new use cases emerge.
- Prompt drift: prompt templates go stale.
Applications
- Fraud detection: fraud patterns evolve.
- Recommendation: trends shift.
- Pricing: market dynamics change.
- Spam: evasion techniques adapt.
- Medical: disease patterns change.
- Demand forecasting: seasonality plus black swan events.
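💻 Patterns
Evidently drift report
# These import paths match the Evidently 0.4.x releases; later releases
# reorganized the package, so check the installed version.
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, TargetDriftPreset

reference = X_train.copy()       # data the model was trained on
current = X_prod_recent.copy()   # recent production data, same columns

report = Report(metrics=[
    DataDriftPreset(),     # per-feature drift tests
    TargetDriftPreset(),   # expects a target and/or prediction column in both frames
])
report.run(reference_data=reference, current_data=current)
report.save_html('drift_report.html')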
PSI (Population Stability Index)
import numpy as np

def psi(reference, current, n_bins=10):
    """PSI for one numeric feature. < 0.1: stable. 0.1-0.25: moderate drift. > 0.25: significant."""
    # Bin edges are the reference quantiles; the outer bins are opened
    # so that out-of-range production values are still counted.
    breaks = np.percentile(reference, np.linspace(0, 100, n_bins + 1))
    breaks[0] = -np.inf
    breaks[-1] = np.inf
    ref_pct = np.histogram(reference, breaks)[0] / len(reference)
    cur_pct = np.histogram(current, breaks)[0] / len(current)
    # Avoid log(0) and division by zero for empty bins
    ref_pct = np.where(ref_pct == 0, 0.0001, ref_pct)
    cur_pct = np.where(cur_pct == 0, 0.0001, cur_pct)
    return np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct))
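The return line above is the standard PSI sum written in NumPy. Spelled out, with r_i and c_i as the reference and current fractions in bin i (symbols chosen here for illustration) and B as the number of bins:

```latex
\mathrm{PSI} \;=\; \sum_{i=1}^{B} (c_i - r_i)\,\ln\frac{c_i}{r_i}

% Worked two-bin example: r = (0.5, 0.5), c = (0.6, 0.4)
\mathrm{PSI} = (0.6 - 0.5)\ln\frac{0.6}{0.5} + (0.4 - 0.5)\ln\frac{0.4}{0.5}
             \approx 0.0182 + 0.0223 = 0.0405
```

Both terms are non-negative because the difference and the log ratio always share a sign. A ten-point shift of mass between two bins therefore lands at about 0.04, inside the "stable" band quoted in the docstring.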
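KS test
from scipy.stats import ks_2samp

def detect_drift_ks(reference, current, threshold=0.05):
    # Two-sample KS test on one continuous feature.
    # `threshold` is the significance level: a p-value below it rejects
    # the hypothesis that both samples come from the same distribution.
    stat, p_value = ks_2samp(reference, current)
    drifted = p_value < threshold
    return {'statistic': stat, 'p_value': p_value, 'drifted': drifted}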
Domain classifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
import numpy as np

def domain_classifier_drift(X_ref, X_cur):
    """How separable the reference data is from the current data."""
    # Label reference rows 0 and current rows 1, then try to tell them apart.
    X = np.vstack([X_ref, X_cur])
    y = np.array([0]*len(X_ref) + [1]*len(X_cur))
    auc = cross_val_score(LogisticRegression(max_iter=1000), X, y, scoring='roc_auc', cv=5).mean()
    # AUC near 0.5: no drift. 0.7+: significant.
    return {'auc': auc, 'drifted': auc > 0.7}
Page-Hinkley (online)
class PageHinkley:
    """Streaming drift detection for an upward shift in the mean of x."""
    def __init__(self, delta=0.005, threshold=50):
        self.delta = delta          # tolerated deviation per step
        self.threshold = threshold  # alarm level for the test statistic
        self.cumsum = 0
        self.min_cumsum = 0
        self.n = 0
        self.mean = 0

    def update(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n        # running mean
        self.cumsum += x - self.mean - self.delta    # cumulative deviation
        self.min_cumsum = min(self.min_cumsum, self.cumsum)
        ph = self.cumsum - self.min_cumsum
        return ph > self.threshold  # True when drift is detected
Sliding window retrain
class SlidingWindowModel:
    def __init__(self, base_model, window_size=10000):
        self.base = base_model
        self.window_size = window_size
        self.X_window = []
        self.y_window = []

    def add(self, X, y):
        # Append the new batch, then keep only the newest window_size rows.
        self.X_window.extend(X)
        self.y_window.extend(y)
        if len(self.X_window) > self.window_size:
            self.X_window = self.X_window[-self.window_size:]
            self.y_window = self.y_window[-self.window_size:]

    def retrain(self):
        # Refit from scratch on the current window only.
        self.base.fit(self.X_window, self.y_window)
Online learning (river)
from river import linear_model, preprocessing, metrics

model = (preprocessing.StandardScaler() | linear_model.LogisticRegression())
metric = metrics.ROCAUC()

# `stream` is any iterable of (feature dict, bool label) pairs.
for x, y in stream:
    y_pred = model.predict_proba_one(x)[True]  # predict first (test-then-train)
    metric.update(y, y_pred)
    model.learn_one(x, y)                      # then learn from the same sample

print(metric)  # streaming AUC
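Trigger-based retrain pipeline
# evaluate, train, combined, shadow_test, promote_to_prod, log and
# ACCEPTABLE_PERFORMANCE are placeholders for project-specific code.
def retrain_if_drift(model, monitoring_data, reference):
    # psi() above is univariate: pass one feature column, or aggregate per column.
    drift_score = psi(reference['features'], monitoring_data['features'])
    perf = evaluate(model, monitoring_data)
    if drift_score > 0.25 or perf < ACCEPTABLE_PERFORMANCE:
        log(f'Retrain triggered: drift={drift_score:.3f}, perf={perf:.3f}')
        new_model = train(combined(reference, monitoring_data))
        # Promote only if the candidate beats the current model.
        if shadow_test(new_model, model, validation_set) > 0:
            promote_to_prod(new_model)
            return new_model
    return model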
Shadow deployment
def shadow_serve(request):
    """Run inference on both the prod model and the candidate model."""
    prod_pred = prod_model.predict(request)
    candidate_pred = candidate_model.predict(request)
    log_for_comparison(request, prod_pred, candidate_pred)
    return prod_pred  # the user only ever sees the prod prediction
LLM knowledge cutoff handling
def llm_with_freshness(query):
    # 1. Detect whether the query needs information newer than the cutoff
    if needs_recent_info(query):
        # 2. RAG over documents published after the cutoff
        context = search_recent(query, after=cutoff_date)
        return llm.generate(query, context=context)
    return llm.generate(query)
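The pattern above leaves `needs_recent_info` undefined. One cheap way to fill it in is a keyword-and-year heuristic. This is a hypothetical sketch, not part of any library: the word list is an assumption, and unlike the call above it takes the cutoff date as an explicit second argument.

```python
import re
from datetime import date

# Hypothetical list of words that hint at a need for up-to-date information.
RECENCY_PATTERN = re.compile(
    r"\b(latest|today|current|currently|recent|recently|newest)\b")
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")

def needs_recent_info(query, cutoff_date):
    """True if the query uses a recency word or names a year after the cutoff."""
    text = query.lower()
    if RECENCY_PATTERN.search(text):
        return True
    years = [int(y) for y in YEAR_PATTERN.findall(text)]
    return any(year > cutoff_date.year for year in years)

cutoff = date(2024, 6, 1)
print(needs_recent_info("What is the latest stable release?", cutoff))
print(needs_recent_info("Explain concurrent hash maps", cutoff))
```

A heuristic like this misses implicit freshness needs (prices, office holders, version numbers), so a small classifier or always-on retrieval may be preferable when stale answers are costly.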
🤔 Decision criteria
| Situation | Approach |
|---|---|
| Static domain | Periodic retrain (monthly) |
| Fast-moving | Online + drift trigger |
| High-stakes | Shadow + canary |
| Streaming | Online learning (river) |
| Univariate | KS / PSI |
| Multivariate | Domain classifier |
| LLM | RAG over recent documents |
Default: PSI + KS + performance monitoring. Retrain when a drift score crosses its trigger threshold.
🔗 Graph
- Parent: MLOps · Distribution-Shift
- Variants: Data-Drift · Concept-Drift
- Detection: PSI · KL-Divergence
- Tool: Evidently
- Adjacent: Catastrophic-Forgetting · Continual-Learning · Bias vs Variance Trade-off · Antifragility
🤖 LLM usage
When to use: production ML monitoring, drift detection, retrain trigger design, LLM RAG freshness.
When not to use: closed systems with no exposure to a real-world distribution.
❌ Anti-patterns
- No monitoring: silent decay.
- Performance metrics only: label lag hides the problem.
- Single threshold: false alarms.
- Retraining without testing: risk of shipping a worse model.
- No baseline / reference: drift cannot be measured.
- Ignoring the LLM knowledge cutoff: stale answers.
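The "single threshold" anti-pattern is usually addressed with two levels plus a persistence requirement, so that one noisy window does not trigger a retrain. A minimal sketch, assuming a PSI-like score where higher means more drift; the 0.1 and 0.25 defaults reuse the PSI bands quoted earlier, while the class name and `patience` are illustrative.

```python
class DebouncedAlarm:
    """Return 'alert' only after `patience` consecutive scores above the alert level."""

    def __init__(self, warn=0.1, alert=0.25, patience=3):
        self.warn = warn
        self.alert = alert
        self.patience = patience
        self.streak = 0  # consecutive windows above the alert level

    def update(self, score):
        if score > self.alert:
            self.streak += 1
        else:
            self.streak = 0
        if self.streak >= self.patience:
            return "alert"
        if score > self.warn:
            return "warn"
        return "ok"

alarm = DebouncedAlarm()
history = [alarm.update(s) for s in [0.05, 0.30, 0.05, 0.30, 0.30, 0.30]]
print(history)
```

The isolated spike in the second window only produces a warning; the alert fires once the score has stayed high for three windows in a row.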
🧪 Verification / duplicates
- Verified (Evidently docs, Webb concept drift survey, Gama drift detection).
- Trust level A.
- Related: Catastrophic-Forgetting · Continual-Learning · Antifragility · Causal-Inference · Benchmarks.
🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: types + detection methods + PSI / KS / domain classifier / Page-Hinkley / online code |