[G1-Sync] Manual knowledge update
---
id: ai-continuous-learning-system
title: AI Continuous Learning - feedback loop / RLAIF
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [ai, continuous, vibe-coding]
tech_stack: { language: "Python / TS", applicable_to: ["AI"] }
applied_in: []
aliases: [continuous learning, feedback loop, RLAIF, online learning, model drift, A/B test, golden set]
---

# AI Continuous Learning

> Production models go stale. The remedy is a loop: **feedback collection → eval → fine-tune / prompt update**. RLAIF (AI feedback) is the modern variant of that loop.

## 📖 Key concepts

- Production traffic is the dataset.
- Collect feedback (thumbs up/down, clicks).
- Detect drift.
- A/B test every change.

## 💻 Code patterns

### Feedback collection

```ts
// Metadata logged for every LLM response
const responseId = crypto.randomUUID();
log({ responseId, query, response, model, latency, cost });

// User feedback, joined to the response by responseId
// (assumes a JSON body parser such as express.json() is installed)
app.post('/feedback', (req, res) => {
  log({ responseId: req.body.id, rating: req.body.rating });
  res.sendStatus(204); // respond so the request does not hang
});
```

### Implicit feedback

```
- Click (helpful).
- Dwell time (engaged).
- Re-query (unhelpful).
- Conversation continuation.

→ Far more plentiful than explicit feedback (thumbs).
```

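One way to make these signals usable is to fold them into a single weak label per response. The sketch below is only an illustration: the `Interaction` fields, the weights, and the 10-second dwell threshold are assumptions, not values taken from this note.

```python
from dataclasses import dataclass


@dataclass
class Interaction:
    clicked: bool          # user clicked a link or action in the response
    dwell_seconds: float   # time spent on the response
    requeried: bool        # user rephrased the same question right after
    continued: bool        # user continued the conversation


def implicit_score(i: Interaction) -> float:
    """Combine implicit signals into a rough score in [-1, 1] (assumed weights)."""
    score = 0.0
    if i.clicked:
        score += 0.4
    if i.dwell_seconds >= 10:   # assumed "engaged" threshold
        score += 0.3
    if i.continued:
        score += 0.3
    if i.requeried:
        score -= 1.0            # a re-query is treated as the strongest negative
    return max(-1.0, min(1.0, score))
```

Scores like this are noisy, so they are better suited to ranking candidates for review than to direct use as training labels.
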
### Golden set evolution

```
1. Initial 50 cases (written manually).
2. Bad cases from production → add.
3. Ambiguous cases from production → expert review.
4. Refresh every month and at every release.

→ The golden set grows over time.
```

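Step 2 above can be sketched as an append-only JSONL file with duplicate detection. The record fields (`id`, `query`, `expected`, `source`) and the helper names are hypothetical; adapt them to whatever schema the eval harness reads.

```python
import hashlib
import json
from pathlib import Path


def case_id(query: str) -> str:
    """Stable id from the normalized query, used to skip duplicates."""
    normalized = " ".join(query.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]


def add_case(path: Path, query: str, expected: str, source: str) -> bool:
    """Append a case to the JSONL golden set; return False if already present."""
    existing = set()
    if path.exists():
        with path.open(encoding="utf-8") as f:
            existing = {json.loads(line)["id"] for line in f if line.strip()}
    cid = case_id(query)
    if cid in existing:
        return False
    record = {"id": cid, "query": query, "expected": expected, "source": source}
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return True
```

Recording `source` (for example "manual" or "production-bad-case") makes it possible to audit later how the set grew.
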
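### Drift detection

```python
from scipy.stats import ks_2samp

# Compare the embedding distribution of recent traffic against a reference.
ref_embeds = train_embeddings          # reference (older) embeddings, shape (n, d)
prod_embeds = recent_query_embeddings  # recent production queries, shape (m, d)

# Two-sample KS test (PSI is an alternative).
# Only the first embedding dimension is tested here; in practice test several
# dimensions or a 1-D projection and correct for multiple comparisons.
stat, p = ks_2samp(ref_embeds[:, 0], prod_embeds[:, 0])
if p < 0.05:
    alert('drift detected')
```
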
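PSI, the alternative named above, can be computed without extra dependencies. This is a minimal sketch assuming equal-width bins taken from the reference sample's range (quantile bins are also common); the function name and defaults are illustrative.

```python
import math


def psi(reference, production, bins=10, eps=1e-6):
    """Population Stability Index between two 1-D samples.

    Bin edges come from the reference sample's range; eps avoids log(0)
    when a bin is empty in one of the samples.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0

    def fractions(sample):
        counts = [0] * bins
        for x in sample:
            idx = int((x - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        return [max(c / len(sample), eps) for c in counts]

    ref, prod = fractions(reference), fractions(production)
    return sum((p - r) * math.log(p / r) for r, p in zip(ref, prod))
```

A common rule of thumb reads PSI below 0.1 as stable and above 0.25 as a significant shift, but the thresholds should be calibrated against the system's own history.
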
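### Performance drift

```python
import numpy as np

# daily_up / daily_total: NumPy arrays of per-day thumbs-up and total ratings.
satisfaction = daily_up / daily_total   # daily user satisfaction rate
recent = satisfaction[-30:]             # last 30 days
slope = np.polyfit(np.arange(len(recent)), recent, 1)[0]  # linear trend

if slope < 0:  # in practice, require the drop to exceed a noise threshold
    alert('quality declining')
```
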
### A/B test (model / prompt)

```python
import hashlib

def get_response(query, user):
    # Stable bucketing: the built-in hash() is salted per process for strings,
    # so use a real digest to keep each user in the same bucket.
    digest = hashlib.sha256(str(user.id).encode()).hexdigest()
    if int(digest, 16) % 100 < 10:  # 10% of users get B
        return new_model.generate(query), 'B'
    return current_model.generate(query), 'A'

# Log per request:
#   bucket: A | B
#   metric: clicked, dwell, ...

# After 1 week (example):
#   A: 0.65 satisfaction
#   B: 0.70
#   B wins → roll out, provided the difference is statistically significant.
```

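Whether a gap like 0.65 vs 0.70 is real depends on sample size. A two-proportion z-test is one simple check; the sketch below uses only the standard library, and the function name and the sample counts in the usage note are illustrative.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for the difference between two rates."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value
```

With 1,000 rated responses in A and 100 in B, the same 0.65 vs 0.70 gap is not significant at the 0.05 level; with ten times the traffic it is. A 10% bucket therefore needs enough volume before a rollout decision.
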
### RLHF / DPO update

```
1. Collect production conversations.
2. An annotator (or LLM-as-judge) picks the preferred response.
3. Train a new model with DPO.
4. A/B test.
5. Replace the current model.

→ Every month or quarter.
```

→ [[AI_RLHF_DPO_Basics]].

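Between steps 2 and 3 the judgments have to become training records. The sketch below assumes the `prompt` / `chosen` / `rejected` layout that preference-tuning tools commonly expect; the input keys (`response_a`, `response_b`, `preferred`) are hypothetical, and the exact schema should be checked against the trainer in use.

```python
import json


def to_preference_pairs(judgments):
    """Turn judged response pairs into prompt/chosen/rejected records.

    Each judgment is a dict with keys: prompt, response_a, response_b,
    preferred ("a", "b", or "tie"). Ties are dropped because they carry
    no preference signal.
    """
    pairs = []
    for j in judgments:
        if j["preferred"] == "a":
            chosen, rejected = j["response_a"], j["response_b"]
        elif j["preferred"] == "b":
            chosen, rejected = j["response_b"], j["response_a"]
        else:
            continue
        pairs.append({"prompt": j["prompt"], "chosen": chosen, "rejected": rejected})
    return pairs


def write_jsonl(pairs, path):
    """Write one JSON record per line."""
    with open(path, "w", encoding="utf-8") as f:
        for pair in pairs:
            f.write(json.dumps(pair, ensure_ascii=False) + "\n")
```
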
### RLAIF (AI feedback)

```
Human feedback is expensive.
- An LLM (judge) picks the preferred response.
- Cheap and scalable.
- Quality is lower than human labels, but "good enough".

→ In the style of Anthropic's Constitutional AI.
```

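LLM judges are known to favor a response based on its position, so a cheap safeguard is to ask twice with the order swapped and keep only consistent verdicts. In this sketch `judge` is a hypothetical callable that would wrap an LLM call; it is not a library API.

```python
from typing import Callable

# judge(prompt, first, second) returns "first" or "second".
# Hypothetical signature: in practice this wraps a call to the judge model.
Judge = Callable[[str, str, str], str]


def ai_preference(judge: Judge, prompt: str, a: str, b: str) -> str:
    """Ask the judge twice with the order swapped; keep only consistent verdicts."""
    first_pass = judge(prompt, a, b)    # a shown first
    second_pass = judge(prompt, b, a)   # b shown first
    if first_pass == "first" and second_pass == "second":
        return "a"
    if first_pass == "second" and second_pass == "first":
        return "b"
    return "tie"  # verdict flipped with position, so treat it as unreliable
```

This doubles judge cost, which is usually still far below the cost of human annotation.
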
### Continuous prompt improvement

```
1. Version every prompt.
2. A/B test.
3. Score on the eval set.
4. Best prompt → deploy.

→ The prompt can change as often as daily.
```

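Steps 3 and 4 amount to scoring every version on the same eval set and picking the winner. A minimal sketch follows; `run` and `grade` are hypothetical callables supplied by the caller (the model call and the grader), not functions from any library.

```python
def best_prompt(prompts, eval_set, run, grade):
    """Return (version, score) of the prompt with the highest mean eval score.

    prompts: dict of version -> template.
    eval_set: list of (query, expected) pairs.
    run(template, query) -> answer; grade(answer, expected) -> float in [0, 1].
    """
    scores = {}
    for version, template in prompts.items():
        total = sum(grade(run(template, q), expected) for q, expected in eval_set)
        scores[version] = total / len(eval_set)
    winner = max(scores, key=scores.get)
    return winner, scores[winner]
```
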
### Retrieval (RAG) updates

```
1. Add new docs to the vector DB.
2. Update / delete old docs.
3. Embedding model upgrade → re-embed.

→ Continuous.
```

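All three steps can be driven by one comparison if the stored key covers both the document text and the embedding model name. This is a sketch with in-memory dicts standing in for the vector DB's metadata; `content_key` and `plan_sync` are illustrative names.

```python
import hashlib


def content_key(text: str, embed_model: str) -> str:
    """Hash of the text plus the embedding model, so a model upgrade changes every key."""
    return hashlib.sha256(f"{embed_model}\n{text}".encode("utf-8")).hexdigest()


def plan_sync(indexed: dict, current_docs: dict, embed_model: str):
    """Decide which documents to (re-)embed and which to delete.

    indexed: doc_id -> content key already stored in the vector DB.
    current_docs: doc_id -> current text.
    Returns (to_upsert, to_delete) lists of doc ids.
    """
    to_upsert = [
        doc_id
        for doc_id, text in current_docs.items()
        if indexed.get(doc_id) != content_key(text, embed_model)  # new or changed
    ]
    to_delete = [doc_id for doc_id in indexed if doc_id not in current_docs]
    return to_upsert, to_delete
```
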
### Cost vs quality

```
Every update costs:
- Annotation cost.
- Compute (eval, re-train).
- Deploy risk.

→ The quality gain has to justify the cost.
```

### Model registry (continuous)

```
v1.0: initial.
v1.1: prompt update.
v1.2: new RAG data.
v2.0: fine-tune.

→ Track metrics for every version.
```

→ [[MLOps_Model_Registry]].

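Tracking a metric per version makes regressions easy to spot. The in-memory registry below is a sketch only; the class and method names are hypothetical, and a real setup would persist this in a registry service.

```python
from dataclasses import dataclass, field


@dataclass
class ModelVersion:
    version: str
    change: str
    metrics: dict = field(default_factory=dict)


class Registry:
    def __init__(self):
        self._versions = []

    def register(self, version, change, **metrics):
        """Record a version together with its eval metrics."""
        self._versions.append(ModelVersion(version, change, dict(metrics)))

    def regressions(self, metric, tolerance=0.0):
        """Versions whose metric dropped versus the previous version."""
        out = []
        for prev, cur in zip(self._versions, self._versions[1:]):
            if metric in prev.metrics and metric in cur.metrics:
                if cur.metrics[metric] < prev.metrics[metric] - tolerance:
                    out.append(cur.version)
        return out
```
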
### Failure mode tracking

```
"User received 'I don't know'" = bad.
Categorize:
- Out of scope.
- Hallucination.
- Refusal (safety too strict).
- Format violation.

→ Recurring patterns set the fix priority.
```

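Once each bad case carries one of these labels, the fix priority is a frequency count. A minimal sketch, assuming labels are stored as the snake_case strings below (an illustrative naming choice):

```python
from collections import Counter

CATEGORIES = {"out_of_scope", "hallucination", "refusal", "format_violation"}


def fix_priority(labels):
    """Rank failure categories by frequency, most common first."""
    unknown = set(labels) - CATEGORIES
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    return Counter(labels).most_common()
```
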
### Continuous eval (CI)

```yaml
# Assumes eval.py prints the pass rate (0 to 1) as the last line of stdout.
- run: |
    PASS_RATE=$(python eval.py --model latest --golden ./golden.jsonl | tail -n 1)
    # [[ -lt ]] compares integers only, so do the float comparison in awk.
    awk -v r="$PASS_RATE" 'BEGIN { exit !(r >= 0.85) }'
```

→ A quality gate for every release.

### Shadow deployment

```ts
// Production serves v1.
// v2 also runs, but its output is only logged.
const v1 = await model.v1.generate(query);

// Fire-and-forget shadow call: not awaited, so it adds no user-facing latency.
void model.v2.generate(query)
  .then((v2) => log({ query, v1, v2 }))
  .catch((err) => log({ query, shadowError: String(err) }));

return v1;
```

→ Compare without user-facing risk.

### Canary

```ts
// 1% of traffic goes to v2.
if (Math.random() < 0.01) {
  return v2.generate(query);
}
return v1.generate(query);
```

→ Ramp up gradually.

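Per-request randomness means one user can bounce between versions. For a gradual ramp, a deterministic bucket per user keeps assignments stable, and anyone included at 1% stays included at 5% or 10%. A sketch, with the function name and salt chosen for illustration:

```python
import hashlib


def in_canary(user_id: str, percent: float, salt: str = "v2-canary") -> bool:
    """Return True if this user falls inside the first `percent` of traffic."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64   # uniform in [0, 1)
    return bucket < percent / 100
```

Changing the salt for each rollout avoids always exposing the same users to new versions first.
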
### Observability (tracing)

```python
# LangSmith / Helicone / Langfuse.
from langsmith import traceable

@traceable  # LangSmith decorator; the other tools offer their own equivalents
def chat(query):
    # Inputs, outputs, and latency are recorded automatically.
    ...
```

→ Every production query is visible.

### Prompt version

```ts
const PROMPTS = {
  v1: 'Answer briefly: {q}',
  v2: 'Provide a 3-sentence answer: {q}',
  v3: 'You are an expert. Answer with citations: {q}',
} as const;

// Fall back to v1 when the user has no assigned bucket.
const version: keyof typeof PROMPTS = user.bucket ?? 'v1';
const prompt = PROMPTS[version].replace('{q}', query);
```

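### LangSmith eval

```python
from langchain.smith import RunEvalConfig, run_on_dataset
from langsmith import Client

client = Client()
# `chain` is the runnable under test, defined elsewhere.
# This uses the langchain.smith helper; the entry point and evaluator names
# vary by SDK version, so check them against the installed release.
results = run_on_dataset(
    client=client,
    dataset_name='production-golden',
    llm_or_chain_factory=chain,
    evaluation=RunEvalConfig(evaluators=['qa', 'context_qa']),
)
```
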
### Continuous improvement loop

```
1. Production logs.
2. Bad case detection (LLM-judge).
3. Annotation queue.
4. Human review.
5. Add to golden set.
6. Re-eval current model.
7. Iterate (prompt / RAG / fine-tune).
8. Deploy.
9. Monitor.
10. Repeat.

→ The "flywheel" model.
```

### When NOT continuous?

```
- Small systems / static knowledge.
- Compliance-critical (changes carry risk).
- Regulated domains (medical, legal).

→ Every change = approval + audit.
```

### Real-world

- **ChatGPT**: monthly updates.
- **GitHub Copilot**: continuous training.
- **Notion AI**: feedback-driven.
- **Cursor**: model changes per PR.

### Pitfalls

```
- Ignoring drift: silent degradation.
- Deploying without A/B: fake improvement.
- Annotation bias: bad golden set.
- Eval set leak: fake score.
- Daily deploys plus bugs: chaos.
```

### Privacy

```
- Training on user data: requires consent.
- Strip PII.
- Right to delete (GDPR).
- Audit log.

→ A big deal for compliance.
```

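PII stripping is best applied before anything reaches the logs. The sketch below handles only emails and phone-like numbers with deliberately simple regular expressions; the patterns and placeholder tokens are assumptions, and real deployments usually need a dedicated PII detection step on top.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def strip_pii(text: str) -> str:
    """Replace emails and phone-like numbers with placeholders before logging."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)
```
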
### Rate of update

```
- Daily prompt: OK (low risk).
- Weekly fine-tune: OK with eval.
- Monthly model: stable.
- Quarterly major: big change + announcement.
```

### Tools

```
- LangSmith / Langfuse: trace + eval.
- Helicone: observability.
- Promptfoo: regression eval.
- Modal / W&B: training.
- Inngest: continuous workflow.
```

## 🤔 Decision criteria

| Task | Recommendation |
|---|---|
| Trace | LangSmith / Helicone |
| Eval | Promptfoo / Inspect |
| A/B | Bucket + log + analysis |
| Drift | KS / PSI on embedding |
| Continuous train | DPO + golden set |
| Annotation | Manual + LLM-judge |

## ❌ Anti-patterns

- **No feedback collection**: blind.
- **No A/B test**: fake improvement.
- **Eval leak**: fake score.
- **Manual annotation only**: slow.
- **No drift detection**: silent decay.
- **Ignoring privacy**: leak.

## 🤖 LLM usage hints

- Production traffic is the dataset.
- A/B test + LLM-judge + golden set.
- Detect drift (KS / PSI).
- Continuous = flywheel.

## 🔗 Related documents

- [[MLOps_Model_Monitoring]]
- [[AI_RLHF_DPO_Basics]]
- [[AI_Eval_Framework_Modern]]