---
id: ai-continuous-learning-system
title: AI Continuous Learning — feedback loop / RLAIF
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [ai, continuous, vibe-coding]
tech_stack: { language: "Python / TS", applicable_to: ["AI"] }
applied_in: []
aliases: [continuous learning, feedback loop, RLAIF, online learning, model drift, A/B test, golden set]
---

# AI Continuous Learning

> Production models go stale. The remedy is a loop: **feedback collection → eval → fine-tune / prompt update**. RLAIF (AI feedback) is the modern variant.

## 📖 Key concepts
- Production traffic is the dataset.
- Collect feedback (thumbs up/down, clicks).
- Detect drift.
- A/B test every change.

## 💻 Code patterns

### Feedback collection
```ts
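// Metadata logged for every LLM response
const responseId = crypto.randomUUID();
log({ responseId, query, response, model, latency, cost });

// Explicit user feedback, joined to the response by its ID
app.post('/feedback', (req, res) => {
  log({ responseId: req.body.id, rating: req.body.rating });
  res.sendStatus(204);  // acknowledge; without a reply the request hangs
});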
```

### Implicit feedback
```
- Click (helpful).
- Dwell time (engaged).
- Re-query (unhelpful).
- Conversation continuation.

→ Far more plentiful than explicit feedback (thumbs).
```

### Golden set evolution
```
1. Start with 50 manually written cases.
2. Add bad cases from production.
3. Send ambiguous production cases to expert review.
4. Refresh monthly and at every release.

→ The golden set keeps growing.
```

### Drift detection
```python
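from scipy.stats import ks_2samp

# Compare embedding distributions: reference vs. recent production.
ref_embeds = train_embeddings            # old (reference) embeddings, shape (n, d)
prod_embeds = recent_query_embeddings    # recent production query embeddings

# KS test (or PSI). KS is one-dimensional, so this checks only dimension 0;
# in practice, repeat per dimension or test a 1-D projection.
stat, p = ks_2samp(ref_embeds[:, 0], prod_embeds[:, 0])
if p < 0.05:
    alert('drift detected')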
```
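The note names PSI as the alternative to the KS test but does not show it. A minimal sketch, assuming the embeddings have already been reduced to a 1-D summary (one dimension, or a projection); the function name `psi`, the 10 quantile bins, and the `eps` clipping are illustrative choices, not a fixed standard:

```python
import numpy as np

def psi(reference, production, bins=10, eps=1e-6):
    """Population Stability Index between two 1-D samples."""
    reference = np.asarray(reference, dtype=float)
    production = np.asarray(production, dtype=float)
    # Quantile bin edges from the reference sample; widen the outer edges
    # so every production value falls inside some bin.
    edges = np.quantile(reference, np.linspace(0.0, 1.0, bins + 1))
    edges[0] = min(edges[0], production.min())
    edges[-1] = max(edges[-1], production.max())
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    prod_frac = np.histogram(production, bins=edges)[0] / len(production)
    # Clip to avoid log(0) when a bin is empty.
    ref_frac = np.clip(ref_frac, eps, None)
    prod_frac = np.clip(prod_frac, eps, None)
    return float(np.sum((prod_frac - ref_frac) * np.log(prod_frac / ref_frac)))
```

A commonly quoted rule of thumb reads PSI below 0.1 as stable and above 0.25 as a significant shift; treat those cutoffs as a starting point and calibrate them on your own traffic.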

### Performance drift
```python
# Daily user satisfaction (pseudocode: avg and rolling_30day are placeholders).
satisfaction = avg(thumbs_up / total)
trend = rolling_30day(satisfaction)

if trend.slope < 0:
    alert('quality declining')
```
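One way to make the trend check concrete is a least-squares slope over the last 30 daily thumbs-up rates. This is a sketch under assumptions: the inputs are per-day counts ordered oldest to newest, days without feedback are skipped, and `satisfaction_slope` is a hypothetical helper name.

```python
import numpy as np

def satisfaction_slope(thumbs_up, total, window=30):
    """Least-squares slope (per day) of the daily thumbs-up rate over the last `window` days."""
    up = np.asarray(thumbs_up, dtype=float)[-window:]
    n = np.asarray(total, dtype=float)[-window:]
    days = np.arange(len(n))
    mask = n > 0                      # skip days with no feedback at all
    if mask.sum() < 2:
        return 0.0                    # not enough points to fit a line
    rate = up[mask] / n[mask]
    slope, _intercept = np.polyfit(days[mask], rate, 1)
    return float(slope)
```

A bare `slope < 0` check fires on noise, so alerting on a slope below a small negative threshold (chosen from historical variance) is usually the safer design.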

### A/B test (model / prompt)
```python
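import hashlib

def get_response(query, user):
    # Send 10% of users to B. Python's built-in hash() is salted per process
    # for strings, so use a stable digest to keep each user in one bucket.
    digest = hashlib.sha256(str(user.id).encode()).hexdigest()
    if int(digest, 16) % 100 < 10:
        return new_model.generate(query), 'B'
    else:
        return current_model.generate(query), 'A'

# Log per request:
# bucket: A | B
# metric: clicked, dwell, ...

# After 1 week:
# A: 0.65 satisfaction
# B: 0.70
# B wins → roll out.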
```
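Whether "0.70 beats 0.65" justifies a rollout depends on the sample size behind each bucket. A minimal sketch of a two-sided two-proportion z-test using only the standard library; the function name is illustrative, and it assumes each request yields a binary satisfied/unsatisfied outcome and that the pooled rate is strictly between 0 and 1:

```python
import math

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference between two proportions. Returns (z, p_value)."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    # Standard error under the null hypothesis that both buckets share one rate.
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution, via erfc.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

With a 90/10 split, 0.65 vs. 0.70 on 9,000 and 1,000 requests clears the usual 0.05 level, while the same rates on 100 and 10 requests do not.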

### RLHF / DPO update
```
1. Collect production conversations.
2. Annotators (or an LLM-as-judge) pick the preferred response.
3. Train a new model with DPO.
4. A/B test.
5. Replace the old model.

→ Every month or quarter.
```

→ [[AI_RLHF_DPO_Basics]].

### RLAIF (AI feedback)
```
Human feedback is expensive.
- An LLM (judge) picks the preferred response.
- Cheap and scalable.
- Quality is lower than human labels, but often "good enough".

→ In the style of Anthropic's Constitutional AI.
```
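A sketch of how judge-based labeling might produce preference pairs. The `judge` callable stands in for an LLM call and is an assumption, as is its `"first"`/`"second"` return convention. Asking twice with the answers swapped is one common way to filter out position bias; the output uses the prompt/chosen/rejected layout commonly seen in DPO datasets.

```python
from typing import Callable, Optional

def label_preference(prompt: str, a: str, b: str,
                     judge: Callable[[str, str, str], str]) -> Optional[dict]:
    """Ask the judge twice with the answers swapped; keep the pair only if both verdicts agree."""
    first = judge(prompt, a, b)    # expected to return "first" or "second"
    second = judge(prompt, b, a)   # same answers, swapped positions
    if first == "first" and second == "second":
        return {"prompt": prompt, "chosen": a, "rejected": b}
    if first == "second" and second == "first":
        return {"prompt": prompt, "chosen": b, "rejected": a}
    return None  # inconsistent verdicts (possible position bias): drop the pair

def toy_judge(prompt: str, first: str, second: str) -> str:
    # Stand-in for a real LLM judge: simply prefers the longer answer.
    return "first" if len(first) >= len(second) else "second"
```

Dropping inconsistent pairs costs some data but keeps the noisiest labels out of training.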

### Continuous prompt improvement
```
1. Version every prompt.
2. A/B test.
3. Score on the eval set.
4. Deploy the best prompt.

→ The prompt may change daily.
```

### Retrieval (RAG) updates
```
1. Add new docs to the vector DB.
2. Update or delete old docs.
3. Embedding model upgrade → re-embed.

→ Continuous.
```
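The three update cases can be driven from one comparison if a hash is stored next to each vector. A sketch, assuming the index keeps a `doc_id -> hash` map (an assumption about your store, not a feature of any particular vector DB); mixing the embedding model name into the hash makes a model upgrade invalidate every entry automatically.

```python
import hashlib

def content_hash(text: str, embed_model: str) -> str:
    """Hash of the document text plus the embedding model that produced its vector."""
    return hashlib.sha256(f"{embed_model}\n{text}".encode("utf-8")).hexdigest()

def plan_sync(indexed_hashes: dict, current_docs: dict, embed_model: str):
    """Return (ids_to_upsert, ids_to_delete).

    indexed_hashes: doc_id -> hash stored alongside the vector.
    current_docs:   doc_id -> current document text.
    """
    upsert = [doc_id for doc_id, text in current_docs.items()
              if indexed_hashes.get(doc_id) != content_hash(text, embed_model)]
    delete = [doc_id for doc_id in indexed_hashes if doc_id not in current_docs]
    return upsert, delete
```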

### Cost vs quality
```
Every update incurs:
- Annotation cost.
- Compute (eval, re-train).
- Deploy risk.

→ The quality gain must justify the cost.
```

### Model registry (continuous)
```
v1.0: initial.
v1.1: prompt update.
v1.2: new RAG data.
v2.0: fine-tune.

→ Track metrics for every version.
```

→ [[MLOps_Model_Registry]].

### Failure mode tracking
```
A user receiving "I don't know" = bad.
Categorize:
- Out of scope.
- Hallucination.
- Refusal (safety filter too strict).
- Format violation.

→ Recurring patterns set the fix priority.
```
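Once failures carry a category label (from a human or an LLM judge), fix priority can start as a plain frequency count. A minimal sketch; the `category` field name and the snake_case label strings are assumptions:

```python
from collections import Counter

CATEGORIES = {"out_of_scope", "hallucination", "refusal", "format_violation"}

def fix_priority(labeled_failures):
    """Rank failure categories by frequency; unknown labels are grouped under 'other'."""
    counts = Counter(
        f["category"] if f["category"] in CATEGORIES else "other"
        for f in labeled_failures
    )
    return counts.most_common()  # list of (category, count), most frequent first
```

A growing `other` bucket is a hint that the taxonomy itself needs a new category.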

### Continuous eval (CI)
```yaml
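- run: python eval.py --model latest --golden ./golden.jsonl
- run: |
    # Bash's -lt is integer-only, so compare the float with awk.
    # Assumes an earlier step made PASS_RATE available to this step.
    awk -v r="$PASS_RATE" 'BEGIN { exit !(r + 0 >= 0.85) }'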
```

→ A quality gate for every release.
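The note does not show `eval.py`. A sketch of the core it might contain, assuming each line of the golden JSONL has `input` and `expected` fields (hypothetical names) and that `predict` and `passes` are supplied by the caller:

```python
import json

def pass_rate(golden_lines, predict, passes):
    """Share of golden cases for which passes(expected, predict(input)) is true."""
    cases = [json.loads(line) for line in golden_lines if line.strip()]
    if not cases:
        raise ValueError("golden set is empty")
    passed = sum(1 for c in cases if passes(c["expected"], predict(c["input"])))
    return passed / len(cases)

def gate(rate, threshold=0.85):
    """Exit code for CI: 0 when the pass rate meets the threshold, 1 otherwise."""
    return 0 if rate >= threshold else 1
```

Having the script itself exit with `gate(rate)` (for example through `sys.exit`) keeps the threshold check in one place and avoids passing the rate between CI steps.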

### Shadow deployment
```ts
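// Production serves v1.
// v2 also runs, but its output is only logged.

const v1 = await model.v1.generate(query);
// Fire and forget: do not await, and catch errors so the shadow call cannot affect the user.
void model.v2.generate(query)
  .then((v2) => log({ query, v1, v2 }))
  .catch((err) => log({ query, shadowError: String(err) }));

return v1;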
```

→ Compare without risk.

### Canary
```ts
// 1% of traffic goes to v2.
if (Math.random() < 0.01) {
    return v2.generate(query);
}
return v1.generate(query);
```

→ Ramp up gradually.
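Per-request randomness lets the same user flip between versions. A sketch of a sticky alternative, assuming a stable `user_id` is available; the `salt` value and function name are illustrative. Each user maps to a fixed point in [0, 1), so raising `percent` only adds users to the canary and never removes them.

```python
import hashlib

def in_canary(user_id: str, percent: float, salt: str = "canary-v2") -> bool:
    """True if this user falls inside the first `percent` of the hash space."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    # First 8 bytes as an integer, scaled to a uniform value in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < percent / 100.0
```

Changing the salt per experiment avoids always exposing the same users to every canary.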

### Observability (tracing)
```python
# LangSmith / Helicone / Langfuse.
@trace  # placeholder name; each tool provides its own decorator or integration
def chat(query):
    # Recorded automatically.
    ...
```

→ Every production query is visible.

### Prompt version
```ts
const PROMPTS = {
  v1: 'Answer briefly: {q}',
  v2: 'Provide a 3-sentence answer: {q}',
  v3: 'You are an expert. Answer with citations: {q}',
};

const prompt = PROMPTS[user.bucket ?? 'v1'];
```

### LangSmith eval
```python
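from langchain.smith import RunEvalConfig  # legacy LangChain eval config
from langsmith import Client

client = Client()
# Evaluator names are kept from the original note; check them against the installed version.
results = client.run_on_dataset(
    dataset_name='production-golden',
    llm_or_chain=chain,
    evaluation=RunEvalConfig(evaluators=['qa', 'context-relevance']),
)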
```

### Continuous improvement loop
```
1. Production logs.
2. Bad case detection (LLM-judge).
3. Annotation queue.
4. Human review.
5. Add to golden set.
6. Re-eval current model.
7. Iterate (prompt / RAG / fine-tune).
8. Deploy.
9. Monitor.
10. Repeat.

→ The "flywheel" model.
```

### When NOT continuous?
```
- Small systems / static knowledge.
- Compliance-critical systems (change risk).
- Regulated domains (medical, legal).

→ Every change = approval + audit.
```

### Real-world
- **ChatGPT**: roughly monthly updates.
- **GitHub Copilot**: continuous training.
- **Notion AI**: feedback-driven.
- **Cursor**: a model change with every PR.

### Pitfalls
```
- Ignoring drift: silent degradation.
- Deploying without A/B: fake improvement.
- Annotation bias: bad golden set.
- Eval set leak: inflated scores.
- Daily deploys plus bugs: chaos.
```
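The eval-leak pitfall can be partly guarded with an overlap check between the golden set and any training data. A minimal sketch that catches only exact matches after case and whitespace normalization; near-duplicates and paraphrases need fuzzier matching (for example embedding similarity), which this does not attempt. Function names are illustrative.

```python
import hashlib
import re

def fingerprint(text: str) -> str:
    """Hash of the text after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", text.strip().lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def leaked_cases(golden_inputs, training_inputs):
    """Golden inputs that also appear (after normalization) in the training data."""
    train = {fingerprint(t) for t in training_inputs}
    return [g for g in golden_inputs if fingerprint(g) in train]
```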

### Privacy
```
- Training on user data: requires consent.
- Strip PII.
- Right to delete (GDPR).
- Audit log.

→ A major compliance concern.
```
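As a rough illustration of the PII-stripping step, the sketch below masks email addresses and phone-like digit runs before logs enter a training set. The two regexes are deliberately simple assumptions: they miss names, addresses, and IDs, and the phone pattern also masks other long digit sequences, so this is a first pass rather than a compliance-grade scrubber.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# A digit, then at least 7 digits/separators, then a digit; optional leading "+".
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def strip_pii(text: str) -> str:
    """Replace emails and phone-like sequences with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)
```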

### Rate of update
```
- Daily prompt: OK (low risk).
- Weekly fine-tune: OK with eval.
- Monthly model: stable.
- Quarterly major: big change + announcement.
```

### Tools
```
- LangSmith / Langfuse: trace + eval.
- Helicone: observability.
- Promptfoo: regression eval.
- Modal / W&B: training.
- Inngest: continuous workflow.
```

## 🤔 Decision criteria
| Task | Recommendation |
|---|---|
| Trace | LangSmith / Helicone |
| Eval | Promptfoo / Inspect |
| A/B | Bucket + log + analysis |
| Drift | KS / PSI on embedding |
| Continuous train | DPO + golden set |
| Annotation | Manual + LLM-judge |

## ❌ Anti-patterns
- **No feedback collection**: flying blind.
- **No A/B test**: fake improvement.
- **Eval leak**: inflated scores.
- **Manual annotation only**: slow.
- **No drift detection**: silent decay.
- **Ignoring privacy**: data leaks.

## 🤖 Hints for LLM use
- Production is the dataset.
- A/B test + LLM-judge + golden set.
- Drift detect (KS / PSI).
- Continuous = flywheel.

## 🔗 Related documents
- [[MLOps_Model_Monitoring]]
- [[AI_RLHF_DPO_Basics]]
- [[AI_Eval_Framework_Modern]]