[G1-Sync] Manual knowledge update

2026-05-09 21:08:02 +09:00
parent f0befc887a
commit 93ec7e9056
363 changed files with 68333 additions and 64 deletions
@@ -0,0 +1,166 @@
+---
+id: ai-llm-eval-patterns
+title: LLM Evaluation - Golden Set / LLM-as-Judge / Regression
+category: Coding
+status: draft
+source_trust_level: B
+verification_status: conceptual
+created_at: 2026-05-09
+updated_at: 2026-05-09
+tags: [ai, llm, eval, testing, vibe-coding]
+tech_stack: { language: "TS / Python", applicable_to: ["Backend"] }
+applied_in: []
+aliases: [LLM eval, golden dataset, LLM-as-judge, regression, Promptfoo, Braintrust]
+---
+
+# LLM Evaluation
+
+> "It feels better" is not a measurement. Use a **golden dataset + automated scoring** to detect regressions whenever the prompt or the model changes. Tools: Promptfoo / Braintrust / LangSmith.
+
+## 📖 Core Concepts
+- Golden set: pairs of input and expected output.
+- Metric: exact match / similarity / structured / LLM-as-judge.
+- Eval = unit test for an LLM. Run it on every PR.
+- LLM-as-judge: when the correct answer is free-form, a different LLM does the scoring.
+
+## 💻 Code Patterns
+
+### Simple hand-rolled eval
+```ts
+const cases = [
+  { input: '2+2', expected: '4' },
+  { input: 'capital of France', expected: 'Paris' },
+];
+
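+// `callLLM` stands in for your own model-call helper that returns the response text.
+let pass = 0;
+for (const c of cases) {
+  const out = await callLLM(c.input);
+  // Substring match tolerates extra wording around the expected answer.
+  if (out.includes(c.expected)) pass++;
+  else console.log('FAIL', c.input, '→', out);
+}
+console.log(`${pass}/${cases.length}`); // passed / total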
+```
+
+### Promptfoo (yaml)
+```yaml
+# promptfooconfig.yaml
+prompts:
+  - "Answer concisely: {{question}}"
+
+providers:
+  - openai:gpt-4o-mini
+  - openai:gpt-4o
+  - anthropic:claude-haiku-4-5
+
+tests:
+  - vars: { question: "Capital of France?" }
+    assert:
+      - type: contains
+        value: "Paris"
+      - type: latency
+        threshold: 2000
+      - type: cost
+        threshold: 0.001
+
+  - vars: { question: "Bank vault security tips" }
+    assert:
+      - type: llm-rubric
+        value: "Lists at least 3 security measures, mentions surveillance"
+```
+
+```bash
+promptfoo eval
+```
+
+### LLM-as-judge
+```ts
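+import OpenAI from 'openai';
+
+// Reads OPENAI_API_KEY from the environment.
+const openai = new OpenAI();
+
+// Scores one output against free-form criteria. Pick a judge model that differs from the model under test.
+async function judge(input: string, output: string, criteria: string): Promise<{score: number, reason: string}> {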
+  const r = await openai.chat.completions.create({
+    model: 'gpt-4o',
+    messages: [
+      { role: 'system', content: 'You are a strict evaluator. Score 0-5. Output JSON: {"score":N,"reason":"..."}' },
+      { role: 'user', content: `Input: ${input}\nOutput: ${output}\nCriteria: ${criteria}` },
+    ],
+    response_format: { type: 'json_object' },
+  });
+  return JSON.parse(r.choices[0].message.content!);
+}
+```
+
+### Pairwise comparison (A vs B)
+```ts
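+// Experiment: given the outputs of two prompts, decide which one is better.
+// Assumes an initialized `openai` client from the OpenAI Node SDK.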
+async function pairwise(input: string, outA: string, outB: string) {
+  const r = await openai.chat.completions.create({
+    model: 'gpt-4o',
+    messages: [{ role: 'user', content: `Compare A and B for "${input}".\nA: ${outA}\nB: ${outB}\nWhich is better and why? JSON: {"winner":"A"|"B"|"tie","reason":"..."}` }],
+    response_format: { type: 'json_object' },
+  });
+  return JSON.parse(r.choices[0].message.content!);
+}
+```
+
+### Structured output validation
+```ts
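+import { Recipe } from './schemas'; // assumed to be a Zod schema
+
+// Assumes `callLLM` returns the raw response text, so parse it as JSON before validating.
+const out = await callLLM(prompt);
+const parsed = Recipe.safeParse(JSON.parse(out));
+// Log the validation error first; a failed `expect` throws and would skip the log.
+if (!parsed.success) console.log(parsed.error);
+expect(parsed.success).toBe(true);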
+```
+
+### Latency / cost tracking
+```ts
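+const start = Date.now();
+const r = await openai.chat.completions.create({...}); // request params elided
+const ms = Date.now() - start; // wall-clock latency in ms
+const usage = r.usage!;
+// USD per token: 2.5e-6 = $2.50 per 1M input tokens, 1e-5 = $10 per 1M output tokens.
+// Substitute the current prices for the model you call.
+const cost = usage.prompt_tokens * 2.5e-6 + usage.completion_tokens * 1e-5;
+
+// `track` is a placeholder for your metrics/analytics client.
+track('llm.eval', { ms, cost, prompt_tokens: usage.prompt_tokens });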
+```
+
+### CI regression check
+```yaml
+# .github/workflows/llm-eval.yml
+on: [pull_request]
+jobs:
+  eval:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - run: npm i
+      - run: npx promptfoo eval --output report.json
+      - run: node scripts/check-regression.js report.json
+        # fail when the score drops 5% or more below the baseline
+```
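
The workflow above calls `scripts/check-regression.js`, which the note does not show. As a minimal sketch of the comparison it might perform, assuming the pass counts have already been read from the report and from a stored baseline: `EvalSummary`, `passRate`, and `hasRegressed` are hypothetical names, and reading "5%" as a relative drop (rather than 5 percentage points) is also an assumption.

```typescript
// Hypothetical summary shape; how it is extracted from report.json is left out.
interface EvalSummary {
  passed: number; // test cases that passed
  total: number;  // test cases that ran
}

function passRate(s: EvalSummary): number {
  if (s.total <= 0) throw new Error('eval summary has no test cases');
  return s.passed / s.total;
}

// True when the current pass rate fell by at least `maxRelativeDrop`
// (0.05 = 5%) relative to the baseline pass rate.
function hasRegressed(
  baseline: EvalSummary,
  current: EvalSummary,
  maxRelativeDrop = 0.05,
): boolean {
  const base = passRate(baseline);
  const cur = passRate(current);
  if (base === 0) return false; // nothing to regress from
  return (base - cur) / base >= maxRelativeDrop;
}

const baseline: EvalSummary = { passed: 45, total: 50 }; // 90% pass rate
const current: EvalSummary = { passed: 42, total: 50 };  // 84% pass rate
const regressed = hasRegressed(baseline, current);
console.log(regressed ? 'REGRESSION' : 'OK');
```

In a real CI script the process would exit with a non-zero code when `hasRegressed` returns true, so the job fails. The same check can be repeated for cost and latency totals.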
+
+## 🤔 Decision Criteria
+| Output type | Scoring |
+|---|---|
+| Exact answer | exact match / contains |
+| JSON / structured | Schema parse |
+| Classification | accuracy / F1 |
+| Free text | LLM-as-judge / ROUGE / BLEU |
+| Comparison (which is better?) | pairwise A/B |
+| Real user signals | thumbs up/down / re-ask rate |
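
The classification row in the table above has no code pattern elsewhere in the note, so here is a rough sketch of accuracy and macro-averaged F1 over parallel arrays of expected and predicted labels. The function names and the sample labels are illustrative, and averaging over the union of labels seen in either array is one common convention, not the only one.

```typescript
// Fraction of cases where the predicted label equals the expected label.
function accuracy(expected: string[], predicted: string[]): number {
  if (expected.length !== predicted.length || expected.length === 0) {
    throw new Error('expected and predicted must be non-empty and equal in length');
  }
  let correct = 0;
  for (let i = 0; i < expected.length; i++) {
    if (expected[i] === predicted[i]) correct++;
  }
  return correct / expected.length;
}

// Unweighted mean of per-label F1, where F1 = 2TP / (2TP + FP + FN).
function macroF1(expected: string[], predicted: string[]): number {
  if (expected.length !== predicted.length || expected.length === 0) {
    throw new Error('expected and predicted must be non-empty and equal in length');
  }
  const tp: Record<string, number> = {};
  const fp: Record<string, number> = {};
  const fn: Record<string, number> = {};
  const seen: Record<string, boolean> = {};
  for (let i = 0; i < expected.length; i++) {
    const e = expected[i];
    const p = predicted[i];
    seen[e] = true;
    seen[p] = true;
    if (e === p) {
      tp[e] = (tp[e] || 0) + 1;
    } else {
      fn[e] = (fn[e] || 0) + 1; // the true label was missed
      fp[p] = (fp[p] || 0) + 1; // this label was predicted wrongly
    }
  }
  const labels = Object.keys(seen);
  let sum = 0;
  for (const label of labels) {
    const t = tp[label] || 0;
    const denom = 2 * t + (fp[label] || 0) + (fn[label] || 0);
    sum += denom === 0 ? 0 : (2 * t) / denom;
  }
  return sum / labels.length;
}

const expectedLabels = ['spam', 'spam', 'ham', 'ham'];
const predictedLabels = ['spam', 'ham', 'ham', 'ham'];
console.log(accuracy(expectedLabels, predictedLabels), macroF1(expectedLabels, predictedLabels));
```

Macro F1 weights every label equally, so it surfaces a weak minority class that plain accuracy can hide.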
+
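+## ❌ Anti-patterns
+- **Shipping to prod without evals**: regressions go undetected.
+- **Tiny test set (5 cases)**: results swing widely. 50+ cases recommended.
+- **Test set leak (used in training)**: inflated scores.
+- **LLM-as-judge with the same model doing the scoring**: self-preference bias.
+- **Ignoring cost / latency**: looking at accuracy alone lets cost explode.
+- **Running the full eval on every change, which blocks deploys**: run only a small hot set on each PR and the large suite nightly.
+- **Subjective review only, no automation**: a human in the loop every time does not scale.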
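
To put numbers on the small-test-set item above, the sketch below uses the binomial standard error of a pass rate, sqrt(p(1-p)/n), as a rough noise estimate. The 0.8 pass rate is an arbitrary example value, `passRateStdError` is an illustrative name, and the formula assumes independent test cases.

```typescript
// Standard error of an observed pass rate p over n independent cases
// (binomial approximation).
function passRateStdError(p: number, n: number): number {
  if (n <= 0 || p < 0 || p > 1) throw new Error('need n > 0 and 0 <= p <= 1');
  return Math.sqrt((p * (1 - p)) / n);
}

// Compare how noisy the same 80% pass rate is at different test-set sizes.
for (const n of [5, 50, 500]) {
  const se = passRateStdError(0.8, n);
  const perCase = (100 / n).toFixed(1);   // points moved by flipping one case
  const sePoints = (100 * se).toFixed(1); // standard error in points
  console.log(`n=${n}: one case = ${perCase} pts, std error ~ ${sePoints} pts`);
}
```

With 5 cases a single flipped result moves the score by 20 points, which is larger than the 5% regression threshold used in this note, so real regressions cannot be told apart from noise.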
+
+## 🤖 Hints for LLM Use
+- Prefer Promptfoo / Braintrust / LangSmith.
+- Use a different model for LLM-as-judge.
+- Apply a 5% regression threshold and track cost / latency alongside it.
+
+## 🔗 Related Documents
+- [[AI_Prompt_Engineering_Patterns]]
+- [[AI_Structured_Output_Zod]]
+- [[AI_RAG_Pattern_Basics]]