---
id: ai-llm-eval-patterns
title: LLM Evaluation - Golden Set / LLM-as-Judge / Regression
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [ai, llm, eval, testing, vibe-coding]
tech_stack: { language: "TS / Python", applicable_to: ["Backend"] }
applied_in: []
aliases: [LLM eval, golden dataset, LLM-as-judge, regression, Promptfoo, Braintrust]
---

# LLM Evaluation

> "It feels better" is not a measurement. Use a **golden dataset + automated scoring** to detect regressions whenever the prompt or the model changes. Tools: Promptfoo / Braintrust / LangSmith.

## 📖 Core concepts

- Golden set: pairs of input and expected output.
- Metric: exact match / similarity / structured (schema) / LLM-as-judge.
- Eval = unit test for the LLM. Run it on every PR.
- LLM-as-judge: when the correct answer is free-form, a different LLM does the scoring.

## 💻 Code patterns

### Simple home-grown eval

```ts
// callLLM is your own wrapper: (input: string) => Promise<string>
const cases = [
  { input: '2+2', expected: '4' },
  { input: 'capital of France', expected: 'Paris' },
];

let pass = 0;
for (const c of cases) {
  const out = await callLLM(c.input);
  // Substring match: tolerant of extra wording around the answer
  if (out.includes(c.expected)) pass++;
  else console.log('FAIL', c.input, '→', out);
}
console.log(`${pass}/${cases.length}`);
```

### Promptfoo (yaml)

```yaml
# promptfooconfig.yaml
prompts:
  - "Answer concisely: {{question}}"
providers:
  - openai:gpt-4o-mini
  - openai:gpt-4o
  - anthropic:messages:claude-haiku-4-5
tests:
  - vars: { question: "Capital of France?" }
    assert:
      - type: contains
        value: "Paris"
      - type: latency
        threshold: 2000 # milliseconds
      - type: cost
        threshold: 0.001 # USD per call
  - vars: { question: "Bank vault security tips" }
    assert:
      - type: llm-rubric
        value: "Lists at least 3 security measures, mentions surveillance"
```

```bash
promptfoo eval
```

### LLM-as-judge

```ts
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function judge(input: string, output: string, criteria: string): Promise<{ score: number; reason: string }> {
  const r = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      { role: 'system', content: 'You are a strict evaluator. Score 0-5.\nOutput JSON: {"score":N,"reason":"..."}' },
      { role: 'user', content: `Input: ${input}\nOutput: ${output}\nCriteria: ${criteria}` },
    ],
    // JSON mode: the reply is guaranteed to be parseable JSON
    response_format: { type: 'json_object' },
  });
  return JSON.parse(r.choices[0].message.content!);
}
```

### Pairwise comparison (A vs B)

```ts
// Experiment: given the outputs of two prompts, which one is better?
// Reuses the `openai` client created in the LLM-as-judge example.
async function pairwise(input: string, outA: string, outB: string) {
  const r = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{
      role: 'user',
      content: `Compare A and B for "${input}".\nA: ${outA}\nB: ${outB}\nWhich is better and why? JSON: {"winner":"A"|"B"|"tie","reason":"..."}`,
    }],
    response_format: { type: 'json_object' },
  });
  return JSON.parse(r.choices[0].message.content!);
}
```

### Structured output validation

```ts
// Recipe is assumed to be a Zod schema; expect comes from the test runner (Jest / Vitest).
import { Recipe } from './schemas';

const out = await callLLM(prompt);
// The model returns text, so parse it to an object before validating the shape.
const parsed = Recipe.safeParse(JSON.parse(out));
// Log the validation error first; a failing expect throws and would skip the log.
if (!parsed.success) console.log(parsed.error);
expect(parsed.success).toBe(true);
```

### Latency / cost tracking

```ts
const start = Date.now();
const r = await openai.chat.completions.create({...}); // {...}: request params elided
const ms = Date.now() - start;

const usage = r.usage!;
// Hard-coded rates: $2.50 per 1M prompt tokens, $10 per 1M completion tokens.
// Replace them with the pricing of the model you actually call.
const cost = usage.prompt_tokens * 2.5e-6 + usage.completion_tokens * 1e-5;
// track is your own telemetry helper.
track('llm.eval', { ms, cost, prompt_tokens: usage.prompt_tokens });
```

### CI regression

```yaml
# .github/workflows/llm-eval.yml
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm i
      - run: npx promptfoo eval --output report.json
      - run: node scripts/check-regression.js report.json
        # Fail when the score drops 5% or more below the baseline
```

## 🤔 Decision criteria

| Output type | Scoring |
|---|---|
| Exact answer | exact match / contains |
| JSON / structured | Schema parse |
| Classification | accuracy / F1 |
| Free text | LLM-as-judge / ROUGE / BLEU |
| Comparison (which is better?) | pairwise A/B |
| Real user signal | thumbs up/down / re-ask rate |

## ❌ Anti-patterns

- **Shipping to prod without evals**: regressions cannot be detected.
- **Tiny test set (5 cases)**: scores fluctuate too much. 50+ cases recommended.
- **Test set leak (used in training)**: the scores are inflated and misleading.
- **LLM-as-judge scored by the same model**: self-preference bias.
- **Ignoring cost / latency**: optimizing for accuracy alone lets costs explode.
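- **Full eval on every change blocks deploys**: run only a small hot set on every PR and the large set nightly.
- **Subjective only, no automation**: needing a human for every check does not scale.

## 🤖 LLM usage hints

- Prefer Promptfoo / Braintrust / LangSmith.
- Use a different model as the LLM-as-judge.
- Apply a 5% regression threshold, and track cost / latency alongside it.

## 🔗 Related documents

- [[AI_Prompt_Engineering_Patterns]]
- [[AI_Structured_Output_Zod]]
- [[AI_RAG_Pattern_Basics]]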
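
## 🧪 Appendix: regression check sketch

The CI workflow calls `scripts/check-regression.js`, which this note does not show. The block below is a minimal sketch of what such a check could look like, not the actual script. Everything in it is an assumption for illustration: the `EvalSummary` shape, the `checkRegression` name, reading the 5% threshold as a relative drop in pass rate, and the 20% limits for latency and cost. Mapping a real eval report (for example Promptfoo's `report.json`) into `EvalSummary` is left out, because the report layout depends on the tool and its version.

```typescript
// Hypothetical summary of one eval run; map your tool's report into this shape.
interface EvalSummary {
  passRate: number;     // fraction of passing cases, 0..1
  avgLatencyMs: number; // mean latency per case
  totalCostUsd: number; // total spend for the run
}

// Relative limits. Only the 5% pass-rate drop comes from this note;
// the latency and cost limits are illustrative assumptions.
interface Thresholds {
  maxPassRateDrop: number;
  maxLatencyIncrease: number;
  maxCostIncrease: number;
}

const DEFAULT_THRESHOLDS: Thresholds = {
  maxPassRateDrop: 0.05,
  maxLatencyIncrease: 0.2,
  maxCostIncrease: 0.2,
};

// Signed relative change from baseline to current (-0.1 means 10% lower).
function relativeChange(baseline: number, current: number): number {
  if (baseline === 0) return current === 0 ? 0 : Infinity;
  return (current - baseline) / baseline;
}

// Returns one message per violated threshold; an empty array means no regression.
function checkRegression(
  baseline: EvalSummary,
  current: EvalSummary,
  t: Thresholds = DEFAULT_THRESHOLDS,
): string[] {
  const failures: string[] = [];

  const passDelta = relativeChange(baseline.passRate, current.passRate);
  if (-passDelta >= t.maxPassRateDrop) {
    failures.push(`pass rate dropped ${(-passDelta * 100).toFixed(1)}%`);
  }

  const latencyDelta = relativeChange(baseline.avgLatencyMs, current.avgLatencyMs);
  if (latencyDelta > t.maxLatencyIncrease) {
    failures.push(`latency rose ${(latencyDelta * 100).toFixed(1)}%`);
  }

  const costDelta = relativeChange(baseline.totalCostUsd, current.totalCostUsd);
  if (costDelta > t.maxCostIncrease) {
    failures.push(`cost rose ${(costDelta * 100).toFixed(1)}%`);
  }

  return failures;
}

// Example with made-up numbers: accuracy holds, but cost doubles.
const baselineRun: EvalSummary = { passRate: 0.9, avgLatencyMs: 1200, totalCostUsd: 0.4 };
const currentRun: EvalSummary = { passRate: 0.9, avgLatencyMs: 1250, totalCostUsd: 0.8 };
console.log(checkRegression(baselineRun, currentRun));
```

In a CI script, a non-empty result would translate into a non-zero exit code so the pull request fails. Checking cost and latency in the same pass keeps an accuracy gain from hiding a cost blow-up.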