---
id: ai-eval-framework-deep
title: LLM Eval Framework - Inspect / Promptfoo / Braintrust
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [ai, llm, eval, framework, vibe-coding]
tech_stack: { language: "TS / Python", applicable_to: ["Backend"] }
applied_in: []
aliases: [Inspect AI, Promptfoo, Braintrust, LangSmith, Helicone, eval-driven development]
---

# LLM Eval Framework

> Eval-driven development. **Inspect AI (UK AISI), Promptfoo (OSS), Braintrust (managed), LangSmith (LangChain)**. Dataset + scorer + comparison.

## 📖 Core concepts

- Dataset: input + expected output.
- Scorer: grades the output (exact match / similarity / LLM judge).
- Run: model × prompt × dataset.
- Trace: the execution trace of each case.

## 💻 Code patterns

### Inspect AI (Python, UK AISI)

```python
from inspect_ai import Task, task, eval
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate

@task
def my_eval():
    return Task(
        dataset=[
            Sample(input='Capital of France?', target='Paris'),
            Sample(input='Capital of Korea?', target='Seoul'),
        ],
        solver=generate(),  # older Inspect releases call this argument `plan`
        scorer=match(),
    )

# Run
eval(my_eval(), model='anthropic/claude-opus-4-7')
```

→ Strong for AI safety evaluation.

### Promptfoo (TS / OSS)

```yaml
# promptfooconfig.yaml
description: "Customer support eval"

prompts:
  - "Answer the customer's question concisely:\n{{question}}"

providers:
  - openai:gpt-4o
  - openai:gpt-4o-mini
  - anthropic:claude-opus-4-7
  - anthropic:claude-haiku-4-5

tests:
  - vars: { question: "How do I reset my password?" }
    assert:
      - type: contains
        value: "/forgot-password"
      - type: llm-rubric
        value: "Provides clear step-by-step instructions"
      - type: latency
        threshold: 3000
      - type: cost
        threshold: 0.005
  - vars: { question: "Refund policy?" }
    assert:
      - type: contains-any
        value: ["30 days", "money back", "refund"]

defaultTest:
  options:
    cache: true
```

```bash
promptfoo eval
promptfoo view  # compare results in the web UI
```

### Promptfoo programmatic

```ts
import { evaluate } from 'promptfoo';

const result = await evaluate({
  prompts: ['Answer: {{q}}'],
  providers: ['openai:gpt-4o'],
  tests: [
    { vars: { q: 'capital of France' }, assert: [{ type: 'contains', value: 'Paris' }] },
  ],
});

// Count passing rows. The result shape can differ between promptfoo versions,
// so check the summary type of the version you have installed.
const passCount = result.results.filter((r) => r.success).length;
console.log(passCount, '/', result.results.length);
```

### Braintrust (managed, modern)

```ts
import { Eval } from 'braintrust';
import { Levenshtein, LLMClassifierFromTemplate } from 'autoevals';
import OpenAI from 'openai';

const openai = new OpenAI();

await Eval('My Project', {
  data: () => [
    { input: 'Capital of France?', expected: 'Paris' },
    { input: 'Capital of Korea?', expected: 'Seoul' },
  ],
  task: async (input) => {
    const r = await openai.chat.completions.create({...}); // request body elided
    return r.choices[0].message.content!;
  },
  scores: [
    Levenshtein,
    // LLM-graded yes/no classifier from the autoevals package
    LLMClassifierFromTemplate({
      name: 'CorrectCity',
      model: 'gpt-4o',
      promptTemplate:
        'Question: {{input}}\nAnswer: {{output}}\nDoes the answer contain the correct city? (Yes/No)',
      choiceScores: { Yes: 1, No: 0 },
    }),
  ],
});
```

→ Automatic web UI + comparison + regression detection.

### LangSmith (LangChain)

```ts
import { Client } from 'langsmith';
import { evaluate } from 'langsmith/evaluation';

const client = new Client();

// Dataset: create it once, then add examples by dataset name
await client.createDataset('capitals');
await client.createExamples({
  inputs: [{ question: 'Capital?' }],
  outputs: [{ answer: 'Paris' }],
  datasetName: 'capitals',
});

// Run + auto trace. `myAgent` and `exactMatch` are your own target function and evaluator.
await evaluate(myAgent, {
  data: 'capitals',
  evaluators: [exactMatch],
});
```

### LLM-as-judge (rubric)

```ts
// `llm` is a placeholder client; `complete()` is assumed to return the raw JSON string.
async function judge(input: string, output: string, criteria: string) {
  const r = await llm.complete({
    system: `You are a strict evaluator. Score 1-5 based on criteria.
Output JSON: { "score": N, "reason": "..." }`,
    user: `Input: ${input}\nOutput: ${output}\nCriteria: ${criteria}`,
    response_format: { type: 'json_object' },
  });
  return JSON.parse(r);
}

// Custom Braintrust scorers are plain functions passed in `scores`.
// They return a score in [0, 1], so the 1-5 rating is rescaled.
const helpful = async ({ input, output }: { input: string; output: string }) => {
  const { score, reason } = await judge(input, output, 'Is it helpful and concise?');
  return { name: 'helpful', score: (score - 1) / 4, metadata: { reason } };
};
// Usage: Eval('My Project', { data, task, scores: [helpful] });
```

### Pairwise (A vs B)

```ts
// `llm` is the same placeholder client as in the rubric judge.
async function pairwise(input: string, outA: string, outB: string) {
  const r = await llm.complete({
    user: `Compare A and B for query "${input}".\nA: ${outA}\nB: ${outB}\nWhich is better? JSON: { "winner": "A"|"B"|"tie", "reason": "..." }`,
    response_format: { type: 'json_object' },
  });
  return JSON.parse(r);
}
```

→ Pairwise comparison aligns better with human judgment than absolute scores.

### Regression detection

```ts
// Compare against the baseline in CI.
// `runEval` and `loadBaseline` are project-specific helpers.
const current = await runEval();
const baseline = await loadBaseline();

if (current.score < baseline.score - 0.05) {
  console.error(`Regression: ${baseline.score} → ${current.score}`);
  process.exit(1);
}
```

```yaml
# CI
- name: LLM eval
  run: promptfoo eval --output report.json
- name: Compare to baseline
  run: node scripts/regression-check.js report.json
```

### Trace + debug

```ts
// LangSmith / Braintrust trace
// Input / output / tokens / latency / cost are recorded automatically for every LLM call
// Failed case → inspect it step by step in the web UI
```

### Diverse dataset

```
- Edge cases (empty, very long, special chars)
- Adversarial (prompt injection)
- Multilingual
- Real production logs (sampled)
- Synthetic (LLM-generated)
```

### Synthetic data

```ts
// `llm` is a placeholder client; `complete()` is assumed to return the raw JSON string.
async function generateTestCases(n: number) {
  const r = await llm.complete({
    user: `Generate ${n} customer support questions and ideal answers.
Output JSON: { "cases": [{ "question": "...", "answer": "..." }] }`,
    response_format: { type: 'json_object' },
  });
  return JSON.parse(r).cases;
}
```

→ A fast way to bootstrap a dataset.

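LLM-generated cases often include duplicates, empty fields, or malformed JSON, so a light cleaning pass before they enter the dataset can help. The following is a minimal sketch, assuming the `{ "cases": [{ "question", "answer" }] }` shape requested in the prompt above; `cleanCases` and `TestCase` are hypothetical names, not part of any eval library.

```typescript
// Hypothetical helper: validate and de-duplicate LLM-generated test cases.
interface TestCase {
  question: string;
  answer: string;
}

function cleanCases(raw: string): TestCase[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return []; // the model returned invalid JSON
  }
  if (typeof parsed !== "object" || parsed === null) return [];
  const cases = (parsed as { cases?: unknown }).cases;
  if (!Array.isArray(cases)) return [];

  const seen = new Set<string>();
  const out: TestCase[] = [];
  for (const c of cases) {
    if (typeof c !== "object" || c === null) continue;
    const { question, answer } = c as Record<string, unknown>;
    // Drop entries whose fields are missing or not strings.
    if (typeof question !== "string" || typeof answer !== "string") continue;
    const q = question.trim();
    const a = answer.trim();
    if (q === "" || a === "") continue;
    // Treat questions that differ only in case or padding as duplicates.
    const key = q.toLowerCase();
    if (seen.has(key)) continue;
    seen.add(key);
    out.push({ question: q, answer: a });
  }
  return out;
}
```

In `generateTestCases`, this would replace the bare `JSON.parse(r).cases` with `cleanCases(r)`. Near-duplicate detection (paraphrases) is out of scope here and would need an embedding or similarity check.
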
### Metric types

```
- Exact match (binary): yes / no
- Levenshtein / similarity: 0-1
- BLEU / ROUGE: text similarity
- Semantic similarity: embedding cosine
- LLM-as-judge: 1-5 or binary
- Cost / latency: spend / speed
- Custom: domain-specific
```

### Per-task vs holistic

```
Per-task: score each case → average.
Holistic: overall quality (LLM judge).
→ Use both.
```

### Live eval (production)

```ts
// Sample 1% of production traffic
if (Math.random() < 0.01) {
  await sampleForEval(input, output);
}

// Daily batch eval
const samples = await db.evalSamples.recent(1000);
await runEval(samples);
```

→ Drift detection.

### Eval-driven workflow

```
1. Collect cases (production logs)
2. Define the scoring
3. Write the eval
4. Measure the baseline
5. Change the prompt / model / fine-tune
6. Compare eval results
7. Better → ship. Worse → fix.
```

### Cost-aware eval

```ts
// Model comparison: accuracy vs cost
const results = {
  'gpt-4o': { score: 0.92, cost: 0.005 },
  'gpt-4o-mini': { score: 0.85, cost: 0.0003 },
  'claude-haiku': { score: 0.88, cost: 0.0008 },
};
// Score per dollar ($ / quality)
```

### Anthropic Tool: Skills + Eval

```
.claude/skills/customer-support/eval.yaml
→ Run the eval automatically on every PR.
```

## 🤔 Decision criteria

| Situation | Recommendation |
|---|---|
| OSS / quick start | Promptfoo |
| Agent / complex traces | Braintrust / LangSmith |
| Safety eval | Inspect AI |
| Self-host | Promptfoo |
| Quick A/B | Promptfoo CLI |
| Production observability | LangSmith / Helicone |

## ❌ Anti-patterns

- **Changing things without an eval**: regressions slip through.
- **Only a handful of cases (5)**: high variance. Use 50+.
- **LLM-as-judge with the same model**: self-preference bias.
- **Test set leak (training)**: inflated, misleading scores.
- **Ignoring cost / latency**: optimizing for accuracy alone gets expensive.
- **No CI integration**: drift goes undetected.
- **Evaluating production live data without sampling**: cost.

## 🤖 LLM usage hints

- Promptfoo = quick OSS start.
- Braintrust / LangSmith = production observability.
- Pairwise > absolute.
- Run regression detection in CI.

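The 50+ case guideline and the CI regression check interact: with few cases, a pass rate is too noisy for a fixed 0.05 threshold to mean much. The sketch below is a rough illustration, assuming each case is an independent pass/fail trial and both runs use the same number of cases; `passRateStdErr` and `isRegression` are hypothetical helpers, and the two-standard-error rule is one possible choice rather than an established standard.

```typescript
// Standard error of a pass rate p measured on n independent pass/fail cases.
function passRateStdErr(p: number, n: number): number {
  if (n <= 0 || p < 0 || p > 1) {
    throw new RangeError("need n > 0 and 0 <= p <= 1");
  }
  return Math.sqrt((p * (1 - p)) / n);
}

// Hypothetical rule: flag a regression only when the drop exceeds both a fixed
// tolerance and roughly two standard errors of the difference between runs.
function isRegression(
  baseline: number,
  current: number,
  n: number,
  tolerance = 0.05,
): boolean {
  const seDiff = Math.sqrt(
    passRateStdErr(baseline, n) ** 2 + passRateStdErr(current, n) ** 2,
  );
  return baseline - current > Math.max(tolerance, 2 * seDiff);
}

// At a true pass rate of 0.8, 5 cases give a standard error near 0.18,
// while 50 cases bring it down to about 0.06.
console.log(passRateStdErr(0.8, 5).toFixed(2), passRateStdErr(0.8, 50).toFixed(2));
```

Under these assumptions, a drop from 0.9 to 0.8 on 5 cases is indistinguishable from noise, while the same drop on 1000 cases is flagged.
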
## 🔗 Related documents

- [[AI_LLM_Eval_Patterns]]
- [[AI_Prompt_Engineering_Patterns]]
- [[AI_LLM_Cost_Optimization]]