| id | title | category | status | source_trust_level | verification_status | created_at | updated_at | tags | tech_stack | applied_in | aliases |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ai-eval-framework-deep | LLM Eval Framework — Inspect / Promptfoo / Braintrust | Coding | draft | B | conceptual | 2026-05-09 | 2026-05-09 | | | | |
LLM Eval Framework
Eval-driven development. Inspect AI (UK AISI), Promptfoo (OSS), Braintrust (managed), LangSmith (LangChain). Dataset + scorer + comparison.
📖 Core concepts
- Dataset: input + expected output.
- Scorer: grades an output (exact / similarity / LLM judge).
- Run: model × prompt × dataset.
- Trace: the execution record of each case.
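The four concepts above fit together in a few lines. The following is a minimal sketch, not any framework's API: `Case`, `exact_match`, `run_eval`, and `fake_model` are hypothetical names, and `fake_model` stands in for a real LLM call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    """One dataset row: an input and the expected output."""
    input: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Scorer: 1.0 if the normalized strings are equal, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(model: Callable[[str], str],
             dataset: list[Case],
             scorer: Callable[[str, str], float]) -> dict:
    """One run: call the model on every case, score it, keep a per-case trace."""
    trace = []
    for case in dataset:
        output = model(case.input)
        trace.append({
            "input": case.input,
            "output": output,
            "score": scorer(output, case.expected),
        })
    mean = sum(t["score"] for t in trace) / len(trace) if trace else 0.0
    return {"mean_score": mean, "trace": trace}


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM: only knows one answer."""
    return {"Capital of France?": "Paris"}.get(prompt, "unknown")


dataset = [Case("Capital of France?", "Paris"), Case("Capital of Korea?", "Seoul")]
result = run_eval(fake_model, dataset, exact_match)
print(result["mean_score"], [t["score"] for t in result["trace"]])
```

Swapping `fake_model` for a real client and `exact_match` for another scorer leaves the loop unchanged, which is roughly the abstraction the frameworks below package up.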
💻 Code patterns
Inspect AI (Python, UK AISI)
from inspect_ai import Task, task, eval
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate

@task
def my_eval():
    return Task(
        dataset=[
            Sample(input='Capital of France?', target='Paris'),
            Sample(input='Capital of Korea?', target='Seoul'),
        ],
        # `solver` replaces the older, deprecated `plan` argument
        solver=[generate()],
        scorer=match(),
    )

# Run the task against one model
eval(my_eval(), model='anthropic/claude-opus-4-7')
→ Strong for AI safety evaluations.
Promptfoo (TS / OSS)
# promptfooconfig.yaml
description: "Customer support eval"
prompts:
  - "Answer the customer's question concisely:\n{{question}}"
providers:
  - openai:gpt-4o
  - openai:gpt-4o-mini
  - anthropic:claude-opus-4-7
  - anthropic:claude-haiku-4-5
tests:
  - vars: { question: "How do I reset my password?" }
    assert:
      - type: contains
        value: "/forgot-password"
      - type: llm-rubric
        value: "Provides clear step-by-step instructions"
      - type: latency
        threshold: 3000   # milliseconds
      - type: cost
        threshold: 0.005  # USD
  - vars: { question: "Refund policy?" }
    assert:
      - type: contains-any
        value: ["30 days", "money back", "refund"]
defaultTest:
  options:
    cache: true
promptfoo eval
promptfoo view # compare runs in the web UI
Promptfoo programmatic
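import { evaluate } from 'promptfoo';

const result = await evaluate({
  prompts: ['Answer: {{q}}'],
  providers: ['openai:gpt-4o'],
  tests: [
    { vars: { q: 'capital of France' }, assert: [{ type: 'contains', value: 'Paris' }] },
  ],
});
// `results` is an array of per-case results, so count the passes explicitly.
// The exact summary shape may differ between promptfoo versions.
const passed = result.results.filter((r) => r.success).length;
console.log(passed, '/', result.results.length);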
Braintrust (managed, modern)
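import { Eval } from 'braintrust';
import { Levenshtein, LLMClassifierFromTemplate } from 'autoevals';
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

await Eval('My Project', {
  data: () => [
    { input: 'Capital of France?', expected: 'Paris' },
    { input: 'Capital of Korea?', expected: 'Seoul' },
  ],
  task: async (input) => {
    const r = await openai.chat.completions.create({...}); // request body elided in the original
    return r.choices[0].message.content!;
  },
  scores: [
    Levenshtein,
    // autoevals builds LLM graders from a prompt template plus a choice-to-score map
    LLMClassifierFromTemplate({
      name: 'CorrectCity',
      promptTemplate:
        'Question: {{input}}\nAnswer: {{output}}\nDoes the answer contain the correct city ({{expected}})? Reply Y or N.',
      choiceScores: { Y: 1, N: 0 },
      model: 'gpt-4o',
    }),
  ],
});
→ Automatic web UI, run comparison, and regression detection.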
LangSmith (LangChain)
import { Client } from 'langsmith';
import { evaluate } from 'langsmith/evaluation';

const client = new Client();

// Dataset: create it once, then add examples by dataset name
await client.createDataset('capitals');
await client.createExamples({
  inputs: [{ question: 'Capital?' }],
  outputs: [{ answer: 'Paris' }],
  datasetName: 'capitals',
});

// Run + automatic tracing (myAgent and exactMatch are defined elsewhere)
await evaluate(myAgent, {
  data: 'capitals',
  evaluators: [exactMatch],
});
LLM-as-judge (rubric)
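// `llm` is a placeholder client; `complete` is assumed to return the raw JSON string.
async function judge(input: string, output: string, criteria: string) {
  const r = await llm.complete({
    system: `You are a strict evaluator. Score 1-5 based on criteria.
Output JSON: { "score": N, "reason": "..." }`,
    user: `Input: ${input}\nOutput: ${output}\nCriteria: ${criteria}`,
    response_format: { type: 'json_object' },
  });
  return JSON.parse(r);
}

// Braintrust scorers are plain functions: pass this one in the `scores` array of Eval(...).
// Scores are expected in the 0-1 range, so the 1-5 rubric is rescaled.
const helpful = async ({ input, output }: { input: string; output: string }) => {
  const { score, reason } = await judge(input, output, 'Is it helpful and concise?');
  return { name: 'helpful', score: (score - 1) / 4, metadata: { reason } };
};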
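A judge model does not always return well-formed JSON or an in-range score, so it can help to validate the reply before it enters the results. The sketch below assumes the `{ "score": N, "reason": "..." }` shape requested in the judge prompt; `parse_judge_output` is a hypothetical helper, not part of any framework.

```python
import json


def parse_judge_output(raw: str) -> dict:
    """Validate a judge reply shaped like {"score": 1-5, "reason": "..."}.

    Returns the raw score plus a 0-1 normalized score; raises ValueError otherwise.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge returned non-JSON output: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")

    score = data.get("score")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError(f"score must be an integer from 1 to 5, got {score!r}")

    return {
        "score": score,
        "normalized": (score - 1) / 4,  # map 1..5 onto 0..1
        "reason": str(data.get("reason", "")),
    }


parsed = parse_judge_output('{"score": 4, "reason": "clear and concise"}')
print(parsed["normalized"])
```

Rejecting malformed replies, rather than silently scoring them as zero, keeps judge failures distinguishable from genuinely bad outputs.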
Pairwise (A vs B)
async function pairwise(input: string, outA: string, outB: string) {
  const r = await llm.complete({
    user: `Compare A and B for query "${input}".\nA: ${outA}\nB: ${outB}\nWhich is better? JSON: { "winner": "A"|"B"|"tie", "reason": "..." }`,
    response_format: { type: 'json_object' },
  });
  return JSON.parse(r);
}
→ Pairwise comparison tends to align with human judgment better than absolute scores.
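Pairwise judges can favor whichever answer appears first. One common mitigation is to ask twice with the order swapped and accept a winner only when both verdicts agree. The sketch below assumes a judge callable returning "A", "B", or "tie"; `debiased_pairwise` and the two demo judges are hypothetical.

```python
from typing import Callable

# (query, answer shown in slot A, answer shown in slot B) -> "A" | "B" | "tie"
Judge = Callable[[str, str, str], str]


def debiased_pairwise(judge: Judge, query: str, out_a: str, out_b: str) -> str:
    """Run the judge in both orders; return a winner only if the verdicts agree."""
    first = judge(query, out_a, out_b)   # out_a sits in slot A
    second = judge(query, out_b, out_a)  # out_a sits in slot B
    # Translate the second verdict back to the original labels.
    second_mapped = {"A": "B", "B": "A", "tie": "tie"}[second]
    return first if first == second_mapped else "tie"


def always_first(query: str, first: str, second: str) -> str:
    """Demo judge with pure position bias: always picks slot A."""
    return "A"


def prefers_longer(query: str, first: str, second: str) -> str:
    """Demo judge with a consistent (order-independent) preference."""
    if len(first) == len(second):
        return "tie"
    return "A" if len(first) > len(second) else "B"


print(debiased_pairwise(always_first, "q", "short", "a longer answer"))
print(debiased_pairwise(prefers_longer, "q", "short", "a longer answer"))
```

The position-biased judge collapses to a tie, while the consistent judge keeps its verdict. The cost is two judge calls per comparison.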
Regression detection
// Compare against the baseline inside CI
const current = await runEval();
const baseline = await loadBaseline();
// Fail the build when the score drops more than 0.05 below the baseline
if (current.score < baseline.score - 0.05) {
  console.error(`Regression: ${baseline.score} → ${current.score}`);
  process.exit(1);
}
# CI
- name: LLM eval
  run: promptfoo eval --output report.json
- name: Compare to baseline
  run: node scripts/regression-check.js report.json
Trace + debug
// LangSmith / Braintrust trace
// Every LLM call's input / output / tokens / latency / cost is recorded automatically
// Failed cases can be inspected step by step in the web UI
Diverse dataset
- Edge cases (empty, very long, special chars)
- Adversarial (prompt injection)
- Multilingual
- Real production logs (sampled)
- Synthetic (LLM-generated)
Synthetic data
async function generateTestCases(n: number) {
  const r = await llm.complete({
    user: `Generate ${n} customer support questions and ideal answers.
Output JSON: { "cases": [{ "question": "...", "answer": "..." }] }`,
    response_format: { type: 'json_object' },
  });
  return JSON.parse(r).cases;
}
→ A fast way to bootstrap a dataset.
Metric types
- Exact match (binary): yes / no
- Levenshtein / similarity: 0-1
- BLEU / ROUGE: n-gram overlap with a reference text
- Semantic similarity: embedding cosine
- LLM-as-judge: 1-5 or binary
- Cost / latency: spend / speed
- Custom: domain-specific
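Two of the metrics in the list above are small enough to write out. The following sketch uses only the standard library: Levenshtein similarity is edit distance normalized to 0-1, and cosine similarity works on embedding vectors that are assumed to come from whichever embedding model is in use.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Normalize edit distance to 0-1, where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 0.0 if norm == 0 else dot / norm


print(levenshtein("kitten", "sitting"))
print(levenshtein_similarity("Paris", "Paris"))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

String similarity rewards surface overlap, so a correct paraphrase can score low. That is the usual reason to pair it with a semantic or judge-based metric.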
Per-task vs holistic
Per-task: score each case, then average.
Holistic: overall quality (LLM judge).
→ Use both.
Live eval (production)
// Sample 1% of production traffic
if (Math.random() < 0.01) {
  await sampleForEval(input, output);
}
// Daily batch eval over the most recent samples
const samples = await db.evalSamples.recent(1000);
await runEval(samples);
→ Drift detection.
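Random sampling works, but it is not reproducible: the same request may be sampled on one replay and skipped on the next. A hash-based alternative makes the decision deterministic per request. This is a sketch under the assumption that each request carries a stable identifier; `should_sample` and `request_id` are hypothetical names.

```python
import hashlib


def should_sample(request_id: str, rate: float = 0.01) -> bool:
    """Deterministically select roughly `rate` of requests based on their ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


# The same ID always yields the same decision.
print(should_sample("req-42") == should_sample("req-42"))

# Across many IDs the selected share approaches the requested rate.
hits = sum(should_sample(f"req-{i}", 0.5) for i in range(2000))
print(hits / 2000)
```

Because the decision depends only on the ID, every service that sees the request agrees on whether it belongs to the eval sample.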
Eval-driven workflow
1. Collect cases (production logs)
2. Label / score them
3. Write the eval
4. Measure the baseline
5. Change the prompt / model / fine-tune
6. Compare eval results
7. Better → ship. Worse → fix.
Cost-aware eval
// Model comparison: accuracy vs. cost
const results = {
  'gpt-4o': { score: 0.92, cost: 0.005 },
  'gpt-4o-mini': { score: 0.85, cost: 0.0003 },
  'claude-haiku': { score: 0.88, cost: 0.0008 },
};
// Derive a cost-per-quality figure from these
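One way to act on a score/cost table like the one above is to pick the cheapest model that clears a quality bar, and to discard any model that another one beats on both axes. The sketch below reuses the illustrative numbers from the table and assumes `cost` is USD per request, which the original does not state; both function names are hypothetical.

```python
results = {
    "gpt-4o": {"score": 0.92, "cost": 0.005},
    "gpt-4o-mini": {"score": 0.85, "cost": 0.0003},
    "claude-haiku": {"score": 0.88, "cost": 0.0008},
}


def cheapest_meeting(results: dict, min_score: float):
    """Name of the cheapest model whose score is at least min_score, or None."""
    ok = {name: r for name, r in results.items() if r["score"] >= min_score}
    return min(ok, key=lambda name: ok[name]["cost"]) if ok else None


def pareto_frontier(results: dict) -> list[str]:
    """Models that no other model matches or beats on both score and cost."""
    frontier = []
    for name, r in results.items():
        dominated = any(
            o["score"] >= r["score"] and o["cost"] <= r["cost"]
            and (o["score"] > r["score"] or o["cost"] < r["cost"])
            for other, o in results.items() if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier


print(cheapest_meeting(results, 0.87))
print(pareto_frontier(results))
```

With these numbers every model sits on the frontier, so the choice depends entirely on the quality bar. A single score-per-dollar ratio would hide that trade-off.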
Anthropic Tool — Skills + Eval
.claude/skills/customer-support/eval.yaml
→ Every PR triggers an automatic eval.
🤔 Decision criteria
| Situation | Recommendation |
|---|---|
| OSS / quick start | Promptfoo |
| Agent / complex traces | Braintrust / LangSmith |
| Safety eval | Inspect AI |
| Self-host | Promptfoo |
| Quick A/B | Promptfoo CLI |
| Production observability | LangSmith / Helicone |
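❌ Anti-patterns
- Changes without an eval: regressions go unnoticed.
- Only a handful of cases (about 5): high variance. Use 50+.
- Judging with the same model that produced the output: self-preference bias.
- Test set leaked into training: inflated scores.
- Ignoring cost / latency: optimizing for accuracy alone gets expensive.
- No CI integration: drift goes undetected.
- Evaluating production traffic without sampling: cost.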
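The variance point in the list above can be made concrete with a confidence interval on the pass rate. The sketch below uses the Wilson score interval at 95% (z = 1.96) as one reasonable choice; `wilson_interval` is a hypothetical helper and the pass counts are illustrative.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate; z=1.96 gives roughly 95% coverage."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Same observed 80% pass rate, very different certainty.
lo5, hi5 = wilson_interval(4, 5)
lo50, hi50 = wilson_interval(40, 50)
print(f"n=5:  {lo5:.2f} to {hi5:.2f}")
print(f"n=50: {lo50:.2f} to {hi50:.2f}")
```

With five cases the interval spans most of the scale, so a change in pass rate between two runs says little. At fifty cases it is narrow enough to support a regression threshold.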
🤖 Hints for LLM use
- Promptfoo = OSS quick start.
- Braintrust / LangSmith = production observability.
- Pairwise > absolute scoring.
- Run regression detection in CI.