2026-05-09 21:08:02 +09:00


id: ai-eval-framework-deep
title: LLM Eval Framework - Inspect / Promptfoo / Braintrust
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: ai, llm, eval, framework, vibe-coding
tech_stack:
  language: TS / Python
  applicable_to: Backend
applied_in:
aliases: Inspect AI, Promptfoo, Braintrust, LangSmith, Helicone, eval-driven development

LLM Eval Framework

Eval-driven development with Inspect AI (UK AISI), Promptfoo (OSS), Braintrust (managed), and LangSmith (LangChain). Each tool combines the same three pieces: a dataset, a scorer, and a comparison across runs.

📖 Core concepts

  • Dataset: inputs paired with expected outputs.
  • Scorer: grades each output (exact match / similarity / LLM judge).
  • Run: one model × prompt × dataset combination.
  • Trace: the execution record of each case.
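
The four concepts above fit together in a few lines of plain Python. This is a minimal sketch, not any framework's API: `Case`, `run_eval`, and `fake_model` are hypothetical names, and `fake_model` stands in for a real LLM call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    """One dataset row: an input and the expected output."""
    input: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Scorer: 1.0 when the normalized strings are equal, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(
    model: Callable[[str], str],
    dataset: list[Case],
    scorer: Callable[[str, str], float],
) -> dict:
    """One run: call the model on every case, score it, keep a per-case trace."""
    trace = []
    for case in dataset:
        output = model(case.input)
        trace.append(
            {"input": case.input, "output": output, "score": scorer(output, case.expected)}
        )
    mean = sum(t["score"] for t in trace) / len(trace) if trace else 0.0
    return {"mean_score": mean, "trace": trace}


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; it only knows one answer."""
    return {"Capital of France?": "Paris"}.get(prompt, "unknown")


dataset = [Case("Capital of France?", "Paris"), Case("Capital of Korea?", "Seoul")]
result = run_eval(fake_model, dataset, exact_match)
print(result["mean_score"], [t["score"] for t in result["trace"]])
```

The frameworks below add the parts this sketch leaves out: provider clients, caching, a web UI, and stored traces.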

💻 Code patterns

Inspect AI (Python, UK AISI)

from inspect_ai import Task, task, eval
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate

@task
def my_eval():
    return Task(
        dataset=[
            Sample(input='Capital of France?', target='Paris'),
            Sample(input='Capital of Korea?', target='Seoul'),
        ],
        solver=generate(),  # `plan` is the older name for this argument
        scorer=match(),
    )

# Run
eval(my_eval(), model='anthropic/claude-opus-4-7')

→ Strong for AI safety evaluations.

Promptfoo (TS / OSS)

# promptfooconfig.yaml
description: "Customer support eval"

prompts:
  - "Answer the customer's question concisely:\n{{question}}"

providers:
  - openai:gpt-4o
  - openai:gpt-4o-mini
  - anthropic:messages:claude-opus-4-7
  - anthropic:messages:claude-haiku-4-5

tests:
  - vars: { question: "How do I reset my password?" }
    assert:
      - type: contains
        value: "/forgot-password"
      - type: llm-rubric
        value: "Provides clear step-by-step instructions"
      - type: latency
        threshold: 3000
      - type: cost
        threshold: 0.005
  - vars: { question: "Refund policy?" }
    assert:
      - type: contains-any
        value: ["30 days", "money back", "refund"]

defaultTest:
  options:
    cache: true

# Shell
promptfoo eval
promptfoo view  # compare results in the web UI

Promptfoo programmatic

import { evaluate } from 'promptfoo';

const result = await evaluate({
  prompts: ['Answer: {{q}}'],
  providers: ['openai:gpt-4o'],
  tests: [
    { vars: { q: 'capital of France' }, assert: [{ type: 'contains', value: 'Paris' }] },
  ],
});

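// `results` is a per-case array; count passes from each case's `success` flag.
// The exact summary shape may differ between promptfoo versions.
const passed = result.results.filter((r) => r.success).length;
console.log(passed, '/', result.results.length);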

Braintrust (managed, modern)

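import { Eval } from 'braintrust';
import { Levenshtein, LLMClassifierFromTemplate } from 'autoevals';
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

await Eval('My Project', {
  data: () => [
    { input: 'Capital of France?', expected: 'Paris' },
    { input: 'Capital of Korea?', expected: 'Seoul' },
  ],
  task: async (input) => {
    // Request options (model, messages) are elided.
    const r = await openai.chat.completions.create({...});
    return r.choices[0].message.content!;
  },
  scores: [
    Levenshtein,
    // LLM judge built from a prompt template; each choice maps to a score.
    LLMClassifierFromTemplate({
      name: 'CorrectCity',
      promptTemplate:
        'Question: {{input}}\nAnswer: {{output}}\nDoes the answer contain the correct city? (Y/N)',
      choiceScores: { Y: 1, N: 0 },
      model: 'gpt-4o',
    }),
  ],
});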

→ Automatic web UI, run comparison, and regression detection.

LangSmith (LangChain)

import { Client } from 'langsmith';
import { evaluate } from 'langsmith/evaluation';

const client = new Client();

// Dataset: create it once, then attach examples by its id
const dataset = await client.createDataset('capitals');
await client.createExamples({
  inputs: [{ question: 'Capital?' }],
  outputs: [{ answer: 'Paris' }],
  datasetId: dataset.id,
});

// Run + auto trace (myAgent and exactMatch are defined elsewhere)
await evaluate(myAgent, {
  data: 'capitals',
  evaluators: [exactMatch],
});

LLM-as-judge (rubric)

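// `llm` is a placeholder client whose complete() resolves to the raw JSON string.
async function judge(input: string, output: string, criteria: string) {
  const r = await llm.complete({
    system: `You are a strict evaluator. Score 1-5 based on criteria.
Output JSON: { "score": N, "reason": "..." }`,
    user: `Input: ${input}\nOutput: ${output}\nCriteria: ${criteria}`,
    response_format: { type: 'json_object' },
  });
  return JSON.parse(r);
}

// Custom scorer: pass `helpful` in the `scores` array of Eval(...).
// Braintrust scores are 0-1, so the 1-5 rubric is rescaled.
const helpful = async ({ input, output }) => {
  const { score, reason } = await judge(input, output, 'Is it helpful and concise?');
  return { name: 'helpful', score: (score - 1) / 4, metadata: { reason } };
};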
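
A judge model does not always return well-formed output, so it can help to validate the reply before trusting the score. The sketch below assumes the `{ "score": N, "reason": "..." }` shape from the rubric prompt above; `parse_judgment` is a hypothetical helper, not part of any framework.

```python
import json


def parse_judgment(raw: str, lo: int = 1, hi: int = 5) -> dict:
    """Validate a judge reply and rescale its integer score to the 0-1 range.

    Raises ValueError when the reply is not JSON, is not an object,
    or carries a score outside [lo, hi].
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge did not return JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("judge reply is not a JSON object")

    score = data.get("score")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not lo <= score <= hi:
        raise ValueError(f"score missing, non-integer, or out of range: {score!r}")

    return {
        "score": (score - lo) / (hi - lo),
        "raw_score": score,
        "reason": str(data.get("reason", "")),
    }


judgment = parse_judgment('{"score": 4, "reason": "clear and concise"}')
print(judgment["score"], judgment["reason"])
```

Cases that raise can be retried or logged as judge failures instead of being counted as a low score.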

Pairwise (A vs B)

async function pairwise(input: string, outA: string, outB: string) {
  const r = await llm.complete({
    user: `Compare A and B for query "${input}".\nA: ${outA}\nB: ${outB}\nWhich is better? JSON: { "winner": "A"|"B"|"tie", "reason": "..." }`,
    response_format: { type: 'json_object' },
  });
  return JSON.parse(r);
}

→ Pairwise comparison tends to align with human judgment better than absolute scores.
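
Pairwise verdicts still need to be aggregated over the whole dataset. A minimal sketch, assuming each case yields one of the three `winner` values from the prompt above; `win_rate` is a hypothetical helper, and counting a tie as half a win for each side is one common convention rather than the only option.

```python
from collections import Counter


def win_rate(verdicts: list[str]) -> dict:
    """Summarize pairwise verdicts ("A", "B", or "tie") for system A vs system B.

    A tie counts as half a win for each side.
    """
    if not verdicts:
        raise ValueError("no verdicts to aggregate")
    counts = Counter(verdicts)
    unknown = set(counts) - {"A", "B", "tie"}
    if unknown:
        raise ValueError(f"unexpected verdict labels: {sorted(unknown)}")

    n = len(verdicts)
    a_rate = (counts["A"] + 0.5 * counts["tie"]) / n
    return {"A": a_rate, "B": 1 - a_rate, "n": n}


summary = win_rate(["A", "A", "B", "tie"])
print(summary)
```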

Regression detection

// Compare against a baseline in CI
const current = await runEval();
const baseline = await loadBaseline();

if (current.score < baseline.score - 0.05) {
  console.error(`Regression: ${baseline.score} → ${current.score}`);
  process.exit(1);
}

# CI workflow steps
- name: LLM eval
  run: promptfoo eval --output report.json
- name: Compare to baseline
  run: node scripts/regression-check.js report.json

Trace + debug

// LangSmith / Braintrust trace
// Input / output / tokens / latency / cost of every LLM call are recorded automatically

// Failed case → inspect it step by step in the web UI

Diverse dataset

- Edge cases (empty, very long, special chars)
- Adversarial (prompt injection)
- Multilingual
- Real production logs (sampled)
- Synthetic (LLM-generated)

Synthetic data

async function generateTestCases(n: number) {
  const r = await llm.complete({
    user: `Generate ${n} customer support questions and ideal answers.
Output JSON: { "cases": [{ "question": "...", "answer": "..." }] }`,
    response_format: { type: 'json_object' },
  });
  return JSON.parse(r).cases;
}

→ A quick way to bootstrap a dataset.

Metric types

- Exact match (binary): yes / no
- Levenshtein / similarity: 0-1
- BLEU / ROUGE: text similarity
- Semantic similarity: embedding cosine
- LLM-as-judge: 1-5 or binary
- Cost / latency: spend / speed
- Custom: domain-specific
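
Three of the metrics in the list can be written without any dependency, which makes their ranges concrete. This is an illustrative sketch using only the standard library; in practice the vectors passed to `cosine_similarity` would come from an embedding model, which is not shown here.

```python
import math


def exact_match(output: str, expected: str) -> float:
    """Binary metric: 1.0 on an exact string match, else 0.0."""
    return 1.0 if output == expected else 0.0


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(
                min(
                    prev[j] + 1,               # delete ca
                    curr[j - 1] + 1,           # insert cb
                    prev[j - 1] + (ca != cb),  # substitute (free when equal)
                )
            )
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to 0-1, where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


print(levenshtein("kitten", "sitting"))
print(levenshtein_similarity("Paris", "Paris"))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```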

Per-task vs holistic

Per-task:    score each case, then average.
Holistic:    overall quality (LLM judge).

→ Use both.

Live eval (production)

// Sample 1% of production traffic
if (Math.random() < 0.01) {
  await sampleForEval(input, output);
}

// Daily batch eval over recent samples
const samples = await db.evalSamples.recent(1000);
await runEval(samples);

→ Detects drift.
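
One variation on the random 1% draw is to decide from a hash of a request identifier, so the same request is always in or out of the sample and a daily batch can be reproduced. This swaps in a different technique from the `Math.random()` check above; `should_sample` and the `req-…` ids are hypothetical.

```python
import hashlib


def should_sample(request_id: str, rate: float = 0.01) -> bool:
    """Deterministically select roughly `rate` of requests by hashing their id."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


sampled = [rid for rid in (f"req-{i}" for i in range(10_000)) if should_sample(rid)]
print(len(sampled), "of 10000 requests selected")
```

The selected count varies with the ids but should land near 1% of traffic.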

Eval-driven workflow

1. Collect cases (production logs)
2. Grade them to get reference scores
3. Write the eval
4. Measure the baseline
5. Change the prompt / model / fine-tune
6. Compare eval results
7. Better → ship. Worse → fix.
7. Better → ship. Worse → fix.
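
Steps 6 and 7 can be sketched as a per-case comparison between the baseline run and the candidate run. The helper name `compare_runs` and the case ids are hypothetical; the point of listing regressed cases is that two runs can have the same average while failing on different inputs.

```python
def compare_runs(baseline: dict[str, float], candidate: dict[str, float]) -> dict:
    """Compare per-case scores for the cases both runs share."""
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("runs have no cases in common")

    base_mean = sum(baseline[k] for k in shared) / len(shared)
    cand_mean = sum(candidate[k] for k in shared) / len(shared)
    return {
        "baseline_mean": base_mean,
        "candidate_mean": cand_mean,
        "improved": [k for k in shared if candidate[k] > baseline[k]],
        "regressed": [k for k in shared if candidate[k] < baseline[k]],
        # Step 7: ship only when the candidate is strictly better on average.
        "ship": cand_mean > base_mean,
    }


baseline = {"c1": 1.0, "c2": 0.0, "c3": 1.0}
candidate = {"c1": 1.0, "c2": 1.0, "c3": 0.0}
report = compare_runs(baseline, candidate)
print(report["ship"], report["improved"], report["regressed"])
```

Here the averages are equal, so the candidate is not shipped, and the report still shows that case `c3` regressed while `c2` improved.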

Cost-aware eval

// Model comparison: accuracy vs cost
const results = {
  'gpt-4o': { score: 0.92, cost: 0.005 },
  'gpt-4o-mini': { score: 0.85, cost: 0.0003 },
  'claude-haiku': { score: 0.88, cost: 0.0008 },
};

// Cost per quality point ($ / score)
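
Two simple ways to turn the numbers above into a decision are a score-per-dollar ratio and "cheapest model that clears a quality bar". The sketch reuses the figures from the snippet above and assumes `cost` is dollars per request, which the note does not state; both helper names are hypothetical.

```python
results = {
    "gpt-4o": {"score": 0.92, "cost": 0.005},
    "gpt-4o-mini": {"score": 0.85, "cost": 0.0003},
    "claude-haiku": {"score": 0.88, "cost": 0.0008},
}


def score_per_dollar(results: dict) -> dict:
    """Quality bought per unit of cost; higher is better."""
    return {name: r["score"] / r["cost"] for name, r in results.items()}


def cheapest_meeting(results: dict, min_score: float):
    """Cheapest model whose score is at least `min_score`, or None if none qualifies."""
    eligible = [(r["cost"], name) for name, r in results.items() if r["score"] >= min_score]
    return min(eligible)[1] if eligible else None


ratios = score_per_dollar(results)
print(max(ratios, key=ratios.get))
print(cheapest_meeting(results, min_score=0.87))
```

The ratio alone favors the cheapest model, so a minimum score threshold is usually the more useful of the two when quality has a hard floor.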

Anthropic Tool — Skills + Eval

.claude/skills/customer-support/eval.yaml
→ Every PR triggers an automatic eval.

🤔 Decision criteria

Situation | Recommendation
OSS / quick start | Promptfoo
Agents / complex traces | Braintrust / LangSmith
Safety eval | Inspect AI
Self-host | Promptfoo
Quick A/B | Promptfoo CLI
Production observability | LangSmith / Helicone

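Anti-patterns

  • Changes without an eval: regressions go unnoticed.
  • Too few cases (around 5): high variance. Use 50+.
  • LLM-as-judge with the same model that produced the output: self-preference bias.
  • Test set leaked into training data: inflated scores.
  • Ignoring cost / latency: optimizing accuracy alone gets expensive.
  • No CI integration: drift goes undetected.
  • Evaluating live production data without sampling: cost.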
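
The "5 cases vs 50+" point can be made concrete with a confidence interval on the pass rate. The sketch below uses the Wilson score interval at roughly 95% confidence (z = 1.96), which is a choice made here for illustration; `pass_rate_interval` is a hypothetical helper.

```python
import math


def pass_rate_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate of `passed` out of `total` cases."""
    if total <= 0 or not 0 <= passed <= total:
        raise ValueError("need 0 <= passed <= total and total > 0")
    p = passed / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


# Same observed pass rate (80%), very different certainty.
small = pass_rate_interval(4, 5)
large = pass_rate_interval(40, 50)
print("5 cases: ", small)
print("50 cases:", large)
```

With 5 cases the interval is much wider than with 50, so a difference of a few points between two prompts is not distinguishable from noise.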

🤖 Hints for LLM use

  • Promptfoo = OSS, quick start.
  • Braintrust / LangSmith = production observability.
  • Pairwise > absolute.
  • Run regression detection in CI.

🔗 Related documents