2nd/10_Wiki/Topics/Coding/AI_LLM_Eval_Patterns.md
2026-05-09 21:08:02 +09:00


id: ai-llm-eval-patterns
title: "LLM Evaluation: Golden Set / LLM-as-Judge / Regression"
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: ai, llm, eval, testing, vibe-coding
tech_stack:
  language: TS / Python
  applicable_to: Backend
applied_in:
aliases: LLM eval, golden dataset, LLM-as-judge, regression, Promptfoo, Braintrust

LLM Evaluation

"It feels better" is not a measurement. Use a golden dataset with automated scoring, so that regressions are caught whenever the prompt or the model changes. Tools: Promptfoo / Braintrust / LangSmith.

📖 Key concepts

  • Golden set: input + expected output 쌍.
  • Metric: exact match / similarity / structured / LLM-as-judge.
  • Eval = unit test for the LLM. Run it on every PR.
  • LLM-as-judge: when the correct answer is free-form, a different LLM does the scoring.
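
The exact-match and similarity metrics listed above can be sketched as small pure functions. This is a minimal illustration, not a library API: `normalize`, `exactMatch`, and `jaccard` are hypothetical names, and word-set Jaccard overlap is just one possible choice of similarity measure.

```typescript
// Hypothetical helpers for illustration; none of these come from a library.

/** Lowercase, drop common punctuation, and collapse whitespace before comparing. */
function normalize(text: string): string {
  return text
    .toLowerCase()
    .replace(/[.,!?;:"']/g, "")
    .replace(/\s+/g, " ")
    .trim();
}

/** Exact match after normalization: 1 for a match, 0 otherwise. */
function exactMatch(output: string, expected: string): number {
  return normalize(output) === normalize(expected) ? 1 : 0;
}

/** Set of distinct words in the normalized text (empty for blank input). */
function wordSet(text: string): Set<string> {
  return new Set(normalize(text).split(" ").filter(Boolean));
}

/** Jaccard similarity over word sets: |A ∩ B| / |A ∪ B|, in [0, 1]. */
function jaccard(output: string, expected: string): number {
  const a = wordSet(output);
  const b = wordSet(expected);
  if (a.size === 0 && b.size === 0) return 1; // two empty texts are identical
  let shared = 0;
  a.forEach((word) => {
    if (b.has(word)) shared++;
  });
  return shared / (a.size + b.size - shared);
}
```

Exact match suits short factual answers; a similarity score with a threshold tolerates rewording, at the price of having to pick that threshold.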

💻 Code patterns

Simple homegrown eval

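// callLLM is assumed to be an async (prompt: string) => Promise<string> wrapper around your model client.
const cases = [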
  { input: '2+2', expected: '4' },
  { input: 'capital of France', expected: 'Paris' },
];

let pass = 0;
for (const c of cases) {
  const out = await callLLM(c.input);
  if (out.includes(c.expected)) pass++;
  else console.log('FAIL', c.input, '→', out);
}
console.log(`${pass}/${cases.length}`);
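
The loop above can be wrapped in a reusable runner that returns a pass rate and the list of failures. The sketch below is one way to do it, under assumptions: `runEval`, `EvalCase`, and `EvalReport` are hypothetical names, and the model is passed in as a function so the runner can be exercised with a stub instead of a real API call.

```typescript
type EvalCase = { input: string; expected: string };
type EvalReport = { passed: number; total: number; passRate: number; failures: string[] };

// `model` is any async function from prompt to text (an assumption of this
// sketch); injecting it keeps the runner independent of a specific provider.
async function runEval(
  cases: EvalCase[],
  model: (input: string) => Promise<string>
): Promise<EvalReport> {
  const failures: string[] = [];
  for (const c of cases) {
    const out = await model(c.input);
    // Case-insensitive containment, so "paris" and "Paris." both pass.
    if (!out.toLowerCase().includes(c.expected.toLowerCase())) {
      failures.push(`${c.input} -> ${out}`);
    }
  }
  const passed = cases.length - failures.length;
  return {
    passed,
    total: cases.length,
    passRate: cases.length === 0 ? 0 : passed / cases.length,
    failures,
  };
}
```

Returning a report instead of printing makes it straightforward to compare the pass rate against a stored baseline later.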

Promptfoo (yaml)

# promptfooconfig.yaml
prompts:
  - "Answer concisely: {{question}}"

providers:
  - openai:gpt-4o-mini
  - openai:gpt-4o
  - anthropic:messages:claude-haiku-4-5

tests:
  - vars: { question: "Capital of France?" }
    assert:
      - type: contains
        value: "Paris"
      - type: latency
        threshold: 2000   # milliseconds; latency assertions need the cache disabled (--no-cache)
      - type: cost
        threshold: 0.001  # USD per test case

  - vars: { question: "Bank vault security tips" }
    assert:
      - type: llm-rubric
        value: "Lists at least 3 security measures, mentions surveillance"

Run the evaluation:

promptfoo eval

LLM-as-judge

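import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function judge(input: string, output: string, criteria: string): Promise<{ score: number; reason: string }> {
  const r = await openai.chat.completions.create({
    model: 'gpt-4o', // ideally not the same model that produced `output`
    temperature: 0, // keep scoring as repeatable as possible
    messages: [
      { role: 'system', content: 'You are a strict evaluator. Score 0-5. Output JSON: {"score":N,"reason":"..."}' },
      { role: 'user', content: `Input: ${input}\nOutput: ${output}\nCriteria: ${criteria}` },
    ],
    response_format: { type: 'json_object' },
  });
  // JSON mode yields syntactically valid JSON, but the shape is not validated here.
  return JSON.parse(r.choices[0].message.content!);
}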
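
A judge reply is model output too, so it is worth validating before the score is trusted. The sketch below assumes the 0-5 scale from the system prompt above; `parseVerdict` and `Verdict` are hypothetical names introduced only for this example.

```typescript
type Verdict = { score: number; reason: string };

// Returns null when the judge reply is unusable, so the caller can retry or
// count the case as "unscored" instead of crashing the whole run.
function parseVerdict(raw: string): Verdict | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // not valid JSON
  }
  if (typeof data !== "object" || data === null) return null;
  const { score, reason } = data as { score?: unknown; reason?: unknown };
  if (typeof score !== "number" || !Number.isFinite(score)) return null;
  if (score < 0 || score > 5) return null; // outside the 0-5 scale
  return { score, reason: typeof reason === "string" ? reason : "" };
}
```

Rejecting out-of-range scores, instead of clamping them, keeps a misbehaving judge visible in the results.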

Pairwise comparison (A vs B)

// Experiment: compare the outputs of two prompts and decide which one is better
async function pairwise(input: string, outA: string, outB: string) {
  const r = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: `Compare A and B for "${input}".\nA: ${outA}\nB: ${outB}\nWhich is better and why? JSON: {"winner":"A"|"B"|"tie","reason":"..."}` }],
    response_format: { type: 'json_object' },
  });
  return JSON.parse(r.choices[0].message.content!);
}
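
Pairwise judges can favor whichever answer appears first. A common mitigation is to ask twice with the order swapped and keep a winner only when both runs agree. The helper below is a sketch of that reconciliation step; `reconcile` and `Winner` are hypothetical names, and treating any disagreement as a tie is an assumption of this example.

```typescript
type Winner = "A" | "B" | "tie";

// `forward` is the verdict with the outputs in the original (A, B) order.
// `swapped` is the verdict from a second call where the outputs were passed
// in reverse order, so its "A" refers to the original B and vice versa.
function reconcile(forward: Winner, swapped: Winner): Winner {
  // Map the swapped verdict back onto the original labels.
  const unswapped: Winner = swapped === "A" ? "B" : swapped === "B" ? "A" : "tie";
  // Keep a winner only when both orderings agree; otherwise call it a tie.
  return forward === unswapped ? forward : "tie";
}
```

With the `pairwise` function above, this would be called as `reconcile((await pairwise(q, a, b)).winner, (await pairwise(q, b, a)).winner)`, which doubles the judge cost per comparison.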

Structured output validation

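import { Recipe } from './schemas'; // assumed to be a Zod schema (it exposes safeParse)

const out = await callLLM(prompt);
// The model returns a string, so parse the JSON before validating its shape.
const parsed = Recipe.safeParse(JSON.parse(out));
// Log first: a failing expect() throws, so code after it would never run.
if (!parsed.success) console.log(parsed.error);
expect(parsed.success).toBe(true);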

Latency / cost tracking

const start = Date.now();
const r = await openai.chat.completions.create({...});
const ms = Date.now() - start;
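const usage = r.usage!;
// Hard-coded per-token USD prices: 2.5e-6 = $2.50 per 1M prompt tokens,
// 1e-5 = $10 per 1M completion tokens. Update them when the model or pricing changes.
const cost = usage.prompt_tokens * 2.5e-6 + usage.completion_tokens * 1e-5;

// track() stands in for whatever metrics client the project uses.
track('llm.eval', { ms, cost, prompt_tokens: usage.prompt_tokens });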
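
When several models are compared, the inline constants are easier to maintain as a price table. The sketch below is an illustration under assumptions: `PRICES` and `estimateCostUsd` are hypothetical names, and the single `gpt-4o` entry only mirrors the constants used above, so it should be checked against the provider's current price list.

```typescript
type Usage = { prompt_tokens: number; completion_tokens: number };
type Price = { inputPerMTok: number; outputPerMTok: number }; // USD per 1M tokens

// Placeholder prices. The gpt-4o entry mirrors the per-token constants in the
// snippet above (2.5e-6 and 1e-5 USD); verify before relying on it.
const PRICES: Record<string, Price> = {
  "gpt-4o": { inputPerMTok: 2.5, outputPerMTok: 10 },
};

function estimateCostUsd(model: string, usage: Usage): number {
  const price = PRICES[model];
  // Fail loudly rather than silently reporting a cost of 0 for unknown models.
  if (!price) throw new Error(`No price configured for model: ${model}`);
  return (
    (usage.prompt_tokens * price.inputPerMTok +
      usage.completion_tokens * price.outputPerMTok) / 1e6
  );
}
```

For example, 1,000 prompt tokens and 500 completion tokens on the `gpt-4o` entry come to roughly $0.0075.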

CI regression check

# .github/workflows/llm-eval.yml
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm i
      - run: npx promptfoo eval --output report.json
      - run: node scripts/check-regression.js report.json
        # fail if the score drops 5% or more below the baseline
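
The note does not show `scripts/check-regression.js`, so the following is only a sketch of the comparison it might perform. It assumes the 5% threshold is a relative drop in pass rate; `isRegression` is a hypothetical name, and reading the pass rate out of `report.json` is left out because the report layout is not shown here.

```typescript
// Returns true when `current` has fallen by at least `maxRelativeDrop`
// relative to `baseline` (0.05 = 5%). Both values are pass rates in [0, 1].
function isRegression(
  baseline: number,
  current: number,
  maxRelativeDrop: number = 0.05
): boolean {
  if (baseline <= 0) return false; // nothing to regress from
  return (baseline - current) / baseline >= maxRelativeDrop;
}
```

A CI script would load the stored baseline and the current pass rate, then exit with a non-zero status when `isRegression(baseline, current)` is true so the job fails.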

🤔 Decision criteria

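| Output type | Scoring |
| --- | --- |
| Exact answer | exact match / contains |
| JSON / structure | schema parse |
| Classification | accuracy / F1 |
| Free text | LLM-as-judge / ROUGE / BLEU |
| Comparison (which is better?) | pairwise A/B |
| Real user signals | thumbs up/down / re-ask rate |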
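
For the classification row, accuracy and F1 can be computed directly from predicted and expected labels. The sketch below covers the binary case only; `binaryMetrics` is a hypothetical name, and returning 0 when a ratio is undefined (no positives predicted or present) is a convention chosen for this example.

```typescript
type BinaryMetrics = { accuracy: number; precision: number; recall: number; f1: number };

// predicted[i] and actual[i] are true when case i belongs to the positive class.
function binaryMetrics(predicted: boolean[], actual: boolean[]): BinaryMetrics {
  if (predicted.length !== actual.length) throw new Error("length mismatch");
  let tp = 0;
  let fp = 0;
  let fn = 0;
  let tn = 0;
  predicted.forEach((p, i) => {
    if (p && actual[i]) tp++;
    else if (p && !actual[i]) fp++;
    else if (!p && actual[i]) fn++;
    else tn++;
  });
  // Undefined ratios (zero denominators) are reported as 0.
  const precision = tp + fp === 0 ? 0 : tp / (tp + fp);
  const recall = tp + fn === 0 ? 0 : tp / (tp + fn);
  const f1 = precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  const total = predicted.length;
  return { accuracy: total === 0 ? 0 : (tp + tn) / total, precision, recall, f1 };
}
```

F1 is the more informative number when the classes are imbalanced, since accuracy can look high for a model that always predicts the majority class.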

Anti-patterns

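  • Deploying to prod without evals: regressions cannot be detected.
  • Small test set (5 cases): scores vary widely from run to run. 50+ cases recommended.
  • Test set leakage (cases used in training): misleading, inflated scores.
  • LLM-as-judge with the same model doing the scoring: self-preference bias.
  • Ignoring cost / latency: looking only at accuracy lets costs blow up.
  • Running the full eval on every change until nothing can ship: run only a small hot set on every PR, and the large suite nightly.
  • Subjective-only evaluation with no automation: needing a human every time does not scale.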
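
The advice on test set size can be made concrete with the margin of error of an observed pass rate. The sketch below uses the normal approximation for a proportion, which is itself rough at very small sample sizes; `marginOfError` is a hypothetical name.

```typescript
// 95% margin of error for an observed pass rate, using the normal
// approximation for a proportion: 1.96 * sqrt(p * (1 - p) / n).
function marginOfError(passRate: number, n: number): number {
  if (n <= 0) throw new Error("n must be positive");
  return 1.96 * Math.sqrt((passRate * (1 - passRate)) / n);
}

// At an observed pass rate of 0.8:
//   n = 5  -> about 0.35 (the true rate could plausibly be anywhere from 0.45 to 1)
//   n = 50 -> about 0.11
```

With five cases, a swing of one or two passes is indistinguishable from noise, so a 5% regression threshold only becomes meaningful with a considerably larger set.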

🤖 Hints for LLM use

  • Promptfoo / Braintrust / LangSmith are recommended.
  • For LLM-as-judge, use a different model from the one being evaluated.
  • Use a 5% regression threshold, and track cost / latency alongside it.

🔗 Related documents