2nd/10_Wiki/Topics/AI_and_ML/AI Evaluation & Benchmarks.md
2026-05-10 22:08:15 +09:00


id: wiki-2026-0508-ai-evaluation-benchmarks
title: AI Evaluation & Benchmarks
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: LLM eval, model benchmark, MMLU, HumanEval, SWE-bench, Chatbot Arena, NIAH, RULER
duplicate_of: none
source_trust_level: B
confidence_score: 0.9
verification_status: conceptual
tags: llm-eval, benchmark, mmlu, humaneval, swe-bench, chatbot-arena, niah, contamination, ai-quality
raw_sources:
last_reinforced: 2026-05-09
github_commit: pending
inferred_by: Claude Opus 4.7 (manual cleanup 2026-05-09)
tech_stack:
  language: Python / TS
  framework: Promptfoo / LangSmith / Inspect / lm-eval-harness

AI Evaluation & Benchmarks

📌 One-Line Insight (The Karpathy Summary)

"Feels good" vs. "measured". Benchmarks are standardized tests for each capability (math, code, reasoning, long context, tool use). Weaknesses: contamination, Goodhart's law, and eval ≠ real-world performance. The modern stack = LMSYS Chatbot Arena (human preference) + SWE-bench (real tasks) + custom domain evals.

📖 Structured Knowledge (Synthesized Content)

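Benchmark families

1. Knowledge / reasoning

| Benchmark | Measures | Note |
|---|---|---|
| MMLU (57 subjects) | Multi-domain knowledge | Most widely reported. Saturated at 90%+. |
| MMLU-Pro | Extension of MMLU, harder | Frontier scores are markedly lower than on MMLU. |
| GPQA | PhD-level science | Not yet saturated. |
| HellaSwag | Commonsense reasoning | Old; saturated. |
| ARC-AGI | Pattern reasoning | OpenAI o3 reported about 75% (human ≈ 85%). |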

2. Math

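| Benchmark | Measures |
|---|---|
| GSM8K | Grade-school multi-step word problems |
| MATH | Competition problems |
| AIME | American Invitational Mathematics Examination (competition qualifier) |
| FrontierMath | Research-level problems |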

3. Code

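| Benchmark | Measures |
|---|---|
| HumanEval | Python function generation |
| MBPP | Python coding |
| SWE-bench | Real GitHub issues |
| SWE-bench Verified | Curated subset |
| BigCodeBench | Complex Python |
| LiveCodeBench | Recent contest problems (LeetCode and similar) |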
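
Function-generation benchmarks such as HumanEval are typically scored with pass@k rather than plain accuracy. The sketch below shows the standard unbiased estimator, assuming `n` generations were sampled per problem and `c` of them passed the unit tests; the function name `pass_at_k` is illustrative, not taken from any harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem.

    Probability that at least one of k samples, drawn without replacement
    from n generations of which c are correct, passes the tests.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k draw contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The benchmark score is the mean of this value over all problems. Sampling more generations than `k` (for example n=20 for pass@1) reduces the variance of the estimate.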

4. Long context

| Benchmark | Measures |
|---|---|
| NIAH (Needle in a Haystack) | Retrieval of a single "needle" sentence |
| RULER | Multi-needle retrieval, multi-hop tracing, aggregation |
| LongBench | Long-document QA |
| Loong | Multi-document reasoning |
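
To make the NIAH setup concrete, here is a minimal sketch of how one test prompt can be assembled: a needle sentence is placed at a chosen relative depth inside filler text, and the answer is checked for the expected string. The helper names and the substring check are assumptions for illustration; published NIAH harnesses differ in detail.

```python
def build_niah_prompt(filler_sentences, needle, depth, question):
    """Insert `needle` at relative `depth` (0.0 = start, 1.0 = end) of the filler."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = round(depth * len(filler_sentences))
    haystack = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(haystack) + "\n\nQuestion: " + question


def needle_retrieved(answer, expected):
    """Lenient scoring: the expected string appears in the answer, ignoring case."""
    return expected.lower() in answer.lower()
```

Sweeping `depth` and the filler length produces the familiar depth-by-context-length grid of retrieval results.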

5. Agent / tool

| Benchmark | Measures |
|---|---|
| GAIA | Real-world tasks (web, files) |
| SWE-bench | Code agents |
| WebArena / VisualWebArena | Browser agents |
| MCP-Atlas | Tool composition |
| τ-bench | Customer-service simulation |

6. Real-world / human pref

| Benchmark | Measures |
|---|---|
| LMSYS Chatbot Arena | Blind A/B comparisons + Elo |
| MT-Bench | Multi-turn quality (LLM judge) |
| AlpacaEval | LLM judge |
| Vibes | Subjective preference (community) |
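
The "blind A/B + Elo" idea can be illustrated with the classic Elo update applied to a single pairwise vote. This is a sketch under stated assumptions: the K-factor of 32 and the 400-point scale are the conventional chess defaults, and the leaderboard's actual rating procedure may differ.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a: float, rating_b: float, outcome_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one vote.

    outcome_a is 1.0 if A was preferred, 0.0 if B was preferred, 0.5 for a tie.
    """
    delta = k * (outcome_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta
```

Because each vote moves the two ratings by equal and opposite amounts, the total rating mass stays constant, and an upset win against a higher-rated model moves ratings more than an expected win.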

7. Safety / alignment

| Benchmark | Measures |
|---|---|
| TruthfulQA | Avoiding false statements |
| HarmBench | Refusing harmful requests |
| Anthropic Persuasion | |
| Constitutional AI eval | |

Pitfalls (Goodhart's Law in AI)

  1. Contamination: benchmark data leaks into the training data → inflated scores. Every frontier model is suspect.
  2. Overfitting: each release is optimized for specific benchmarks.
  3. "Solution lookup": GSM8K questions appear in the training data, so the model retrieves instead of reasoning.
  4. Synthetic-data saturation: questions written by an LLM are solved by the same LLM.
  5. Real-world ≠ benchmark: high scores paired with bad UX are common.
  6. Subjectivity: chatbot quality is hard to measure.

→ Benchmark lifecycle: new → meaningful → saturated → meaningless → retired.

Trends in new benchmarks

  • Live / dynamic (LiveCodeBench, ARC-AGI): refreshed regularly or scored on held-out sets.
  • Verified (SWE-bench Verified): human-curated.
  • Real tasks (GAIA, τ-bench): actual work.
  • Human preference (Arena): hard to game.
  • Domain-specific: medical (MedQA), legal (LegalBench), scientific.
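
The live-benchmark idea reduces to a date filter: a model is scored only on problems published after its training cutoff, so those problems cannot have been in its training data. A minimal sketch, assuming each problem carries a hypothetical `released` date field:

```python
from datetime import date


def uncontaminated_subset(problems, training_cutoff: date):
    """Keep only problems released strictly after the model's training cutoff."""
    return [p for p in problems if p["released"] > training_cutoff]


# Illustrative data only; the ids and dates are made up.
problems = [
    {"id": "p1", "released": date(2024, 1, 10)},
    {"id": "p2", "released": date(2024, 6, 1)},
]
fresh = uncontaminated_subset(problems, training_cutoff=date(2024, 3, 1))
```

The cost of this design is that different models are scored on different subsets, so scores are only comparable over a shared post-cutoff window.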

💻 Code Patterns

lm-eval-harness (the EleutherAI standard)

pip install lm-eval

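# Run benchmarks
# Note: humaneval executes model-generated code; recent lm-eval releases
# require an explicit opt-in before running it.
lm_eval --model hf --model_args pretrained=meta-llama/Meta-Llama-3-8B \
    --tasks mmlu,gsm8k,humaneval \
    --batch_size 8

# Results are reported as JSON.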

Promptfoo (custom eval)

# promptfooconfig.yaml
prompts:
  - 'Solve this math problem: {{problem}}'

providers:
  - openai:gpt-4o-mini
  - anthropic:claude-haiku-4-5

tests:
  - vars:
      problem: 'If a train travels 60 mph for 2 hours, how far?'
    assert:
      - type: contains
        value: '120'

# Run the eval from the directory containing promptfooconfig.yaml
promptfoo eval

LangSmith eval

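# Legacy LangChain evaluation API; newer langsmith releases provide evaluate() instead.
from langchain.smith import RunEvalConfig, run_on_dataset
from langsmith import Client

client = Client()
results = run_on_dataset(
    client=client,
    dataset_name='math-questions',
    llm_or_chain_factory=lambda: chain,  # `chain` is the app under test, defined elsewhere
    evaluation=RunEvalConfig(evaluators=['qa', 'context_qa']),
)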

LLM-as-judge

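import json

def judge(question, answer, expected):
    # `judge_llm` is the judge model client, defined elsewhere;
    # complete() is assumed to return the raw text of the reply.
    prompt = f'''
Score the answer on a 1-10 scale.

Question: {question}
Expected: {expected}
Answer: {answer}

Output JSON: {{"score": N, "reason": "..."}}
'''
    # Raises json.JSONDecodeError if the reply is not bare JSON.
    return json.loads(judge_llm.complete(prompt))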

→ Cheap and scalable. Risk of bias: a model that judges its own outputs tends to favor them.
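
Judge models often wrap their JSON in extra prose or return scores outside the requested range, so the raw reply is worth validating before it enters any aggregate. The sketch below is one way to do that; `parse_judge_output` is a hypothetical helper, and the 1-10 range mirrors the prompt above.

```python
import json
import re


def parse_judge_output(raw: str, low: int = 1, high: int = 10) -> dict:
    """Extract and validate the {"score", "reason"} object from a judge reply."""
    found = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if found is None:
        raise ValueError("no JSON object found in judge output")
    verdict = json.loads(found.group(0))
    score = verdict.get("score")
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("score is missing or not a number")
    if not low <= score <= high:
        raise ValueError(f"score {score} is outside {low}-{high}")
    return {"score": score, "reason": str(verdict.get("reason", ""))}
```

Counting how many replies fail validation is itself a useful health signal for the judge prompt.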

Writing a custom benchmark

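# Golden set
test_cases = [
    {'input': 'What is 2+2?', 'expected': '4'},
    {'input': 'Capital of France?', 'expected': 'Paris'},
    # ... 100+
]

def match(answer, expected):
    # Lenient check: the expected string appears in the answer, ignoring case.
    return expected.strip().lower() in answer.strip().lower()

def evaluate(model):
    # `model` is any object exposing complete(prompt) -> str.
    correct = 0
    for case in test_cases:
        answer = model.complete(case['input'])
        if match(answer, case['expected']):
            correct += 1
    return correct / len(test_cases)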
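
With a golden set of only around a hundred cases, a single accuracy number hides a wide margin of error. The sketch below computes a Wilson score interval for the pass rate; it assumes independent cases and uses z = 1.96 for roughly 95% coverage.

```python
from math import sqrt


def wilson_interval(correct: int, total: int, z: float = 1.96):
    """Return the (low, high) Wilson score interval for correct/total."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("require total > 0 and 0 <= correct <= total")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)
```

If the intervals of two models overlap heavily, the golden set is too small to separate them and needs more cases before it can back a decision.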

Inspect (UK AISI)

from inspect_ai import Task, task, eval
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate

@task
def my_task():
    return Task(
        dataset=[
            Sample(input='Capital of France?', target='Paris'),
            Sample(input='What is 2+2?', target='4'),
        ],
        solver=generate(),  # `plan` is the older name for this parameter
        scorer=match(),
    )

eval(my_task(), model='openai/gpt-4o-mini')

→ Maintained by UK AISI; oriented toward safety evaluations.

Contamination check

# n-gram overlap between test and training data (lower is better)
def check_contamination(test_set, train_set, n=8):
    train_ngrams = set()
    for doc in train_set:
        tokens = doc.split()
        for i in range(len(tokens) - n + 1):
            train_ngrams.add(tuple(tokens[i:i+n]))
    
    overlapping = 0
    for q in test_set:
        tokens = q.split()
        for i in range(len(tokens) - n + 1):
            if tuple(tokens[i:i+n]) in train_ngrams:
                overlapping += 1
                break
    
    return overlapping / len(test_set)

→ An overlap of 5% or more is suspicious.

Domain-specific eval (example: medical)

# MedQA-style
test = [
    {
        'q': 'Patient has fever, cough, fatigue. Most likely?',
        'options': ['flu', 'covid', 'allergies', 'cancer'],
        # Either answer can be correct depending on context.
        'correct': {'flu', 'covid'},
    },
]

# Score = top-1 or top-2 accuracy.
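
Top-1 and top-2 accuracy can be computed from the model's ranked option list. This is a minimal sketch, assuming each case stores its acceptable answers as a set under `correct` and the model returns options ordered best first; the function name and data shapes are illustrative.

```python
def top_k_accuracy(cases, ranked_predictions, k: int = 1) -> float:
    """Fraction of cases where an acceptable answer is among the top k predictions."""
    if not cases or len(cases) != len(ranked_predictions):
        raise ValueError("cases and predictions must be non-empty and aligned")
    hits = sum(
        1
        for case, ranked in zip(cases, ranked_predictions)
        if set(ranked[:k]) & set(case["correct"])
    )
    return hits / len(cases)


# Illustrative data only.
cases = [{"correct": {"flu", "covid"}}, {"correct": {"allergies"}}]
predictions = [["covid", "flu", "cancer"], ["flu", "allergies", "covid"]]
```

Here the second case is missed at k=1 but counted at k=2, which is why the two numbers are usually reported together.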

Continuous eval (production)

# `trace`, `llm`, and `log` come from the application's own tracing and LLM stack.
@trace
def chat(query):
    response = llm.complete(query)
    log({'query': query, 'response': response, 'tokens': ...})
    return response

# Daily:
# 1. Sample 100 production queries.
# 2. Score them with an LLM judge.
# 3. Track the trend over time.

→ Detects drift.
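
The daily trend can be turned into an alert by comparing recent judge scores against an earlier baseline window. A minimal sketch follows; the window sizes and the 0.5-point threshold on a 1-10 scale are assumptions to tune per application.

```python
from statistics import mean


def detect_drift(daily_scores, baseline_days=7, recent_days=3, max_drop=0.5) -> bool:
    """Return True when the recent mean falls more than max_drop below the baseline mean.

    daily_scores is a chronological list of mean judge scores, one per day.
    """
    if len(daily_scores) < baseline_days + recent_days:
        return False  # not enough history to compare
    baseline = mean(daily_scores[:baseline_days])
    recent = mean(daily_scores[-recent_days:])
    return (baseline - recent) > max_drop
```

A fixed baseline window keeps slow degradation visible; a rolling baseline would absorb it.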

🤔 Decision Criteria

| Task | Benchmark |
|---|---|
| Generic capability | MMLU + GSM8K + HumanEval |
| Long context | RULER (NIAH is too easy) |
| Real-world coding | SWE-bench Verified |
| Real-world agent | GAIA / τ-bench |
| Human-perceived quality | LMSYS Arena Elo |
| Math reasoning | AIME / FrontierMath |
| Domain (medical, legal) | Domain-specific (MedQA, LegalBench) |
| Production app | Custom golden set + LLM judge |
| Safety | TruthfulQA + HarmBench |

Default: a custom domain eval (built from production traffic) + a Promptfoo CI gate. Check for regressions on every release.
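
A release gate boils down to comparing the candidate's scores with the last accepted baseline and failing the build on a meaningful drop. The sketch below is tool-agnostic; the metric names and the 0.01 tolerance are hypothetical.

```python
def regression_gate(baseline: dict, candidate: dict, tolerance: float = 0.01) -> list:
    """Return the metrics where the candidate is missing or worse than baseline by more than tolerance."""
    failures = []
    for name, base_score in baseline.items():
        new_score = candidate.get(name)
        if new_score is None or base_score - new_score > tolerance:
            failures.append(name)
    return failures


# Illustrative scores only.
baseline = {"math": 0.90, "support_qa": 0.80}
candidate = {"math": 0.91, "support_qa": 0.70}
failed = regression_gate(baseline, candidate)
```

In CI, a non-empty `failed` list would translate into a non-zero exit code so the release is blocked until the regression is explained.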

⚠️ Contradictions & Updates

  • Saturation is fast: MMLU is saturated at 90%+. New benchmarks are needed roughly every 6 months.
  • Real-world gap: high benchmark scores paired with bad UX are common. Production evals matter more.
  • Contamination epidemic: every frontier model is suspect. Live benchmarks (LiveCodeBench) are the answer.
  • Bench shopping: vendors publish only the benchmarks where they look best. Assume cherry-picking in every case.
  • Multi-modal: text alone is no longer enough. Image (MMMU), video (Video-MME), audio.
  • Evaluating reasoning traces: measuring the quality of chain-of-thought from models such as o1 / R1 is a new challenge.

🔗 Knowledge Links (Graph)

🤖 LLM Usage Hints (How to Use This Knowledge)

When to use this knowledge:

  • Comparing the quality of new LLMs (deciding which model to use).
  • Designing the eval behind a production system's release gate.
  • Checking for regressions whenever a prompt changes.
  • Measuring quality for a domain-specific application.
  • Reality-checking a vendor's marketing claims.

When not to use it:

  • Relying on benchmarks alone (without real user feedback).
  • Making a decision from a single benchmark (overfitting risk).
  • Trusting a contaminated benchmark.
  • Using an expensive frontier model for a small task (overkill).
  • Running only generic evals with no domain eval (fails in production).

Anti-Patterns

  • Claiming "best" from a single benchmark: cherry-picking. Report multiple benchmarks.
  • Skipping the contamination check: inflated scores.
  • Reusing a static benchmark year after year: once saturated, it carries no signal.
  • No human eval: an LLM judge alone is biased.
  • No production eval: the gap between benchmark and reality goes unmeasured.
  • Benchmark data in the training set: the resulting score is dishonest.
  • Ignoring eval cost: a GPT-4 judge × 10k cases adds up quickly.
  • Estimating a model's ceiling from a saturated benchmark: the ceiling is misjudged for every model.
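
The eval-cost point is easy to check with back-of-the-envelope arithmetic before launching a large judge run. In the sketch below, all token counts and per-million-token prices are placeholders, not real vendor pricing.

```python
def judge_cost_usd(
    num_cases: int,
    input_tokens_per_case: int,
    output_tokens_per_case: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
) -> float:
    """Estimate the total cost of judging num_cases with an LLM judge."""
    input_cost = num_cases * input_tokens_per_case * usd_per_million_input / 1_000_000
    output_cost = num_cases * output_tokens_per_case * usd_per_million_output / 1_000_000
    return input_cost + output_cost


# Placeholder numbers: 10k cases, 1,000 input and 200 output tokens each.
estimate = judge_cost_usd(10_000, 1_000, 200, 10.0, 30.0)
```

Multiplying the estimate by the number of models and prompt variants under test shows how quickly a nightly judge run grows, which is the usual argument for sampling or a cheaper judge.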

🧪 Validation Status

  • Information status: verified (concept-level).
  • Source trust level: B (Hugging Face leaderboard, Stanford HAI report, Papers With Code).
  • Review reason: manual cleanup. Specific benchmark numbers change monthly; a review every 6 months is recommended.

🧬 Duplicate Check

🕓 Changelog

| Date | Change | Action | Confidence |
|---|---|---|---|
| 2026-05-08 | P-Reinforce Phase 1 normalization | UPDATE | A |
| 2026-05-09 | Manual cleanup: added code patterns, benchmark families, decision criteria, anti-patterns | UPDATE | B |