---
id: wiki-2026-0508-ai-evaluation-benchmarks
title: AI Evaluation & Benchmarks
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [LLM eval, model benchmark, MMLU, HumanEval, SWE-bench, Chatbot Arena, NIAH, RULER]
duplicate_of: none
source_trust_level: B
confidence_score: 0.9
verification_status: conceptual
tags: [llm-eval, benchmark, mmlu, humaneval, swe-bench, chatbot-arena, niah, contamination, ai-quality]
raw_sources: []
last_reinforced: 2026-05-09
github_commit: pending
inferred_by: Claude Opus 4.7 (manual cleanup 2026-05-09)
tech_stack:
  language: Python / TS
  framework: Promptfoo / LangSmith / Inspect / lm-eval-harness
---

# AI Evaluation & Benchmarks

## 📌 One-Line Insight (The Karpathy Summary)
> **"Feels good" vs. "measured"**. Benchmarks are standardized tests for each capability (math, code, reasoning, long context, tool use). Weaknesses: contamination, Goodhart's law, and eval ≠ real world. The modern mix is LMSYS Chatbot Arena (human preference) + SWE-bench (real tasks) + custom domain evals.

## 📖 Structured Knowledge (Synthesized Content)

### Benchmark families

#### 1. Knowledge / reasoning
| Benchmark | Measures | Note |
|---|---|---|
| **MMLU** (57 subjects) | Multi-domain knowledge | Most widely cited. Saturated at 90%+. |
| **MMLU-Pro** | Harder extension of MMLU | Frontier scores sit well below their MMLU scores. |
| **GPQA** | PhD-level science | Not yet saturated. |
| **HellaSwag** | Commonsense reasoning | Old; saturated. |
| **ARC-AGI** | Pattern reasoning | OpenAI o3 reached about 75% (human ≈ 85%). |

#### 2. Math
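| Benchmark | Measures | Note |
|---|---|---|
| **GSM8K** | Grade-school multi-step word problems | Saturated (95%+). |
| **MATH** | Competition problems | Frontier 70-90%. |
| **AIME** | American Invitational Mathematics Examination (competition math) | Hard. o1 / R1 do well. |
| **FrontierMath** | Research-level math | Far from saturated (frontier <5%). |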

#### 3. Code
| Benchmark | Measures | Note |
|---|---|---|
| **HumanEval** | Python function generation | Saturated (95%+). |
| **MBPP** | Python coding | Saturated. |
| **SWE-bench** | Real GitHub issues | Frontier ~50-60%. |
| **SWE-bench Verified** | Human-curated subset | More reliable. |
| **BigCodeBench** | Complex Python | Frontier ~30-50%. |
| **LiveCodeBench** | Recent contest problems (LeetCode) | Updated regularly to limit contamination. |
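
Scores on HumanEval-style code benchmarks are usually reported as pass@k: the probability that at least one of k sampled solutions passes the unit tests. The sketch below uses the commonly cited unbiased estimator, 1 - C(n-c, k) / C(n, k), for n samples with c passing. The function names are illustrative, not taken from any harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem: n samples drawn, c of them passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k must include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds (n, c) per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

Sampling n > k completions per problem and applying the estimator tends to give a lower-variance score than drawing exactly k samples once.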

#### 4. Long context
| Benchmark | Measures | Note |
|---|---|---|
| **NIAH (Needle in a Haystack)** | Retrieval of a planted "needle" sentence | Now trivial; too easy. |
| **RULER** | Multi-needle, summarize, multi-hop | More realistic. |
| **LongBench** | Long doc QA |  |
| **Loong** | Multi-doc reasoning |  |
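
A needle-in-a-haystack case is simple to construct, which helps explain why it saturated quickly: plant one sentence at a chosen depth in filler text and check whether the model's answer contains it. This is a minimal sketch; `build_haystack` and `score_retrieval` are hypothetical helper names, and the model call itself is left out.

```python
import random


def build_haystack(filler_sentences, needle, depth, n_sentences, seed=0):
    """Return filler text with `needle` inserted at relative depth 0.0-1.0."""
    rng = random.Random(seed)
    body = [rng.choice(filler_sentences) for _ in range(n_sentences)]
    position = round(depth * n_sentences)  # 0 = start, n_sentences = end
    body.insert(position, needle)
    return " ".join(body)


def score_retrieval(model_answer, expected):
    """Lenient scoring: the expected fact appears in the answer."""
    return expected.lower() in model_answer.lower()
```

Sweeping `depth` and `n_sentences` gives the usual depth-by-length grid. RULER-style tasks extend the same idea with several needles or multi-hop questions.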

#### 5. Agent / tool
| Benchmark | Measures | Note |
|---|---|---|
| **GAIA** | Real-world tasks (web, file) | Frontier ~30%. |
| **SWE-bench** | Code agent | Common benchmark for coding agents (Devin, Cursor). |
| **WebArena / VisualWebArena** | Browser agent | Far from saturated (<30%). |
| **MCP-Atlas** | Tool composition |  |
| **τ-bench** | Customer service simulation |  |

#### 6. Real-world / human pref
| Benchmark | Measures | Note |
|---|---|---|
| **LMSYS Chatbot Arena** | Blind A/B + Elo | Most trusted real-world signal. |
| **MT-Bench** | Multi-turn quality (LLM-judge) |  |
| **AlpacaEval** | LLM-judge |  |
| **Vibes** | Subjective pref (community) |  |
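
The "blind A/B + Elo" idea can be sketched with the classic online Elo update: each human vote is a match between two models, and ratings move toward the observed outcome. This is an illustrative sketch only; the K-factor of 32 is an assumed value, and a live leaderboard's exact statistical model may differ.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def update_elo(r_a: float, r_b: float, outcome_a: float, k: float = 32.0):
    """Update both ratings after one vote.

    outcome_a is 1.0 if A was preferred, 0.0 if B was, 0.5 for a tie.
    """
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (outcome_a - e_a)
    new_b = r_b + k * ((1.0 - outcome_a) - (1.0 - e_a))
    return new_a, new_b
```

Because the update is zero-sum, the total rating is conserved; an upset win against a much higher-rated model moves both ratings more than an expected win does.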

#### 7. Safety / alignment
| Benchmark | Measures | Note |
|---|---|---|
| **TruthfulQA** | Avoids stating falsehoods |  |
| **HarmBench** | Refusal of harmful requests |  |
| **Anthropic Persuasion** |  |  |
| **Constitutional AI eval** |  |  |

### Pitfalls (Goodhart's Law in AI)
1. **Contamination**: benchmark items leak into the training data → artificially high scores. Every frontier model is suspect.
2. **Overfitting**: each release is tuned toward specific benchmarks.
3. **"Solution lookup"**: GSM8K questions appear in the training data, so the model retrieves answers instead of reasoning.
4. **Saturation from synthetic data**: questions generated by an LLM get solved by the same LLM.
5. **Real world ≠ benchmark**: high scores with poor UX are common.
6. **Subjectivity**: chatbot quality is hard to measure.

→ Benchmark lifecycle: new → meaningful → saturated → no longer meaningful → retired.

### New benchmark trends
- **Live / dynamic** (LiveCodeBench, ARC-AGI): refreshed regularly (e.g., monthly) so items stay out of training data.
- **Verified** (SWE-bench Verified): human-curated.
- **Real task** (GAIA, τ-bench): actual work.
- **Human pref** (Arena): hard to game.
- **Domain-specific**: medical (MedQA), legal (LegalBench), scientific.

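## 💻 Code Patterns

### lm-eval-harness (EleutherAI standard)
```bash
pip install lm-eval

# Run benchmarks; results are written as JSON under --output_path.
# humaneval executes generated code, so recent versions may require an explicit opt-in.
lm_eval --model hf --model_args pretrained=meta-llama/Meta-Llama-3-8B \
    --tasks mmlu,gsm8k,humaneval \
    --batch_size 8 \
    --output_path results/
```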

### Promptfoo (custom eval)
```yaml
# promptfooconfig.yaml
prompts:
  - 'Solve this math problem: {{problem}}'

providers:
  - openai:gpt-4o-mini
  - anthropic:claude-haiku-4-5

tests:
  - vars:
      problem: 'If a train travels 60 mph for 2 hours, how far?'
    assert:
      - type: contains
        value: '120'
```

```bash
promptfoo eval
```

### LangSmith eval
```python
from langsmith import Client
from langchain.smith import RunEvalConfig

# Legacy LangChain-era API; `chain` is assumed to be a chain or LLM built elsewhere.
client = Client()
results = client.run_on_dataset(
    dataset_name='math-questions',
    llm_or_chain_factory=chain,
    evaluation=RunEvalConfig(evaluators=['qa', 'context_qa']),
)
```

### LLM-as-judge
```python
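import json

# `judge_llm` is assumed to be a client whose complete() returns the raw text response.
def judge(question, answer, expected):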
    prompt = f'''
Score the answer on 1-10 scale.

Question: {question}
Expected: {expected}
Answer: {answer}

Output JSON: {{"score": N, "reason": "..."}}
'''
    return json.loads(judge_llm.complete(prompt))
```

→ Cheap and scalable. Risk of bias: a model grading its own outputs tends to favor them.
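
One way to keep judge bias in check is to label a small sample by hand and measure how well the judge agrees with the human labels beyond chance. The sketch below computes Cohen's kappa in plain Python; the label values and any acceptance threshold are assumptions to be set per project.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both raters labeled independently at their own rates.
    expected = sum(
        count_a[label] * count_b[label] for label in set(count_a) | set(count_b)
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)
```

A kappa near 1 suggests the judge can stand in for human review on this task; a value near 0 means its scores carry little more than chance agreement.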

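### Writing a custom benchmark
```python
# Golden set
test_cases = [
    {'input': 'What is 2+2?', 'expected': '4'},
    {'input': 'Capital of France?', 'expected': 'Paris'},
    # ... 100+
]

def match(answer, expected):
    # Lenient check: the expected string appears in the answer (case-insensitive).
    return expected.lower() in answer.lower()

def evaluate(model):
    # `model` is assumed to expose complete(prompt) -> str.
    correct = 0
    for case in test_cases:
        answer = model.complete(case['input'])
        if match(answer, case['expected']):
            correct += 1
    return correct / len(test_cases)
```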
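
A golden set of around 100 cases gives a noisy accuracy estimate, so it helps to report an interval next to the point score. Below is a minimal percentile-bootstrap sketch over per-case outcomes (1 = correct, 0 = wrong). The resample count, alpha, and seed are arbitrary assumptions.

```python
import random


def bootstrap_accuracy_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Return (accuracy, lower, upper) using a percentile bootstrap."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample the cases with replacement and recompute accuracy each time.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(outcomes) / n, lower, upper
```

If two models' intervals overlap heavily, the golden set is probably too small to rank them, and adding cases is more informative than rerunning.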

### Inspect (UK AISI)
```python
from inspect_ai import Task, task, eval
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate

@task
def my_task():
    return Task(
        dataset=[
            Sample(input='Capital of France?', target='Paris'),
            Sample(input='What is 2+2?', target='4'),
        ],
        solver=generate(),
        scorer=match(),
    )

eval(my_task(), model='openai/gpt-4o-mini')
```

→ AISI / safety-focused.

### Contamination check
```python
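# n-gram overlap between test questions and training documents (lower is better)
def check_contamination(test_set, train_set, n=8):
    # Collect every whitespace-tokenized n-gram seen in the training data.
    train_ngrams = set()
    for doc in train_set:
        tokens = doc.split()
        for i in range(len(tokens) - n + 1):
            train_ngrams.add(tuple(tokens[i:i+n]))

    # Count test questions that share at least one n-gram with the training data.
    overlapping = 0
    for q in test_set:
        tokens = q.split()
        for i in range(len(tokens) - n + 1):
            if tuple(tokens[i:i+n]) in train_ngrams:
                overlapping += 1
                break

    # Fraction of contaminated questions; 0.0 for an empty test set.
    return overlapping / len(test_set) if test_set else 0.0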
```

→ 5%+ overlap is suspicious.

### Domain-specific eval (e.g., medical)
```python
# MedQA-style
test = [
    {
        'q': 'Patient has fever, cough, fatigue. Most likely?',
        'options': ['flu', 'covid', 'allergies', 'cancer'],
        # More than one answer can be acceptable depending on clinical context.
        'correct': ['flu', 'covid'],
    },
]

# Score = top-1 or top-2 accuracy.
```

### Continuous eval (production)
```python
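# `trace`, `llm`, and `log` are placeholders for your tracing decorator,
# model client, and logging sink.
@trace
def chat(query):
    response = llm.complete(query)
    log({'query': query, 'response': response, 'tokens': ...})
    return response

# Daily:
# 1. Sample 100 production queries.
# 2. Score them with an LLM judge.
# 3. Track the trend over time.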

→ Detects drift.
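
The "trend over time" step can be reduced to a simple comparison: average the daily judge scores over a baseline window and flag drift when the recent window falls too far below it. This is a sketch under assumed window sizes and an assumed drop threshold on a 1-10 score scale; `detect_drift` is a hypothetical helper.

```python
from statistics import mean


def detect_drift(daily_scores, baseline_days=7, recent_days=3, max_drop=0.5):
    """Return True when the recent mean score drops more than max_drop below baseline.

    daily_scores is ordered oldest to newest, one mean judge score per day.
    """
    if len(daily_scores) < baseline_days + recent_days:
        return False  # not enough history to compare two windows
    baseline = mean(daily_scores[:baseline_days])
    recent = mean(daily_scores[-recent_days:])
    return baseline - recent > max_drop
```

A fixed baseline window keeps slow degradation visible; a rolling baseline would adapt to it and could hide a gradual decline.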

## 🤔 Decision Criteria

| Task | Benchmark |
|---|---|
| Generic capability | MMLU + GSM8K + HumanEval |
| Long context | RULER (NIAH is too easy) |
| Real-world coding | SWE-bench Verified |
| Real-world agent | GAIA / τ-bench |
| Human-perceived quality | LMSYS Chatbot Arena Elo |
| Math reasoning | AIME / FrontierMath |
| Domain (medical, legal) | Domain-specific (MedQA, LegalBench) |
| Production app | Custom golden set + LLM-judge |
| Safety | TruthfulQA + HarmBench |

**Default**: custom domain eval (built from production traffic) + Promptfoo CI gate. Run regression checks on every release.
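
The release-gate idea can be sketched independently of any tool: compare the candidate's metrics against the last accepted baseline and block the release if any metric regresses beyond a tolerance. The metric names, the tolerance, and `regression_gate` itself are illustrative assumptions, not part of Promptfoo.

```python
def regression_gate(baseline, candidate, tolerance=0.01):
    """Return the metrics where the candidate regressed; empty list means pass.

    baseline and candidate map metric name -> score (higher is better).
    """
    failures = []
    for metric, base_value in baseline.items():
        cand_value = candidate.get(metric)
        # A missing metric counts as a failure: the eval did not run.
        if cand_value is None or cand_value < base_value - tolerance:
            failures.append(metric)
    return failures
```

In CI, a non-empty result would typically fail the job, for example by exiting with a non-zero status after printing the regressed metrics.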

## ⚠️ Contradictions & Updates
- **Fast saturation**: MMLU is saturated at 90%+. New benchmarks are needed roughly every 6 months.
- **Real-world gap**: high benchmark scores with poor UX are common. Production eval matters more.
- **Contamination epidemic**: every frontier model is suspect. Live benchmarks (LiveCodeBench) are the answer.
- **Bench shopping**: vendors publish only the benchmarks where they look best, so headline numbers are cherry-picked.
- **Multimodal**: evaluation is not text-only. Image (MMMU), video (Video-MME), audio.
- **Evaluating reasoning traces**: measuring the quality of o1 / R1 chain-of-thought is a new challenge.

## 🔗 Knowledge Links (Graph)
- Variants: [[LLM-as-Judge]]
- Tools: lm-eval-harness · Promptfoo · LangSmith · Inspect (AISI) · Braintrust · Helicone · Langfuse
- Related: [[Code Agent — Devin / Cursor / Claude Code]]

## 🤖 LLM Usage Hints (How to Use This Knowledge)

**When to use this knowledge:**
- Comparing the quality of new LLMs (deciding which model to use).
- Designing the eval behind a production release gate.
- Running regression checks whenever a prompt changes.
- Measuring quality for a domain-specific application.
- Reality-checking vendor marketing claims.

**When not to use it:**
- Relying on benchmarks alone (no real user feedback).
- Making a decision from a single benchmark (overfitting risk).
- Trusting a contaminated benchmark.
- Using an expensive frontier model for a small task (overkill).
- Running only generic evals with no domain eval (fails in production).

## ❌ Anti-Patterns
- **Single benchmark + claiming "best"**: cherry-picking. Use multiple benchmarks.
- **Skipping the contamination check**: inflated scores.
- **Reusing a static benchmark year after year**: once saturated, it is meaningless.
- **No human eval**: LLM-judge alone is biased.
- **No production eval**: leaves the benchmark-vs-reality gap unmeasured.
- **Benchmark in the training data**: the model's score is dishonest.
- **Ignoring eval cost**: GPT-4 judge × 10k cases = $$.
- **Estimating a model's ceiling from a saturated benchmark**: misjudges every model's ceiling.
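
The eval-cost point is simple arithmetic worth doing before launching a large judge run. The sketch below multiplies case count by token counts and per-token prices. Every number passed in is a hypothetical placeholder, so substitute the current prices for whichever judge model is used.

```python
def judge_cost_usd(
    n_cases,
    input_tokens_per_case,
    output_tokens_per_case,
    usd_per_1m_input,
    usd_per_1m_output,
):
    """Estimate the total cost of one LLM-judge pass over an eval set."""
    input_cost = n_cases * input_tokens_per_case * usd_per_1m_input / 1_000_000
    output_cost = n_cases * output_tokens_per_case * usd_per_1m_output / 1_000_000
    return input_cost + output_cost


# Hypothetical prices: 10k cases, 1,000 input and 100 output tokens each.
estimate = judge_cost_usd(10_000, 1_000, 100, usd_per_1m_input=10.0, usd_per_1m_output=30.0)
```

Multiplying the estimate by the number of releases per month shows whether a cheaper judge or a sampled subset is needed for routine runs.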

## 🧪 Validation
- **Status:** verified (concept-level).
- **Source trust:** B (Hugging Face leaderboard, Stanford HAI report, Papers With Code).
- **Review reason:** Manual cleanup. Specific benchmark numbers change monthly; review every 6 months.

## 🧬 Duplicate Check
- **Existing similar documents:** [[LLM-Capabilities]] (related), [[Continuous-Learning-System]] (production eval), [[AI_Eval_Framework_Modern]] (tools).
- **Action:** KEEP (overview of benchmarks).
- **Reason:** Kept separate from the tool / framework notes; this note holds the per-benchmark detail.

## 🕓 Changelog
| Date | Change | Action | Trust |
|------|-----------|-----------|--------|
| 2026-05-08 | P-Reinforce Phase 1 normalization | UPDATE | A |
| 2026-05-09 | Manual cleanup: added code patterns, benchmark families, decision criteria, and anti-patterns | UPDATE | B |