---
id: wiki-2026-0508-benchmarks
title: Benchmarks (AI Evaluation)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [벤치마크, AI benchmarks, MMLU, HumanEval, MATH, GLUE, SuperGLUE, evaluation, leaderboard, Goodharts Law]
duplicate_of: none
source_trust_level: A
confidence_score: 0.93
verification_status: applied
tags: [benchmark, evaluation, mmlu, humaneval, math, swe-bench, contamination, leaderboard, helm]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: lm-evaluation-harness / HELM / OpenCompass
---

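# Benchmarks

## 📌 One-line insight
> **"A measuring tape for intelligence."** Benchmarks give a standardized, like-for-like comparison, and they serve both as research milestones and as marketing. They are subject to Goodhart's Law (a metric that becomes a target saturates), and in the modern era contamination is the main worry.
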
## 📖 Core

### NLP / LLM benchmarks

#### General reasoning
- **MMLU** (57 subjects, multiple choice): the standard of the GPT era.
- **MMLU-Pro** (2024): harder, and partly a response to contamination.
- **GPQA** (graduate-level science): hard.
- **BIG-Bench Hard**: targets known LLM weak points.
- **AGIEval**: human exams such as the SAT, GRE, and LSAT.

#### Math
- **GSM8K** (grade-school math): saturated.
- **MATH** (competition problems): hard.
- **AIME** / **IMO**: the frontier.

#### Code
- **HumanEval** (OpenAI): saturated.
- **MBPP**: basic Python.
- **SWE-bench** (Princeton): real GitHub issues.
- **LiveCodeBench**: contamination-aware.

#### Instruction following
- **AlpacaEval** / **MT-Bench**: LLM-as-judge.
- **Arena (LMSYS)**: human pairwise comparison.
- **IFEval**: verifiable instructions.

#### Long context
- **Needle in a Haystack**: retrieval.
- **RULER**: multi-task.
- **InfiniteBench**.

#### Agentic / tool use
- **WebArena** / **GAIA**: real tasks.
- **OSWorld**: desktop GUI.
- **τ-bench** (tau-bench): customer service.

#### Safety / alignment
- **TruthfulQA**: honesty.
- **BBQ** (bias QA).
- **HarmBench** / **AdvBench**: jailbreaks.
- **MACHIAVELLI**: power-seeking.

### Vision benchmarks
- **ImageNet**: classification.
- **COCO**: detection / segmentation.
- **VQAv2**: visual QA.
- **MMMU**: the multimodal counterpart of MMLU.

### Problems

#### Goodhart's Law
- "When a measure becomes a target, it ceases to be a good measure."
- A saturated benchmark mostly shows that models have learned to game it.

#### Data contamination
- Test sets leak into pretraining data.
- The resulting high score is inflated and does not reflect real ability.
- → LiveCodeBench and MMLU-Pro are attempts to mitigate this.

#### Construct validity
- What is measured ≠ what is wanted.
- MMLU is multiple choice; real usage is not.

#### Distribution shift
- Academic benchmarks ≠ real-world usage.

#### Evaluation cost
- Evaluating with GPT-4 is expensive.
- LLM-as-judge introduces its own bias.

### Modern best practices
1. **Multiple benchmarks**: exposes gaming of any single one.
2. **Held-out test set**: keep it fresh.
3. **Contamination check**: n-gram matching.
4. **LLM-as-judge audit**: check for self-bias.
5. **Human preference** (Arena): treated as ground truth.
6. **HELM** (Stanford): holistic, multi-axis.
7. **Task-specific eval**: an internal benchmark.

## 💻 Patterns

### lm-evaluation-harness (EleutherAI)
```bash
pip install lm-eval

lm_eval --model hf \
  --model_args pretrained=meta-llama/Meta-Llama-3-8B \
  --tasks mmlu,gsm8k,arc_challenge,truthfulqa \
  --device cuda \
  --batch_size 8
```

→ The standard, reproducible way to run these benchmarks.

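### HELM (Stanford)
```python
# Holistic evaluation (schematic pseudocode).
# NOTE: `run(model=..., scenarios=...)` is not a verified HELM API. HELM is
# normally driven from its command-line tools (`helm-run`, `helm-summarize`);
# check the HELM documentation for the exact run-entry syntax and scenario names.
from helm.benchmark.run import run

scenarios = [
    'mmlu',
    'truthfulqa',
    'bbq',
    'real_toxicity_prompts',
    'civil_comments',
]
run(model='openai/gpt-4', scenarios=scenarios)
```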

### Custom internal benchmark
```python
from collections import defaultdict

def evaluate_custom(model, test_cases):
    """Run every test case through the model and print the mean score per category."""
    results = []
    for case in test_cases:
        response = model.generate(case.prompt)
        score = case.judge(response)  # task-specific scoring
        results.append({
            'case_id': case.id,
            'category': case.category,  # needed for the breakdown below
            'score': score,
            'response': response,
            'expected': case.expected,
        })

    # Per-category metric breakdown
    by_category = defaultdict(list)
    for item in results:
        by_category[item['category']].append(item['score'])
    for cat, scores in by_category.items():
        print(f'{cat}: {sum(scores) / len(scores):.3f}')

    return results
```
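
The harness above assumes case objects that carry `id`, `prompt`, `expected`, and a `judge` method (plus a `category` for the per-category breakdown), and a model that exposes `generate`. The sketch below shows one possible shape for those objects. `EvalCase` and `LookupModel` are hypothetical names introduced only for illustration, and exact-match scoring is just one choice of judge.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    """Hypothetical test-case shape: one prompt scored by exact match."""
    id: str
    category: str
    prompt: str
    expected: str

    def judge(self, response: str) -> float:
        # Exact match after trimming and lowercasing; swap in a task-specific scorer.
        return float(response.strip().lower() == self.expected.strip().lower())


class LookupModel:
    """Stand-in model that answers from a fixed table (for wiring tests only)."""

    def __init__(self, answers):
        self.answers = answers

    def generate(self, prompt: str) -> str:
        return self.answers.get(prompt, "")


cases = [
    EvalCase("c1", "arithmetic", "2+2=", "4"),
    EvalCase("c2", "arithmetic", "3*3=", "9"),
    EvalCase("c3", "capital", "Capital of France?", "Paris"),
]
model = LookupModel({"2+2=": "4", "3*3=": "8", "Capital of France?": " paris "})

# Score each case the same way the harness does: generate, then judge.
scores = {c.id: c.judge(model.generate(c.prompt)) for c in cases}
```

A stand-in model like this is useful for checking the harness wiring before spending money on real model calls.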
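
### LLM-as-judge (with calibration)
```python
import re

def parse_score(text):
    """Extract the first digit 1-5 from the judge's reply."""
    match = re.search(r'[1-5]', text)
    if match is None:
        raise ValueError(f'no score found in: {text!r}')
    return int(match.group())

def llm_judge(prompt, response, reference, judge_model, n_samples=5):
    # judge_model: any object with .generate(str) -> str (assumed interface)
    judge_prompt = f"""Compare the response against the reference.
Score 1-5 (5 = matches reference, 1 = wrong).

Prompt: {prompt}
Reference: {reference}
Response: {response}

Score: """

    # Average n_samples judgments to reduce variance; this only helps when
    # the judge samples with temperature > 0.
    scores = [parse_score(judge_model.generate(judge_prompt)) for _ in range(n_samples)]
    return sum(scores) / len(scores)
```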
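
Averaging repeated judgments reduces variance, but it does not show whether the judge agrees with people. One way to approach the calibration named in the heading is to score a small sample with both the judge and human raters and compare the two. The sketch below is an assumption about how that comparison might look: `judge_calibration` and its `tolerance` parameter are hypothetical, not part of any library.

```python
from statistics import mean


def judge_calibration(judge_scores, human_scores, tolerance=0.5):
    """Compare judge scores with human scores given for the same items."""
    if len(judge_scores) != len(human_scores) or not judge_scores:
        raise ValueError("need two equally sized, non-empty score lists")
    diffs = [j - h for j, h in zip(judge_scores, human_scores)]
    within = [abs(d) <= tolerance for d in diffs]
    return {
        "bias": mean(diffs),                    # > 0: judge is more lenient than humans
        "mae": mean([abs(d) for d in diffs]),   # average absolute disagreement
        "agreement": sum(within) / len(within), # share of items within tolerance
    }


report = judge_calibration([4.0, 5.0, 3.0, 2.0], [4.0, 4.0, 3.0, 1.0])
```

A consistently positive `bias` suggests subtracting an offset or tightening the rubric before trusting the judge on unlabeled data.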
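
### Contamination check (n-gram)
```python
def contamination_check(test_examples, pretrain_corpus, n=13):
    """Fraction of test examples sharing at least one n-gram with the corpus."""
    # Assumed helpers (not defined here): get_ngrams(text, n) returns the
    # n-grams of a text; pretrain_corpus.search(ngrams) yields candidate docs.
    if not test_examples:
        return 0.0
    contaminated = 0
    for ex in test_examples:
        ngrams = set(get_ngrams(ex.text, n))
        for doc in pretrain_corpus.search(ngrams):
            if any(ng in doc for ng in ngrams):
                contaminated += 1
                break  # one matching document is enough
    return contaminated / len(test_examples)
```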
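
The check above relies on a `get_ngrams` helper and a corpus `search` method that the snippet leaves undefined. As a minimal in-memory sketch, assuming word-level n-grams over lowercased, whitespace-split text and a corpus small enough to index in a Python set, the same idea might look like this. `contamination_rate` is a hypothetical name, and a real pretraining corpus would need a scalable index instead.

```python
def get_ngrams(text, n):
    """Word-level n-grams as tuples, after lowercasing and whitespace tokenization."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def contamination_rate(test_texts, corpus_docs, n=13):
    """Fraction of test texts that share at least one n-gram with the corpus."""
    corpus_ngrams = set()
    for doc in corpus_docs:
        corpus_ngrams.update(get_ngrams(doc, n))
    if not test_texts:
        return 0.0
    hits = sum(
        1 for text in test_texts
        if any(ng in corpus_ngrams for ng in get_ngrams(text, n))
    )
    return hits / len(test_texts)


corpus = ["the quick brown fox jumps over the lazy dog near the river bank"]
tests = [
    "A quick brown fox jumps over a fence",
    "completely unrelated sentence about benchmarks here",
]
rate = contamination_rate(tests, corpus, n=4)  # n=4 only to keep the toy data short
```

Texts shorter than `n` tokens produce no n-grams and are therefore never flagged, which is worth keeping in mind for short test items.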

### Pairwise human eval (Arena-style)
```python
import random

def pairwise_eval(model_a, model_b, prompts, n_judges=10):
    # Assumed helpers: human_judge(prompt, r1, r2) returns '1', '2', or 'tie';
    # majority(votes) returns the winning label.
    wins = {'a': 0, 'b': 0, 'tie': 0}
    for prompt in prompts:
        ra, rb = model_a.gen(prompt), model_b.gen(prompt)
        # Randomize presentation order to cancel position bias;
        # `label` records which model is shown first.
        if random.random() < 0.5:
            r1, r2, label = ra, rb, 'a'
        else:
            r1, r2, label = rb, ra, 'b'

        votes = [human_judge(prompt, r1, r2) for _ in range(n_judges)]
        winner = majority(votes)
        if winner == 'tie':
            wins['tie'] += 1
        elif winner == '1':
            wins[label] += 1
        else:
            wins['a' if label == 'b' else 'b'] += 1
    return wins
```
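
The loop above calls a `majority` helper without defining it. One possible sketch is shown below. Treating an empty vote list or a draw between the two most common labels as `'tie'` is an assumption made here, not something the original specifies.

```python
from collections import Counter


def majority(votes):
    """Return the most common vote, or 'tie' when the top two labels draw."""
    if not votes:
        return "tie"
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "tie"
    return ranked[0][0]
```

With an odd number of judges and no explicit tie votes, a draw cannot occur, which is one reason to prefer an odd `n_judges`.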

### Bradley-Terry (Elo) for LMSYS Arena
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_elo(matches, models):
    # matches: [(winner_idx, loser_idx), ...]
    X = np.zeros((len(matches), len(models)))
    for i, (w, l) in enumerate(matches):
        X[i, w] = 1
        X[i, l] = -1

    # Logistic regression needs both classes, so mirror every match:
    # (winner - loser) -> 1 and (loser - winner) -> 0.
    X = np.vstack([X, -X])
    y = np.concatenate([np.ones(len(matches)), np.zeros(len(matches))])

    clf = LogisticRegression(fit_intercept=False).fit(X, y)
    # Elo-style rating = scaled Bradley-Terry coefficient, centered on 1000
    return 400 / np.log(10) * clf.coef_[0] + 1000
```
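
The `400 / ln(10)` scaling is what puts Bradley-Terry coefficients on the familiar Elo scale: with ratings defined that way, the logistic win probability becomes `1 / (1 + 10 ** ((R_b - R_a) / 400))`. The small sketch below evaluates that formula. `expected_win_prob` is a name introduced here for illustration.

```python
def expected_win_prob(rating_a, rating_b):
    """Probability that A beats B under the Elo / Bradley-Terry model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


# A 400-point lead corresponds to 10:1 odds, i.e. a win probability of 10/11.
p_strong = expected_win_prob(1400, 1000)
p_equal = expected_win_prob(1000, 1000)
```

This also shows why small rating gaps on a leaderboard mean little: a 20-point lead corresponds to a win probability only slightly above one half.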

## 🤔 Decision criteria
| Goal | Benchmark |
|---|---|
| General LLM | MMLU-Pro + GPQA + Arena |
| Math | MATH + AIME |
| Code | SWE-bench + LiveCodeBench |
| Instruction | IFEval + AlpacaEval |
| Safety | TruthfulQA + HarmBench |
| Long context | RULER + Needle in a Haystack |
| Agentic | GAIA + WebArena |
| Multi-modal | MMMU |
| Internal | Custom (task-specific) |

**Default**: multiple benchmarks plus an internal eval, cross-checked against Arena.

## 🔗 Graph
- Parent: [[Evaluation]]
- Variants: [[MMLU]] · [[HumanEval]] · [[SWE-bench]] · [[GLUE]] · [[ImageNet]]
- Adjacent: [[Goodharts-Law]] · [[LLM-as-Judge]]

## 🤖 LLM usage
**When**: model selection; measuring the effect of fine-tuning; identifying capability gaps.
**When not**: relying on a single benchmark as the deciding factor; evaluating without a contamination check.

## ❌ Anti-patterns
- **Single benchmark**: vulnerable to gaming.
- **Training on a public test set**: contamination.
- **No Arena / human eval**: academic ≠ real.
- **Stale (saturated) benchmark**: carries no information.
- **LLM-as-judge only**: self-bias (GPT-4 favors GPT-4).
- **No internal eval**: misses task-specific gaps.

## 🧪 Verification / duplicates
- Verified (Stanford HELM, EleutherAI harness, LMSYS).
- Trust level: A.
- Related: [[MMLU]] · [[Goodharts-Law]] · [[Data-Contamination]] · [[LLM-as-Judge]].

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: benchmark catalog + contamination + lm-eval / HELM code |