f8b21af4be
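10_Wiki/Topics large-scale cleanup: - removed 227 error-capture / unfinished stub documents - merged 43 cross-folder duplicate clusters (63 files → redirect) - normalized link names: fixed broken links, pointed redirects straight at their targets, mapped concepts (~2,400 cases) - created 6 new category MOCs - removed 10,058 unresolved related-keyword links from Graph sections Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>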
7.7 KiB
| id | title | category | status | canonical_id | aliases | duplicate_of | source_trust_level | confidence_score | verification_status | tags | raw_sources | last_reinforced | github_commit | tech_stack |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| wiki-2026-0508-benchmarks | Benchmarks (AI Evaluation) | 10_Wiki/Topics | verified | self | | none | A | 0.93 | applied | | | 2026-05-10 | pending | |
Benchmarks
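📌 One-line insight
"A tape measure for intelligence." Standardized benchmarks allow like-for-like comparison. They serve as both research milestones and marketing material. Goodhart's Law applies: once a metric becomes the target, it saturates. In the modern era, data contamination is the main worry.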
📖 Core
NLP / LLM benchmarks
General reasoning
- MMLU (57 subjects, multiple choice): the standard of the GPT era.
- MMLU-Pro (2024): harder; meant to mitigate contamination.
- GPQA (graduate-level science): hard.
- BIG-Bench Hard: targets known LLM weak points.
- AGIEval: exam questions (SAT, GRE, LSAT).
Math
- GSM8K (grade school math): saturated.
- MATH (competition): hard.
- AIME / IMO: frontier.
Code
- HumanEval (OpenAI): saturated.
- MBPP: basic Python.
- SWE-bench (Princeton): real GitHub issues.
- LiveCodeBench: contamination-aware.
Instruction following
- AlpacaEval / MT-Bench: LLM-as-judge.
- Arena (LMSYS): human pairwise comparison.
- IFEval: verifiable instructions.
Long context
- Needle in Haystack: retrieval.
- RULER: multi-task.
- InfiniteBench.
Agentic / tool use
- WebArena / GAIA: real tasks.
- OSWorld: desktop GUI.
- τ-bench (tau-bench): customer service.
Safety / alignment
- TruthfulQA: honesty.
- BBQ (bias QA).
- HarmBench / AdvBench: jailbreaks.
- MACHIAVELLI: power-seeking.
Vision benchmarks
- ImageNet: classification.
- COCO: detection / segmentation.
- VQAv2: visual QA.
- MMMU: a multi-modal counterpart to MMLU.
Problems
Goodhart's Law
- "When a measure becomes a target, it ceases to be a good measure."
- A saturated benchmark usually means models have been tuned to game it.
Data contamination
- Test sets leak into pretraining data.
- The result is inflated scores that do not reflect real capability.
- → LiveCodeBench and MMLU-Pro try to mitigate this.
Construct validity
- What is measured ≠ what is wanted.
- MMLU is multiple choice; real use is not.
Distribution shift
- Academic benchmarks ≠ real-world workloads.
Evaluation cost
- Evaluating with GPT-4 is expensive.
- LLM-as-judge introduces its own bias.
Modern best practice
- Multiple benchmarks: detect gaming of any single one.
- Held-out test set: fresh, unseen data.
- Contamination check: n-gram matching against training data.
- LLM-as-judge audit: check for self-bias.
- Human preference (Arena): the closest thing to ground truth.
- HELM (Stanford): holistic, multi-axis.
- Task-specific eval: an internal benchmark.
💻 Patterns
lm-evaluation-harness (EleutherAI)
```bash
pip install lm-eval
lm_eval --model hf \
    --model_args pretrained=meta-llama/Meta-Llama-3-8B \
    --tasks mmlu,gsm8k,arc_challenge,truthfulqa \
    --device cuda \
    --batch_size 8
```
→ The standard, reproducible way to run these tasks.
HELM (Stanford)
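```bash
# Holistic evaluation with HELM (package: crfm-helm).
# HELM is driven from the command line. Flag names follow recent releases
# and may differ in older ones, so check `helm-run --help` first.
pip install crfm-helm
# Replace openai/gpt-4 with a model identifier registered in your HELM install;
# `my-suite` is a placeholder suite name.
helm-run --run-entries mmlu:subject=philosophy,model=openai/gpt-4 \
    --suite my-suite --max-eval-instances 10
helm-summarize --suite my-suite
# Further scenarios to add as run entries: truthfulqa, bbq,
# real_toxicity_prompts, civil_comments. Exact entry names and their
# arguments are listed in the HELM documentation.
```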
Custom internal benchmark
```python
from collections import defaultdict

def evaluate_custom(model, test_cases):
    results = []
    for case in test_cases:
        response = model.generate(case.prompt)
        score = case.judge(response)  # task-specific scoring
        results.append({
            'case_id': case.id,
            'category': case.category,  # assumed attribute; needed for the breakdown
            'score': score,
            'response': response,
            'expected': case.expected,
        })
    # Per-category metric breakdown
    by_category = defaultdict(list)
    for r in results:
        by_category[r['category']].append(r['score'])
    for cat, scores in by_category.items():
        print(f'{cat}: {sum(scores) / len(scores):.3f}')
    return results
```
LLM-as-judge (with calibration)
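```python
def llm_judge(prompt, response, reference, judge_model, n_samples=5):
    judge_prompt = f"""Compare the response against the reference.
Score 1-5 (5 = matches reference, 1 = wrong).
Prompt: {prompt}
Reference: {reference}
Response: {response}
Score: """
    # Average N=5 samples to reduce variance (only useful with temperature > 0).
    # parse_score is a helper that extracts the numeric score from the reply.
    scores = [parse_score(judge_model.generate(judge_prompt)) for _ in range(n_samples)]
    scores = [s for s in scores if s is not None]  # drop unparseable replies
    return sum(scores) / len(scores) if scores else None
```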
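The judge function above relies on a `parse_score` helper that the note does not define. A minimal sketch of one possible version follows, assuming the judge replies in free text that contains an integer from 1 to 5. The names `parse_score` and `mean_score`, and the choice to return `None` for unparseable replies, are assumptions made for illustration.

```python
import re

def parse_score(reply, lo=1, hi=5):
    """Return the first standalone integer in [lo, hi] found in a judge reply.

    Returns None when no valid score is present, so the caller can drop or
    retry the sample instead of averaging in a bogus value.
    """
    for token in re.findall(r"\b\d+\b", reply):
        value = int(token)
        if lo <= value <= hi:
            return value
    return None

def mean_score(replies):
    """Average the parseable scores; None if nothing could be parsed."""
    scores = [s for s in map(parse_score, replies) if s is not None]
    return sum(scores) / len(scores) if scores else None
```

Taking the first in-range integer is a deliberate simplification: a reply such as "4/5" parses as 4, but a judge that restates the scale before answering would be misread, so constraining the output format in the prompt is still worthwhile.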
Contamination check (n-gram)
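```python
def contamination_check(test_examples, pretrain_corpus, n=13):
    # get_ngrams and pretrain_corpus.search are placeholders: the first yields
    # word n-grams as strings, the second returns candidate documents (strings)
    # from whatever index the corpus exposes.
    if not test_examples:
        return 0.0
    contaminated = 0
    for ex in test_examples:
        ngrams = set(get_ngrams(ex.text, n))
        for doc in pretrain_corpus.search(ngrams):
            if any(ng in doc for ng in ngrams):
                contaminated += 1
                break  # one hit is enough to flag this example
    # Fraction of test examples sharing at least one n-gram with the corpus
    return contaminated / len(test_examples)
```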
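The check above leaves `get_ngrams` and the corpus index abstract. For a corpus small enough to fit in memory, the same idea can be sketched end to end as below. This assumes whitespace tokenization and lowercasing, which is cruder than what production pipelines do; `build_index` and `contamination_rate` are illustrative names, not part of any library.

```python
def get_ngrams(text, n=13):
    """Yield word-level n-grams of `text` as space-joined lowercase strings."""
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])

def build_index(corpus_docs, n=13):
    """Collect every n-gram in the corpus into a set for O(1) lookups."""
    index = set()
    for doc in corpus_docs:
        index.update(get_ngrams(doc, n))
    return index

def contamination_rate(test_texts, corpus_docs, n=13):
    """Fraction of test texts that share at least one n-gram with the corpus."""
    if not test_texts:
        return 0.0
    index = build_index(corpus_docs, n)
    flagged = sum(
        1 for text in test_texts
        if any(ng in index for ng in get_ngrams(text, n))
    )
    return flagged / len(test_texts)
```

Texts shorter than `n` words produce no n-grams and are never flagged, so a smaller `n` may be needed for short test items, at the cost of more false positives.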
Pairwise human eval (Arena-style)
```python
import random

def pairwise_eval(model_a, model_b, prompts, n_judges=10):
    # human_judge(prompt, r1, r2) returns '1', '2', or 'tie';
    # majority(votes) reduces the votes to one of those labels.
    wins = {'a': 0, 'b': 0, 'tie': 0}
    for prompt in prompts:
        ra, rb = model_a.gen(prompt), model_b.gen(prompt)
        # Randomize display order to cancel position bias;
        # `first` records which model is shown in slot 1.
        if random.random() < 0.5:
            r1, r2, first = ra, rb, 'a'
        else:
            r1, r2, first = rb, ra, 'b'
        votes = [human_judge(prompt, r1, r2) for _ in range(n_judges)]
        winner = majority(votes)
        if winner == 'tie':
            wins['tie'] += 1
        elif winner == '1':
            wins[first] += 1
        else:
            wins['b' if first == 'a' else 'a'] += 1
    return wins
```
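The pairwise loop above calls a `majority` helper without defining it. One way to write it, assuming votes are the strings `'1'`, `'2'`, or `'tie'`, is sketched here. Treating a split vote as a tie is a design choice of this sketch, not something the note specifies.

```python
from collections import Counter

def majority(votes):
    """Return the most common vote, or 'tie' when the top two counts are level."""
    if not votes:
        return 'tie'
    counts = Counter(votes).most_common()
    # A dead heat between the two leading options is reported as a tie
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return 'tie'
    return counts[0][0]
```

With an odd `n_judges` and no explicit tie votes, a dead heat cannot occur, which is one reason to prefer an odd panel size.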
Bradley-Terry (Elo) for LMSYS Arena
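```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_elo(matches, models):
    # matches: [(winner_idx, loser_idx), ...]
    n = len(matches)
    # Each match is encoded twice, once per orientation, so both classes
    # are present; with y all ones LogisticRegression refuses to fit.
    X = np.zeros((2 * n, len(models)))
    y = np.zeros(2 * n)
    for i, (w, l) in enumerate(matches):
        X[i, w], X[i, l], y[i] = 1, -1, 1    # winner listed first, label 1
        X[n + i, w], X[n + i, l] = -1, 1     # mirrored row, label 0
    # Default L2 regularization shrinks ratings toward the 1000 anchor
    clf = LogisticRegression(fit_intercept=False).fit(X, y)
    # Elo-style rating = scaled coefficient
    return 400 / np.log(10) * clf.coef_[0] + 1000
```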
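To read the fitted ratings, it helps to convert a rating gap back into a win probability. The sketch below assumes the conventional Elo scale used in the scaling step above (base 10, 400 points); the function names are illustrative.

```python
import math

def expected_win_prob(rating_a, rating_b):
    """Probability that A beats B under the Elo / Bradley-Terry model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def rating_gap_for_prob(p):
    """Inverse mapping: rating gap (A minus B) that gives A win probability p."""
    return 400 * math.log10(p / (1 - p))
```

Under this scale a 400-point gap corresponds to 10:1 odds, so small differences between neighboring models on a leaderboard translate into win rates close to 50%.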
🤔 Decision criteria
| Goal | Benchmark |
|---|---|
| LLM general | MMLU-Pro + GPQA + Arena |
| Math | MATH + AIME |
| Code | SWE-bench + LiveCodeBench |
| Instruction | IFEval + AlpacaEval |
| Safety | TruthfulQA + HarmBench |
| Long context | RULER + Needle |
| Agentic | GAIA + WebArena |
| Multi-modal | MMMU |
| Internal | Custom (task-specific) |
Default: multiple benchmarks plus an internal eval, cross-checked against Arena.
🔗 Graph
- Parent: Evaluation
- Variants: MMLU · HumanEval · SWE-bench · GLUE · ImageNet
- Adjacent: Goodharts-Law · LLM-as-Judge
🤖 LLM usage
When: model selection; measuring the effect of a fine-tune; identifying capability gaps.
When not: relying on a single benchmark as the whole answer; skipping the contamination check.
❌ Anti-patterns
- Single benchmark: vulnerable to gaming.
- Training on a public test set: contamination.
- No Arena / human eval: academic ≠ real.
- Stale (saturated) benchmark: carries no information.
- LLM-as-judge only: self-bias (GPT-4 favors GPT-4).
- No internal eval: misses task-specific gaps.
🧪 Verification / duplicates
- Verified (Stanford HELM, EleutherAI harness, LMSYS).
- Trust level: A.
- Related: MMLU · Goodharts-Law · Data-Contamination · LLM-as-Judge.
🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: benchmark catalog + contamination + lm-eval / HELM code |