---
id: wiki-2026-0508-benchmarks
title: Benchmarks (AI Evaluation)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [벤치마크, AI benchmarks, MMLU, HumanEval, MATH, GLUE, SuperGLUE, evaluation, leaderboard, Goodharts Law]
duplicate_of: none
source_trust_level: A
confidence_score: 0.93
verification_status: applied
tags: [benchmark, evaluation, mmlu, humaneval, math, swe-bench, contamination, leaderboard, helm]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: lm-evaluation-harness / HELM / OpenCompass
---

# Benchmarks

## 📌 One-line insight

> **"A tape measure for intelligence."** Standardized tests make like-for-like comparison possible. They serve as milestones and as marketing. Goodhart's Law applies: once a metric becomes the target, it saturates. In the modern era, contamination is the main worry.

## 📖 Core

### NLP / LLM benchmarks

#### General reasoning

- **MMLU** (57 subjects, multiple choice): the standard of the GPT era.
- **MMLU-Pro** (2024): harder, and intended to mitigate contamination.
- **GPQA** (graduate-level science): hard.
- **BIG-Bench Hard**: targets known LLM weak points.
- **AGIEval**: human exams such as SAT, GRE, and LSAT.

#### Math

- **GSM8K** (grade school math): saturated.
- **MATH** (competition): hard.
- **AIME** / **IMO**: frontier.

#### Code

- **HumanEval** (OpenAI): saturated.
- **MBPP**: basic Python.
- **SWE-bench** (Princeton): real GitHub issues.
- **LiveCodeBench**: contamination-aware.

#### Instruction following

- **AlpacaEval** / **MT-Bench**: LLM-as-judge.
- **Arena (LMSYS)**: human pairwise comparison.
- **IFEval**: verifiable instructions.

#### Long context

- **Needle in a Haystack**: retrieval.
- **RULER**: multi-task.
- **InfiniteBench**.

#### Agentic / tool use

- **WebArena** / **GAIA**: real tasks.
- **OSWorld**: desktop GUI.
- **τ-bench** (tau-bench): customer service.

#### Safety / alignment

- **TruthfulQA**: honesty.
- **BBQ** (bias QA).
- **HarmBench** / **AdvBench**: jailbreaks.
- **MACHIAVELLI**: power-seeking.

### Vision benchmarks

- **ImageNet**: classification.
- **COCO**: detection / segmentation.
- **VQAv2**: visual QA.
- **MMMU**: the multimodal counterpart to MMLU.

### Problems

#### Goodhart's Law

- "When a measure becomes a target, it ceases to be a good measure."
- Once a benchmark saturates, its scores reflect optimization against the benchmark more than capability.

#### Data contamination

- Test sets leak into pretraining data.
- The model's score is inflated and no longer reflects generalization.
- → LiveCodeBench and MMLU-Pro try to mitigate this.

#### Construct validity

- What is measured ≠ what is wanted.
- MMLU is multiple choice; real usage is not.

#### Distribution shift

- Academic tasks ≠ real-world tasks.

#### Evaluation cost

- Evaluating with GPT-4 is expensive.
- LLM-as-judge introduces its own bias.

### Modern best practice

1. **Multiple benchmarks**: makes gaming of any single benchmark detectable.
2. **Held-out test**: use fresh, unpublished items.
3. **Contamination check**: n-gram matching against the training corpus.
4. **LLM-as-judge audit**: check for self-bias.
5. **Human preference** (Arena): the ground truth.
6. **HELM** (Stanford): holistic, multi-axis.
7. **Specific task eval**: an internal benchmark.

## 💻 Patterns

### lm-evaluation-harness (EleutherAI)

```bash
pip install lm-eval

# Hugging Face hosts the Llama 3 8B base model as meta-llama/Meta-Llama-3-8B
lm_eval --model hf \
  --model_args pretrained=meta-llama/Meta-Llama-3-8B \
  --tasks mmlu,gsm8k,arc_challenge,truthfulqa \
  --device cuda \
  --batch_size 8
```

→ Standard and reproducible.
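
### Judge self-bias audit

Best practice 4 above (auditing an LLM judge for self-bias) can be sketched as a comparison between the scores a judge gives to its own model family and the scores it gives to everyone else. This is a minimal sketch, not a standard method: the function name `self_preference_gap`, the `(judge_family, candidate_family, score)` record layout, and the sample ratings are all assumptions made for illustration.

```python
from collections import defaultdict


def self_preference_gap(ratings):
    """Per-judge gap: mean score given to its own model family minus
    mean score given to every other family.

    `ratings` is an iterable of (judge_family, candidate_family, score)
    tuples; this record layout is an assumption made for the sketch.
    """
    own, other = defaultdict(list), defaultdict(list)
    for judge, candidate, score in ratings:
        bucket = own if judge == candidate else other
        bucket[judge].append(score)

    gaps = {}
    for judge, own_scores in own.items():
        other_scores = other.get(judge)
        if not other_scores:
            continue  # no cross-family ratings, so no gap to report
        gaps[judge] = (sum(own_scores) / len(own_scores)
                       - sum(other_scores) / len(other_scores))
    return gaps


# Hypothetical 1-5 judge scores, for illustration only
ratings = [
    ("gpt", "gpt", 5), ("gpt", "gpt", 4),
    ("gpt", "llama", 3), ("gpt", "llama", 4),
    ("llama", "llama", 4), ("llama", "gpt", 4),
]
gaps = self_preference_gap(ratings)
print(gaps)  # {'gpt': 1.0, 'llama': 0.0}
```

A positive gap is only a signal, since the judge's own family may genuinely be stronger. Comparing against human ratings on the same items is one way to separate real quality differences from self-preference.
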
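### HELM (Stanford)

```bash
# Holistic evaluation. HELM is driven from its CLI; the commands below follow
# the HELM quick start, so verify the flags against your installed version.
pip install crfm-helm

# openai/gpt2 is the small model used in the quick start; substitute your own.
helm-run --run-entries mmlu:subject=philosophy,model=openai/gpt2 \
  --suite my-suite --max-eval-instances 10

# Add further run entries (TruthfulQA, BBQ, RealToxicityPrompts, CivilComments)
# the same way; exact entry names and arguments are listed in the HELM docs.
helm-summarize --suite my-suite
```

### Custom internal benchmark

```python
from collections import defaultdict

def evaluate_custom(model, test_cases):
    """Score each case with its task-specific judge and print per-category means.

    Assumes `model.generate(prompt)` returns text and that each case exposes
    `id`, `category`, `prompt`, `expected`, and `judge(response) -> float`.
    """
    results = []
    for case in test_cases:
        response = model.generate(case.prompt)
        results.append({
            'case_id': case.id,
            'category': case.category,      # needed for the breakdown below
            'score': case.judge(response),  # task-specific scoring
            'response': response,
            'expected': case.expected,
        })

    # Metric breakdown by category
    by_category = defaultdict(list)
    for r in results:
        by_category[r['category']].append(r['score'])
    for cat, scores in by_category.items():
        print(f'{cat}: {sum(scores) / len(scores):.3f}')
    return results
```

### LLM-as-judge (averaged over repeated samples)

```python
import re

def parse_score(text):
    """Return the first digit 1-5 in the judge's reply, or None if absent."""
    match = re.search(r'[1-5]', text)
    return int(match.group()) if match else None

def llm_judge(judge_model, prompt, response, reference, n_samples=5):
    """`judge_model.generate(text) -> str` is an assumed interface."""
    judge_prompt = f"""Compare the response against the reference.
Score 1-5 (5 = matches reference, 1 = wrong).

Prompt: {prompt}
Reference: {reference}
Response: {response}

Score: """
    # Average N samples to reduce variance (only helps if the judge samples
    # with temperature > 0); unparseable replies are dropped.
    scores = [parse_score(judge_model.generate(judge_prompt)) for _ in range(n_samples)]
    scores = [s for s in scores if s is not None]
    return sum(scores) / len(scores) if scores else None
```

### Contamination check (n-gram)

```python
def get_ngrams(text, n):
    """Whitespace-token n-grams, joined back into strings."""
    tokens = text.split()
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def contamination_check(test_examples, pretrain_corpus, n=13):
    """Fraction of test examples sharing at least one n-gram with the corpus.

    `pretrain_corpus.search(ngrams)` is a hypothetical index lookup that
    yields candidate documents as strings.
    """
    contaminated = 0
    for ex in test_examples:
        ngrams = set(get_ngrams(ex.text, n))
        if any(ng in doc for doc in pretrain_corpus.search(ngrams) for ng in ngrams):
            contaminated += 1
    return contaminated / len(test_examples) if test_examples else 0.0
```

### Pairwise human eval (Arena-style)

```python
import random
from collections import Counter

def pairwise_eval(model_a, model_b, prompts, human_judge, n_judges=10):
    """`human_judge(prompt, r1, r2)` is a hypothetical callback that
    returns '1', '2', or 'tie'."""
    wins = {'a': 0, 'b': 0, 'tie': 0}
    for prompt in prompts:
        ra, rb = model_a.gen(prompt), model_b.gen(prompt)
        # Randomize presentation order to cancel position bias
        first = 'a' if random.random() < 0.5 else 'b'
        r1, r2 = (ra, rb) if first == 'a' else (rb, ra)
        votes = [human_judge(prompt, r1, r2) for _ in range(n_judges)]
        winner = Counter(votes).most_common(1)[0][0]  # plurality vote
        if winner == 'tie':
            wins['tie'] += 1
        elif winner == '1':
            wins[first] += 1
        else:
            wins['b' if first == 'a' else 'a'] += 1
    return wins
```

### Bradley-Terry (Elo) for LMSYS Arena

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_elo(matches, models):
    # matches: [(winner_idx, loser_idx), ...]
    n = len(matches)
    X = np.zeros((2 * n, len(models)))
    y = np.zeros(2 * n)
    for i, (w, l) in enumerate(matches):
        # Row i: winner listed first (label 1)
        X[i, w], X[i, l], y[i] = 1, -1, 1
        # Row n + i: the mirrored match (label 0). Logistic regression needs
        # both classes; all-ones labels would make the fit fail.
        X[n + i, w], X[n + i, l], y[n + i] = -1, 1, 0
    # Default L2 regularization shrinks ratings toward 1000; raise C to weaken it.
    clf = LogisticRegression(fit_intercept=False).fit(X, y)
    # Elo = coefficient rescaled to the 400 / ln(10) scale, centered on 1000
    return 400 / np.log(10) * clf.coef_[0] + 1000
```

## 🤔 Decision criteria

| Goal | Benchmark |
|---|---|
| LLM general | MMLU-Pro + GPQA + Arena |
| Math | MATH + AIME |
| Code | SWE-bench + LiveCodeBench |
| Instruction | IFEval + AlpacaEval |
| Safety | TruthfulQA + HarmBench |
| Long context | RULER + Needle |
| Agentic | GAIA + WebArena |
| Multi-modal | MMMU |
| Internal | Custom (task-specific) |

**Default**: multiple benchmarks plus an internal eval, cross-checked against Arena.

## 🔗 Graph

- Parent: [[Evaluation]]
- Variants: [[MMLU]] · [[HumanEval]] · [[SWE-bench]] · [[GLUE]] · [[ImageNet]]
- Adjacent: [[Goodharts-Law]] · [[LLM-as-Judge]]

## 🤖 LLM usage

**When**: model selection; measuring the effect of fine-tuning; identifying capability gaps.

**When not**: relying on a single benchmark as the deciding factor; skipping the contamination check.

## ❌ Anti-patterns

- **Single benchmark**: vulnerable to gaming.
- **Training on a public test set**: contamination.
- **No Arena / human evaluation**: academic ≠ real-world.
- **Stale (saturated) benchmark**: carries no information.
- **LLM-as-judge only**: self-bias (GPT-4 favors GPT-4).
- **No internal eval**: misses task-specific gaps.

## 🧪 Verification / duplicates

- Verified (Stanford HELM, EleutherAI harness, LMSYS).
- Trust level: A.
- Related: [[MMLU]] · [[Goodharts-Law]] · [[Data-Contamination]] · [[LLM-as-Judge]].

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: benchmark catalog + contamination + lm-eval / HELM code |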