---
id: wiki-2026-0508-ai-evaluation-benchmarks
title: AI Evaluation & Benchmarks
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [LLM eval, model benchmark, MMLU, HumanEval, SWE-bench, Chatbot Arena, NIAH, RULER]
duplicate_of: none
source_trust_level: B
confidence_score: 0.9
verification_status: conceptual
tags: [llm-eval, benchmark, mmlu, humaneval, swe-bench, chatbot-arena, niah, contamination, ai-quality]
raw_sources: []
last_reinforced: 2026-05-09
github_commit: pending
inferred_by: Claude Opus 4.7 (manual cleanup 2026-05-09)
tech_stack:
  language: Python / TS
  framework: Promptfoo / LangSmith / Inspect / lm-eval-harness
---

# AI Evaluation & Benchmarks

## 📌 One-Line Insight (The Karpathy Summary)

> **"Seems good" vs. "measured"**. Benchmarks are standardized tests for each capability (math, code, reasoning, long context, tool use). Weaknesses: contamination, Goodhart's law, eval ≠ real-world. The modern stack = LMSYS Chatbot Arena (human preference) + SWE-bench (real tasks) + custom domain evals.

## 📖 Structured Knowledge (Synthesized Content)

### Benchmark families

#### 1. Knowledge / reasoning

| Benchmark | Measures | Note |
|---|---|---|
| **MMLU** (57 subjects) | Multi-domain knowledge | Most popular. Saturated at 90%+. |
| **MMLU-Pro** | Harder extension of MMLU | Frontier around 50%. |
| **GPQA** | PhD-level science | Not yet saturated. |
| **HellaSwag** | Commonsense reasoning | Old, saturated. |
| **ARC-AGI** | Pattern reasoning | OpenAI o3 reached 75% (human = 85%). |

#### 2. Math

| Benchmark | Measures | Note |
|---|---|---|
| **GSM8K** | Grade-school multi-step problems | Saturated (95%+). |
| **MATH** | Competition problems | Frontier 70-90%. |
| **AIME** | American Invitational Mathematics Examination (olympiad qualifier) | Hard. o1 / R1 do well. |
| **FrontierMath** | Research-level math | Frontier under 5%. |

#### 3. Code

| Benchmark | Measures | Note |
|---|---|---|
| **HumanEval** | Python function generation | Saturated (95%+). |
| **MBPP** | Python coding | Saturated. |
| **SWE-bench** | Real GitHub issues | Frontier ~50-60%. |
| **SWE-bench Verified** | Curated subset | More reliable. |
| **BigCodeBench** | Complex Python | Frontier ~30-50%. |
| **LiveCodeBench** | Recent problems (LeetCode) | Updated monthly (limits contamination). |

#### 4. Long context

| Benchmark | Measures | Note |
|---|---|---|
| **NIAH (Needle in a Haystack)** | Retrieval of a "needle" sentence | Now trivial; too easy to separate models. |
| **RULER** | Multi-needle retrieval, multi-hop tracing, aggregation | More realistic. |
| **LongBench** | Long-document QA | |
| **Loong** | Multi-document reasoning | |

#### 5. Agent / tool

| Benchmark | Measures | Note |
|---|---|---|
| **GAIA** | Real-world tasks (web, files) | Frontier ~30%. |
| **SWE-bench** | Code agents | Reference benchmark for Devin / Cursor-style agents. |
| **WebArena / VisualWebArena** | Browser agents | Frontier under 30%. |
| **MCP-Atlas** | Tool composition | |
| **τ-bench** | Customer-service simulation | |

#### 6. Real-world / human preference

| Benchmark | Measures | Note |
|---|---|---|
| **LMSYS Chatbot Arena** | Blind A/B votes + Elo | Most trusted real-world signal. |
| **MT-Bench** | Multi-turn quality (LLM-judge) | |
| **AlpacaEval** | LLM-judge | |
| **Vibes** | Subjective preference (community) | |

#### 7. Safety / alignment

| Benchmark | Measures | Note |
|---|---|---|
| **TruthfulQA** | Avoiding false statements | |
| **HarmBench** | Refusing harmful requests | |
| **Anthropic Persuasion** | | |
| **Constitutional AI eval** | | |

### Pitfalls (Goodhart's Law in AI)

1. **Contamination**: benchmark items leak into training data → inflated scores. Every frontier model is under suspicion.
2. **Overfitting**: each release is tuned toward specific benchmarks.
3. **"Solution lookup"**: GSM8K questions appear in training data, so the model retrieves answers instead of reasoning.
4. **Synthetic-data saturation**: questions generated by an LLM are then solved by the same LLM.
5. **Real-world ≠ benchmark**: high scores paired with poor UX are common.
6. **Subjective**: chatbot quality is tricky to measure.

→ Benchmark lifecycle: new → meaningful → saturated → no longer meaningful → retired.

### Trends in new benchmarks

- **Live / dynamic** (LiveCodeBench, ARC-AGI): test items are refreshed regularly or kept private.
- **Verified** (SWE-bench Verified): human-curated.
- **Real task** (GAIA, τ-bench): actual work.
- **Human pref** (Arena): hard to game.
- **Domain-specific**: medical (MedQA), legal (LegalBench), scientific.

## 💻 Code Patterns

### lm-eval-harness (EleutherAI standard)

```bash
pip install lm-eval

# Run benchmarks. The humaneval task executes model-generated code,
# so recent versions may require an explicit opt-in before running it.
lm_eval --model hf --model_args pretrained=meta-llama/Meta-Llama-3-8B \
  --tasks mmlu,gsm8k,humaneval \
  --batch_size 8 \
  --output_path results/

# Results: summary table on stdout, JSON written under results/
```

### Promptfoo (custom eval)

```yaml
# promptfooconfig.yaml
prompts:
  - 'Solve this math problem: {{problem}}'
providers:
  - openai:gpt-4o-mini
  - anthropic:messages:claude-haiku-4-5
tests:
  - vars:
      problem: 'If a train travels 60 mph for 2 hours, how far?'
    assert:
      - type: contains
        value: '120'
```

```bash
promptfoo eval
```

### LangSmith eval

```python
from langsmith import Client
from langchain.smith import RunEvalConfig, run_on_dataset

client = Client()

# `chain` is assumed to be a LangChain runnable defined elsewhere.
results = run_on_dataset(
    client=client,
    dataset_name='math-questions',
    llm_or_chain_factory=lambda: chain,
    evaluation=RunEvalConfig(evaluators=['qa', 'context_qa']),
)
```

### LLM-as-judge

```python
import json

def judge(question, answer, expected):
    # `judge_llm` is a placeholder client whose complete() returns a JSON string.
    prompt = f'''
Score the answer on 1-10 scale.
Question: {question}
Expected: {expected}
Answer: {answer}
Output JSON: {{"score": N, "reason": "..."}}
'''
    return json.loads(judge_llm.complete(prompt))
```

→ Cheap and scalable. Risk of bias: a model grading its own outputs tends to favor them.

### Writing a custom benchmark

```python
# Golden set
test_cases = [
    {'input': 'What is 2+2?', 'expected': '4'},
    {'input': 'Capital of France?', 'expected': 'Paris'},
    # ... 100+
]

def match(answer, expected):
    # Lenient check: the expected string appears in the answer, case-insensitive.
    return expected.lower() in answer.lower()

def evaluate(model):
    # `model` is any object exposing complete(prompt) -> str.
    correct = 0
    for case in test_cases:
        answer = model.complete(case['input'])
        if match(answer, case['expected']):
            correct += 1
    return correct / len(test_cases)
```

### Inspect (UK AISI)

```python
from inspect_ai import Task, task, eval
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate

@task
def my_task():
    return Task(
        dataset=[
            Sample(input='Capital of France?', target='Paris'),
            Sample(input='What is 2+2?', target='4'),
        ],
        solver=generate(),  # `plan=` is the older name for this argument
        scorer=match(),
    )

eval(my_task(), model='openai/gpt-4o-mini')
```

→ AISI / safety-focused.

### Contamination check

```python
# n-gram overlap (lower = better)
def check_contamination(test_set, train_set, n=8):
    # Collect every n-gram of whitespace tokens seen in the training docs.
    train_ngrams = set()
    for doc in train_set:
        tokens = doc.split()
        for i in range(len(tokens) - n + 1):
            train_ngrams.add(tuple(tokens[i:i+n]))

    # Count test items that share at least one n-gram with training data.
    overlapping = 0
    for q in test_set:
        tokens = q.split()
        for i in range(len(tokens) - n + 1):
            if tuple(tokens[i:i+n]) in train_ngrams:
                overlapping += 1
                break
    return overlapping / len(test_set) if test_set else 0.0
```

→ 5%+ overlap = suspicious.

### Domain-specific eval (e.g. medical)

```python
# MedQA-style
test = [
    {
        'q': 'Patient has fever, cough, fatigue. Most likely?',
        'options': ['flu', 'covid', 'allergies', 'cancer'],
        # Either answer can be correct depending on context.
        'correct': {'flu', 'covid'},
    },
]
# Score = top-1 or top-2 accuracy.
```

### Continuous eval (production)

```python
# `trace`, `llm`, and `log` are placeholders for your observability stack.
@trace
def chat(query):
    response = llm.complete(query)
    log({'query': query, 'response': response, 'tokens': ...})
    return response

# Daily:
# 1. Sample 100 production queries.
# 2. Score them with an LLM judge.
# 3. Track the trend over time.
```

→ Detects drift.
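
The daily loop above (sample, judge, trend) leaves open how a trend becomes a drift alert. One possible rule, sketched under assumptions: compare the mean judge score of a recent window against a fixed baseline window and flag a drop beyond a threshold. The function name `detect_drift`, the window sizes, and the `max_drop` threshold are all hypothetical choices for illustration, not part of any tool listed in this note.

```python
from statistics import mean

def detect_drift(daily_scores, baseline_days=7, recent_days=3, max_drop=0.5):
    """Return (drifted, drop), where drop = baseline mean - recent mean.

    daily_scores: one mean LLM-judge score per day, oldest first.
    The first `baseline_days` entries act as the reference window (assumption).
    """
    if len(daily_scores) < baseline_days + recent_days:
        raise ValueError("not enough history to compare windows")
    baseline = mean(daily_scores[:baseline_days])
    recent = mean(daily_scores[-recent_days:])
    drop = baseline - recent
    return drop > max_drop, drop

# Hypothetical daily mean judge scores on a 1-10 scale.
history = [8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0]
drifted, drop = detect_drift(history)
print(drifted, drop)
```

A fixed baseline keeps a slow decline from hiding inside a moving average; a rolling baseline would be less noisy but can mask gradual drift. Judge scores are noisy at 100 samples per day, so the threshold likely needs tuning against observed day-to-day variance.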
## 🤔 Decision Criteria

| Task | Benchmark |
|---|---|
| Generic capability | MMLU + GSM8K + HumanEval |
| Long context | RULER (NIAH is too easy) |
| Real-world coding | SWE-bench Verified |
| Real-world agent | GAIA / τ-bench |
| Human-perceived quality | LMSYS Chatbot Arena Elo |
| Math reasoning | AIME / FrontierMath |
| Domain (medical, legal) | Domain-specific (MedQA, LegalBench) |
| Production app | Custom golden set + LLM-judge |
| Safety | TruthfulQA + HarmBench |

**Default**: custom domain eval (built from production traffic) + Promptfoo CI gate. Run regression checks on every release.

## ⚠️ Contradictions & Updates

- **Fast saturation**: MMLU is saturated at 90%. New benchmarks are needed roughly every 6 months.
- **Real-world gap**: high benchmark scores with poor UX are common. Production eval matters more.
- **Contamination epidemic**: every frontier model is under suspicion. Live benchmarks (LiveCodeBench) are the answer.
- **Bench shopping**: vendors publish only their best benchmarks. Assume cherry-picking in every case.
- **Multi-modal**: evaluation is no longer text-only. Image (MMMU), video (Video-MME), audio.
- **Evaluating reasoning traces**: measuring the quality of o1 / R1 chain-of-thought is a new challenge.

## 🔗 Knowledge Links (Graph)

- Variants: [[LLM-as-Judge]]
- Tools: lm-eval-harness · Promptfoo · LangSmith · Inspect (AISI) · Braintrust · Helicone · Langfuse
- Related: [[Code Agent — Devin / Cursor / Claude Code]]

## 🤖 LLM Usage Hints (How to Use This Knowledge)

**When to use this knowledge:**

- Comparing the quality of new LLMs (deciding which model to use).
- Designing the eval behind a production system's release gate.
- Regression checks whenever a prompt changes.
- Measuring the quality of a domain-specific application.
- Reality-checking a vendor's marketing claims.

**When not to use it:**

- Relying on benchmarks alone (without real user feedback).
- Deciding from a single benchmark (overfit risk).
- Trusting a contaminated benchmark.
- Using an expensive frontier model for a small task (overkill).
- Using only generic benchmarks with no domain eval (fails in production).

## ❌ Anti-Patterns

- **Single benchmark + claim "best"**: cherry-picking. Use multiple benchmarks.
- **No contamination check**: scores cannot be trusted.
- **Same static benchmark every year**: once saturated, it carries no signal.
- **No human eval**: LLM-judge alone = bias.
- **No production eval**: leaves the benchmark vs. reality gap unmeasured.
- **Benchmark in the training data**: the model's score is dishonest.
- **Ignoring eval cost**: GPT-4 judge × 10k cases = $$.
- **Estimating a model's ceiling from a saturated benchmark**: misjudges the ceiling of every model.

## 🧪 Validation Status (Validation)

- **Information status:** verified (concept-level).
- **Source trust level:** B (Hugging Face leaderboard, Stanford HAI report, Papers With Code).
- **Review reason:** Manual cleanup. Specific benchmark numbers change monthly. Review every 6 months.

## 🧬 Duplicate Check

- **Existing similar documents:** [[LLM-Capabilities]] (related), [[Continuous-Learning-System]] (production eval), [[AI_Eval_Framework_Modern]] (tools).
- **Handling:** KEEP (overview of benchmarks).
- **Reason:** Kept separate from the tool / framework notes. This note covers per-benchmark detail.

## 🕓 Changelog

| Date | Change | Action | Trust level |
|------|--------|--------|-------------|
| 2026-05-08 | P-Reinforce Phase 1 normalization | UPDATE | A |
| 2026-05-09 | Manual cleanup: added code patterns, benchmark families, decision criteria, anti-patterns | UPDATE | B |