"좋다" vs "측정". 매 capability (math, code, reasoning, long-context, tool use) 의 standardized test. 단점: contamination, Goodhart's law, eval ≠ real-world. Modern = LMSys Arena (human pref) + SWE-bench (real task) + custom domain eval.
📖 Structured Knowledge (Synthesized Content)
Benchmark families
1. Knowledge / reasoning
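| Benchmark | Measures | Note |
|---|---|---|
| MMLU (57 subjects) | Multi-domain knowledge | The most popular. Saturated at 90%+. |
| MMLU-Pro | Harder extension of MMLU | Frontier models score well below their MMLU numbers. |
| GPQA | PhD-level science | Not yet saturated. |
| HellaSwag | Commonsense reasoning | Old; saturated. |
| ARC-AGI | Pattern reasoning | OpenAI o3 reached about 75% (human ≈ 85%). |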
2. Math
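| Benchmark | Measures |
|---|---|
| GSM8K | Grade-school multi-step word problems |
| MATH | Competition problems |
| AIME | American Invitational Mathematics Examination (olympiad-qualifier level) |
| FrontierMath | Research-level problems |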
3. Code
| Benchmark | Measures |
|---|---|
| HumanEval | Python function generation |
| MBPP | Python coding |
| SWE-bench | Real GitHub issues |
| SWE-bench Verified | Curated subset |
| BigCodeBench | Complex Python |
| LiveCodeBench | Recent problems (LeetCode) |
4. Long context
| Benchmark | Measures |
|---|---|
| NIAH (Needle in a Haystack) | Retrieval of a "needle" sentence |
| RULER | Multi-needle, summarization, multi-hop |
| LongBench | Long-document QA |
| Loong | Multi-document reasoning |
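A needle-in-a-haystack case can be sketched roughly as follows: a known sentence is planted at a chosen depth inside filler text, and the model's answer is checked for the planted value. The helper names, the filler sentence, and the needle are illustrative assumptions, not part of any official NIAH harness.

```python
def build_haystack(needle: str, filler: str, n_filler: int, depth: float) -> str:
    """Place `needle` among `n_filler` copies of `filler` at relative `depth` (0.0 to 1.0)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    sentences = [filler] * n_filler
    position = round(depth * n_filler)  # 0 = very start, n_filler = very end
    sentences.insert(position, needle)
    return " ".join(sentences)


def retrieved(answer: str, secret: str) -> bool:
    """Lenient check: the secret value appears somewhere in the model's answer."""
    return secret.lower() in answer.lower()


# Hypothetical needle and filler, chosen only for illustration.
needle = "The secret code is 7421."
context = build_haystack(needle, "The sky was grey that morning.", n_filler=1000, depth=0.5)
prompt = context + "\n\nWhat is the secret code?"
# Send `prompt` to the model under test, then call retrieved(model_answer, "7421").
```

Sweeping `depth` and `n_filler` gives the usual depth-by-length grid. A single planted sentence is the simplest possible retrieval task, which is why multi-needle suites such as RULER are harder.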
5. Agent / tool
| Benchmark | Measures |
|---|---|
| GAIA | Real-world tasks (web, files) |
| SWE-bench | Code agent |
| WebArena / VisualWebArena | Browser agent |
| MCP-Atlas | Tool composition |
| τ-bench | Customer service simulation |
6. Real-world / human pref
| Benchmark | Measures |
|---|---|
| LMSYS Chatbot Arena | Blind A/B comparison + Elo |
| MT-Bench | Multi-turn quality (LLM judge) |
| AlpacaEval | LLM judge |
| Vibes | Subjective preference (community) |
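The "blind A/B + Elo" idea can be illustrated with the textbook Elo update. This is only a sketch: the K-factor of 32, the starting rating of 1000, and the model names are conventional or made-up assumptions, and a live leaderboard may fit ratings with a different statistical model.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return new (rating_a, rating_b). score_a is 1.0 (A wins), 0.5 (tie) or 0.0 (A loses)."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Hypothetical blind votes: (model shown as A, model shown as B, score for A).
ratings = {"model_x": 1000.0, "model_y": 1000.0}
votes = [
    ("model_x", "model_y", 1.0),
    ("model_x", "model_y", 1.0),
    ("model_x", "model_y", 0.0),
]
for a, b, score_a in votes:
    ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], score_a)
```

Each update moves the two ratings by equal and opposite amounts, so the total stays constant and only relative standing changes.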
7. Safety / alignment
| Benchmark | Measures |
|---|---|
| TruthfulQA | Avoiding false statements |
| HarmBench | Refusing harmful requests |
| Anthropic Persuasion | Constitutional AI eval |
Pitfalls (Goodhart's Law in AI)
Contamination: benchmark data leaks into the training data → inflated scores. Every frontier model is under suspicion.
Overfitting: each release is optimized for specific benchmarks.
"Solution lookup": GSM8K questions are in the training data, so the model retrieves the answer instead of reasoning.
Synthetic-data saturation: questions written by an LLM end up being solved by the same LLM.
Real-world ≠ benchmark: a high score paired with poor UX is common.
Subjectivity: measuring chatbot quality is tricky.
→ Benchmark lifecycle: new → meaningful → saturated → no longer meaningful → retired.
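One rough way to screen for the contamination described above is to check how many of a benchmark item's word n-grams appear verbatim in a sample of training text. The sketch below assumes whitespace tokenization and an 8-gram window, both arbitrary choices; the function names and the example question are hypothetical, and a real audit would need far larger corpora and more careful normalization.

```python
def ngrams(text: str, n: int) -> set:
    """Set of lowercase word n-grams in `text` (whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item: str, corpus_text: str, n: int = 8) -> float:
    """Fraction of the item's n-grams that also occur verbatim in the corpus text."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item shorter than n tokens: nothing to compare
    corpus_grams = ngrams(corpus_text, n)
    return len(item_grams & corpus_grams) / len(item_grams)


# Made-up benchmark question and two made-up corpus snippets.
question = ("A farmer has 12 cows and buys 5 more cows every month "
            "for 3 months. How many cows does the farmer have in total?")
leaked_page = "forum post: " + question + " the answer is 27"
clean_page = "Cows are domesticated animals kept on farms around the world for milk and meat."

leak_score = overlap_ratio(question, leaked_page)
clean_score = overlap_ratio(question, clean_page)
```

A ratio near 1.0 suggests the item was copied into the corpus; a ratio of 0.0 only rules out verbatim copies, not paraphrases or translated versions.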
Trends in new benchmarks
Live / dynamic (LiveCodeBench, ARC-AGI): refreshed on a rolling basis so that test items stay ahead of the training data.
Verified (SWE-bench Verified): human-curated.
Real tasks (GAIA, τ-bench): actual work.
Human preference (Arena): hard to game.
Domain-specific: medical (MedQA), legal (LegalBench), scientific.
💻 Code Patterns
lm-eval-harness (the EleutherAI standard)
pip install lm-eval
# Run benchmark
lm_eval --model hf --model_args pretrained=meta-llama/Meta-Llama-3-8B \
  --tasks mmlu,gsm8k,humaneval \
  --batch_size 8
# Results are printed as a table; pass --output_path <dir> to also save them as JSON.
Promptfoo (custom eval)
# promptfooconfig.yaml
prompts:
  - 'Solve this math problem: {{problem}}'
providers:
  - openai:gpt-4o-mini
  - anthropic:claude-haiku-4-5
tests:
  - vars:
      problem: 'If a train travels 60 mph for 2 hours, how far?'
    assert:
      - type: contains
        value: '120'
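LLM-as-judge
import json

# judge_llm is a placeholder for any client whose complete(prompt) returns the model's text.
def judge(question, answer, expected):
    prompt = f'''
Score the answer on a 1-10 scale.
Question: {question}
Expected: {expected}
Answer: {answer}
Output JSON: {{"score": N, "reason": "..."}}'''
    # Raises json.JSONDecodeError if the judge replies with anything other than JSON.
    return json.loads(judge_llm.complete(prompt))
→ Cheap and scalable. Risk of bias: a model that grades its own outputs tends to favor them.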
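Writing a custom benchmark
# Golden set
test_cases = [
    {'input': 'What is 2+2?', 'expected': '4'},
    {'input': 'Capital of France?', 'expected': 'Paris'},
    # ... 100+
]

def match(answer, expected):
    # Assumed matcher: the expected string appears in the answer, case-insensitively.
    return expected.lower() in answer.lower()

def evaluate(model):
    # model is any object whose complete(prompt) returns the answer text.
    correct = 0
    for case in test_cases:
        answer = model.complete(case['input'])
        if match(answer, case['expected']):
            correct += 1
    return correct / len(test_cases)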
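With a golden set of only a hundred or so cases, the accuracy figure carries a fair amount of sampling noise. One way to make that visible is a Wilson score interval around the accuracy; the sketch below assumes a 95% level (z = 1.96) and uses a hypothetical helper name.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96):
    """Wilson score interval (low, high) for an accuracy of correct/total."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# Hypothetical result: 85 of 100 golden-set cases answered correctly.
low, high = wilson_interval(85, 100)
```

For 85 correct out of 100 the interval spans roughly 0.77 to 0.91, so a two-point change between releases on a set this small is well within noise.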
Inspect (UK AISI)
from inspect_ai import Task, task, eval
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate

@task
def my_task():
    return Task(
        dataset=[
            Sample(input='Capital of France?', target='Paris'),
            Sample(input='What is 2+2?', target='4'),
        ],
        # Older Inspect releases name this argument `plan`.
        solver=[generate()],
        scorer=match(),
    )

eval(my_task(), model='openai/gpt-4o-mini')
Domain-specific eval
# MedQA-style
test = [
    {
        'q': 'Patient has fever, cough, fatigue. Most likely?',
        'options': ['flu', 'covid', 'allergies', 'cancer'],
        # More than one answer can be acceptable, depending on context.
        'correct': ['flu', 'covid'],
    },
]
# Score = top-1 or top-2 accuracy.
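Top-1 and top-2 accuracy for this kind of test can be sketched as below, assuming the model returns its options ranked from most to least likely and each case lists every acceptable answer. The function name, the second case, and the rankings are made up for illustration.

```python
def top_k_accuracy(cases, ranked_predictions, k=1):
    """Share of cases where an acceptable answer is among the model's top-k options."""
    if len(cases) != len(ranked_predictions):
        raise ValueError("need exactly one ranking per case")
    if not cases:
        return 0.0
    hits = 0
    for case, ranking in zip(cases, ranked_predictions):
        acceptable = set(case['correct'])
        if acceptable & set(ranking[:k]):
            hits += 1
    return hits / len(cases)


# Hypothetical cases and model rankings.
cases = [
    {'q': 'Patient has fever, cough, fatigue. Most likely?',
     'options': ['flu', 'covid', 'allergies', 'cancer'],
     'correct': ['flu', 'covid']},
    {'q': 'Itchy eyes and sneezing every spring. Most likely?',
     'options': ['flu', 'covid', 'allergies', 'cancer'],
     'correct': ['allergies']},
]
ranked = [
    ['covid', 'flu', 'allergies', 'cancer'],
    ['flu', 'allergies', 'covid', 'cancer'],
]
top1 = top_k_accuracy(cases, ranked, k=1)
top2 = top_k_accuracy(cases, ranked, k=2)
```

Here the second case misses at k=1 but hits at k=2, which is the situation top-2 scoring is meant to credit.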
Continuous eval (production)
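# trace, llm and log are placeholders for your tracing decorator, model client and logger.
@trace
def chat(query):
    response = llm.complete(query)
    log({'query': query, 'response': response, 'tokens': ...})
    return response

# Daily:
# 1. Sample 100 production queries.
# 2. Score them with an LLM judge.
# 3. Track the trend over time.
→ Detects drift.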
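The trend step can be reduced to a simple drift check: compare the latest daily mean judge score against the mean of the preceding days and flag a large drop. The 7-day baseline and the 0.5-point threshold are arbitrary assumptions, and the function name is hypothetical.

```python
from statistics import mean


def detect_drift(daily_scores, baseline_days=7, max_drop=0.5):
    """daily_scores: per-day mean judge scores, oldest first.

    Returns True when the latest day falls more than `max_drop` below
    the mean of the `baseline_days` days before it.
    """
    if len(daily_scores) < baseline_days + 1:
        return False  # not enough history to form a baseline
    baseline = mean(daily_scores[-(baseline_days + 1):-1])
    return baseline - daily_scores[-1] > max_drop


# Hypothetical histories on the 1-10 judge scale.
stable = [8.0] * 7 + [7.8]
dropped = [8.0] * 7 + [7.0]
stable_alert = detect_drift(stable)
dropped_alert = detect_drift(dropped)
```

A check like this only catches downward shifts in the judge score; changes in the query mix would need separate monitoring.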
🤔 Decision Criteria
| Task | Benchmark |
|---|---|
| Generic capability | MMLU + GSM8K + HumanEval |
| Long context | RULER (NIAH is too easy) |
| Real-world coding | SWE-bench Verified |
| Real-world agent | GAIA / τ-bench |
| Human-perceived quality | LMSYS Arena Elo |
| Math reasoning | AIME / FrontierMath |
| Domain (medical, legal) | Domain-specific (MedQA, LegalBench) |
| Production app | Custom golden set + LLM judge |
| Safety | TruthfulQA + HarmBench |
Default: custom domain eval (built from production traffic) + a Promptfoo CI gate. Check for regressions on every release.
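Independently of the tool used, a per-release regression gate boils down to comparing the candidate's scores against a stored baseline and failing the build when any suite drops by more than a tolerance. The sketch below uses hypothetical suite names and scores and an assumed tolerance of 0.02.

```python
def find_regressions(baseline, candidate, tolerance=0.02):
    """Return a message for every suite whose score dropped by more than `tolerance`."""
    regressions = []
    for suite, base_score in baseline.items():
        new_score = candidate.get(suite)
        if new_score is None:
            regressions.append(f"{suite}: missing from candidate results")
        elif base_score - new_score > tolerance:
            regressions.append(f"{suite}: {base_score:.3f} -> {new_score:.3f}")
    return regressions


# Hypothetical scores for the previous release and the release candidate.
baseline = {"golden_set": 0.91, "safety": 0.98}
candidate = {"golden_set": 0.86, "safety": 0.98}
problems = find_regressions(baseline, candidate)
# In CI, exit with a non-zero status when `problems` is non-empty.
```

The tolerance should be chosen with the size of the golden set in mind, since small sets produce noisy scores.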
⚠️ Contradictions & Updates
Fast saturation: MMLU is saturated at 90%+. A new benchmark is needed roughly every 6 months.
Real-world gap: high benchmark scores paired with poor UX are common. Production evals matter more.
Contamination epidemic: every frontier model is under suspicion. Live benchmarks (LiveCodeBench) are the answer.
Bench shopping: vendors publish only the benchmarks they do best on, so any reported set of results may be cherry-picked.
Multimodal: text alone is not enough. Image (MMMU), video (Video-MME), audio.
Evaluating reasoning traces: measuring the quality of the chain-of-thought from o1 / R1 is a new challenge.