"지능 의 줄자". 매 standardized 의 same comparison. 매 milestone + 매 marketing. 매 Goodhart's Law (매 metric 의 target 의 saturate). 매 modern era 의 contamination 의 worry.
📖 핵심
매 NLP / LLM benchmark
General reasoning
MMLU (57 subjects, multiple choice): the standard of the GPT era.
MMLU-Pro (2024): harder; intended as a fix for contamination.
GPQA (graduate-level science): hard.
BIG-Bench Hard: tasks that target LLM weak points.
AGIEval: human exams such as the SAT, GRE, and LSAT.
Math
GSM8K (grade school math): saturated.
MATH (competition): hard.
AIME / IMO: frontier-level difficulty.
Code
HumanEval (OpenAI): saturated.
MBPP: basic Python.
SWE-bench (Princeton): real GitHub issues.
LiveCodeBench: contamination-aware.
Instruction following
AlpacaEval / MT-Bench: LLM-as-judge.
Arena (LMSYS): human pairwise comparison.
IFEval: verifiable instructions.
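"Verifiable" here means that compliance can be checked by a program rather than by a judge model. A minimal sketch of that idea follows; the checker names (`max_words`, `must_include`, `follows_all`) and the rules themselves are illustrative assumptions, not IFEval's own code.

```python
from typing import Callable

# Hypothetical checkers; names and rules are illustrative, not taken from IFEval.
def max_words(limit: int) -> Callable[[str], bool]:
    """Return a checker that passes when the response has at most `limit` words."""
    return lambda response: len(response.split()) <= limit

def must_include(keyword: str) -> Callable[[str], bool]:
    """Return a checker that passes when `keyword` appears (case-insensitive)."""
    return lambda response: keyword.lower() in response.lower()

def follows_all(response: str, checkers: list[Callable[[str], bool]]) -> bool:
    """A response complies only if every programmatic check passes."""
    return all(check(response) for check in checkers)

checks = [max_words(10), must_include("benchmark")]
print(follows_all("A benchmark is a shared test.", checks))  # True
print(follows_all("No keyword here.", checks))               # False
```

Because each check is deterministic, the score does not depend on a judge model's preferences.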
Long context
Needle in a Haystack: retrieval.
RULER: multi-task.
InfiniteBench.
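The retrieval probe behind Needle in a Haystack can be sketched in a few lines: bury one fact at a chosen depth in filler text, then ask for it. The function name, the filler sentence, and the pass criterion below are assumptions for illustration, not the benchmark's actual harness.

```python
def build_needle_prompt(filler: str, needle: str, depth: float,
                        n_sentences: int = 200) -> str:
    """Place `needle` at a relative `depth` (0.0 = start, 1.0 = end) of filler text."""
    sentences = [filler] * n_sentences
    position = round(depth * n_sentences)
    sentences.insert(position, needle)
    return " ".join(sentences)

needle = "The secret code is 7421."
prompt = build_needle_prompt("The sky was grey that day.", needle, depth=0.5)
question = "What is the secret code?"

def passed(answer: str) -> bool:
    """Hypothetical pass criterion: the answer must contain the buried value."""
    return "7421" in answer
```

Sweeping `depth` and `n_sentences` gives the usual grid of retrieval accuracy by position and context length.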
Agentic / tool use
WebArena / GAIA: real tasks.
OSWorld: desktop GUI.
τ-bench (tau-bench): customer service.
Safety / alignment
TruthfulQA: honesty.
BBQ (bias QA).
HarmBench / AdvBench: jailbreaks.
MACHIAVELLI: power-seeking.
Vision benchmarks
ImageNet: classification.
COCO: detection / segmentation.
VQAv2: visual QA.
MMMU: multi-modal MMLU.
Problems
Goodhart's Law
"When a measure becomes a target, it ceases to be a good measure."
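Holistic evaluation (HELM)

    # Illustrative pseudocode: HELM is normally driven through its `helm-run`
    # command-line tool, and the `run(...)` call below is a simplified stand-in
    # rather than a verified HELM API.
    from helm.benchmark.run import run

    scenarios = [
        'mmlu',
        'truthfulqa',
        'bbq',
        'real_toxicity_prompts',
        'civil_comments',
    ]
    run(model='openai/gpt-4', scenarios=scenarios)
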
Custom internal benchmark
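
    from collections import defaultdict

    def evaluate_custom(model, test_cases):
        """Run every test case through the model and print the mean score per category."""
        results = []
        for case in test_cases:
            response = model.generate(case.prompt)
            score = case.judge(response)  # task-specific scoring
            results.append({
                'case_id': case.id,
                'category': case.category,  # assumes each case has a `category` attribute
                'score': score,
                'response': response,
                'expected': case.expected,
            })

        # Metric breakdown by category
        by_category = defaultdict(list)
        for r in results:
            by_category[r['category']].append(r['score'])
        for cat, scores in by_category.items():
            print(f'{cat}: {sum(scores) / len(scores):.3f}')
        return results
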
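The loop above only assumes that each test case exposes `id`, `prompt`, `expected`, a category, and a `judge` method. One possible shape for such a case is sketched here; the class name `EvalCase` and the exact-match scoring rule are assumptions for illustration, and a real internal benchmark would plug in its own judge.

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    """Hypothetical test-case container for a custom benchmark."""
    id: str
    category: str
    prompt: str
    expected: str

    def judge(self, response: str) -> float:
        """Exact match after trimming whitespace and lowercasing: 1.0 or 0.0."""
        return float(response.strip().lower() == self.expected.strip().lower())

case = EvalCase(id="geo-1", category="geography",
                prompt="Capital of France?", expected="Paris")
print(case.judge("  paris "))  # 1.0
print(case.judge("Lyon"))      # 0.0
```

Keeping `judge` on the case lets different categories use different scoring rules without changing the evaluation loop.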
LLM-as-judge (with calibration)

    def llm_judge(prompt, response, reference):
        # `judge_model` and `parse_score` are assumed to be defined elsewhere.
        judge_prompt = f"""Compare the response against the reference.
    Score 1-5 (5 = matches reference, 1 = wrong).

    Prompt: {prompt}
    Reference: {reference}
    Response: {response}
    Score: """
        # Average N=5 judge samples to reduce variance
        # (this only helps if the judge samples with temperature > 0).
        scores = [parse_score(judge_model.generate(judge_prompt)) for _ in range(5)]
        return sum(scores) / len(scores)

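The judge function relies on a `parse_score` helper that the note does not define. A minimal sketch of one way to write it follows, assuming the judge replies with free text that contains an integer in the 1-5 range. The regex approach and the error handling are assumptions, not the author's implementation.

```python
import re

def parse_score(judge_output: str, low: int = 1, high: int = 5) -> int:
    """Return the first standalone integer in [low, high] found in the judge's reply."""
    for match in re.finditer(r"\b(\d+)\b", judge_output):
        value = int(match.group(1))
        if low <= value <= high:
            return value
    # Fail loudly so unparseable replies are not silently counted as a score.
    raise ValueError(f"no score in [{low}, {high}] found in: {judge_output!r}")

print(parse_score("4"))                   # 4
print(parse_score("Score: 5, it matches"))  # 5
```

Raising on an unparseable reply, rather than returning a default, keeps malformed judge output from skewing the average.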
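
    import random

    # Human pairwise comparison.
    # `human_judge` is assumed to return '1', '2', or 'tie', and `majority` to
    # return the most common vote; both are assumed to be defined elsewhere.
    def pairwise_eval(model_a, model_b, prompts, n_judges=10):
        wins = {'a': 0, 'b': 0, 'tie': 0}
        for prompt in prompts:
            ra, rb = model_a.gen(prompt), model_b.gen(prompt)
            # Randomize presentation order to cancel position bias;
            # `label` records which model is shown first.
            if random.random() < 0.5:
                r1, r2, label = ra, rb, 'a'
            else:
                r1, r2, label = rb, ra, 'b'
            votes = [human_judge(prompt, r1, r2) for _ in range(n_judges)]
            winner = majority(votes)
            if winner == 'tie':
                wins['tie'] += 1
            elif winner == '1':
                wins[label] += 1
            else:
                wins['a' if label == 'b' else 'b'] += 1
        return wins
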
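The pairwise loop depends on a `majority` helper that is not shown. One possible version is sketched here, assuming votes are the strings '1', '2', or 'tie' and that an even split between the top two options should count as a tie. That tie rule is an assumption, not something the note specifies.

```python
from collections import Counter

def majority(votes: list[str]) -> str:
    """Return the most common vote, or 'tie' when the top two counts are equal.

    Expects at least one vote.
    """
    counts = Counter(votes).most_common(2)
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "tie"
    return counts[0][0]

print(majority(["1", "1", "2"]))  # 1
print(majority(["1", "2"]))       # tie
```

Using an odd `n_judges` reduces, but does not eliminate, split votes, since judges may also vote 'tie'.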
Bradley-Terry (Elo) for LMSYS Arena
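
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def fit_elo(matches, models):
        # matches: [(winner_idx, loser_idx), ...]
        X = np.zeros((len(matches), len(models)))
        for i, (w, l) in enumerate(matches):
            X[i, w] = 1
            X[i, l] = -1
        # Add each match mirrored (the loser's view, label 0) so both classes are
        # present; with every label equal to 1, LogisticRegression refuses to fit.
        X = np.vstack([X, -X])
        y = np.concatenate([np.ones(len(matches)), np.zeros(len(matches))])
        # Note: the default L2 penalty shrinks ratings slightly toward 1000.
        clf = LogisticRegression(fit_intercept=False).fit(X, y)
        # Elo = coefficient rescaled from natural-log odds to the base-10, 400-point scale
        return 400 / np.log(10) * clf.coef_[0] + 1000
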
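To read ratings on this scale, it helps to convert a rating gap back into a win probability. The sketch below uses the standard Elo expectation formula; the function name is an illustrative choice.

```python
def expected_win_prob(rating_a: float, rating_b: float) -> float:
    """Elo-scale probability that A beats B: 1 / (1 + 10 ** ((R_b - R_a) / 400))."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

print(expected_win_prob(1000, 1000))            # 0.5
print(round(expected_win_prob(1100, 1000), 3))  # 0.64
```

A 100-point gap therefore corresponds to roughly a 64% expected win rate, which is why small Arena rating differences should not be over-read.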
🤔 Decision criteria
Recommended benchmarks by purpose:
LLM general: MMLU-Pro + GPQA + Arena
Math: MATH + AIME
Code: SWE-bench + LiveCodeBench
Instruction: IFEval + AlpacaEval
Safety: TruthfulQA + HarmBench
Long context: RULER + Needle in a Haystack
Agentic: GAIA + WebArena
Multi-modal: MMMU
Internal: Custom (task-specific)
Default: several public benchmarks plus an internal eval, cross-checked against Arena.