2nd/10_Wiki/Topics/AI_and_ML/Benchmarks.md
Antigravity Agent f8b21af4be Wiki cleanup: error-doc removal, dedup merge, link normalization
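Large-scale cleanup of 10_Wiki/Topics:
- Removed 227 error-capture / unfinished stub documents
- Merged 43 cross-folder duplicate clusters (63 files → redirect)
- Normalized link names: fixed broken links, pointed redirects directly at their targets, and mapped concepts (~2,400 items)
- Created 6 new category MOCs
- Removed 10,058 unresolved related-keyword links from Graph sections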

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 23:52:15 +09:00

7.7 KiB

id: wiki-2026-0508-benchmarks
title: Benchmarks (AI Evaluation)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: 벤치마크, AI benchmarks, MMLU, HumanEval, MATH, GLUE, SuperGLUE, evaluation, leaderboard, Goodharts Law
duplicate_of: none
source_trust_level: A
confidence_score: 0.93
verification_status: applied
tags: benchmark, evaluation, mmlu, humaneval, math, swe-bench, contamination, leaderboard, helm
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: lm-evaluation-harness / HELM / OpenCompass

Benchmarks

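📌 One-line insight

"A measuring tape for intelligence." Benchmarks are standardized tests that compare models under the same conditions. They serve as research milestones and as marketing material at the same time. Goodhart's Law applies: once a metric becomes a target, it saturates and stops discriminating between models. The main modern worry is test-set contamination.

📖 Core

NLP / LLM benchmarks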

General reasoning

  • MMLU (57 subjects, multiple choice): the standard of the GPT era.
  • MMLU-Pro (2024): harder, and designed to reduce contamination.
  • GPQA (graduate-level science): hard.
  • BIG-Bench Hard: tasks where LLMs are weak.
  • AGIEval: human exams such as the SAT, GRE, and LSAT.

Math

  • GSM8K (grade school math): saturated.
  • MATH (competition): hard.
  • AIME / IMO: the frontier.

Code

  • HumanEval (OpenAI): saturated.
  • MBPP: basic Python problems.
  • SWE-bench (Princeton): real GitHub issues.
  • LiveCodeBench: contamination-aware.
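
HumanEval-style code benchmarks are usually reported as pass@k: the probability that at least one of k sampled solutions passes the unit tests. The sketch below shows the commonly used combinatorial estimator, assuming n samples were generated per problem and c of them passed. The function names `pass_at_k` and `benchmark_pass_at_k` are illustrative and are not taken from any benchmark's code.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c passed: 1 - C(n-c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(per_problem, k):
    """Average pass@k over a list of (n, c) pairs, one pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in per_problem) / len(per_problem)
```

Sampling more than k solutions per problem (n > k) reduces the variance of the estimate compared with drawing exactly k samples and checking whether any of them passes.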

Instruction following

  • AlpacaEval / MT-Bench: LLM-as-judge.
  • Arena (LMSYS): pairwise human preference.
  • IFEval: verifiable instructions.
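
"Verifiable" means the instruction can be checked by a program instead of a judge model. The following is a minimal sketch of that idea with three hypothetical checkers (a word limit, a required keyword, valid JSON). It is not IFEval's actual implementation, and every name in it is an assumption made for illustration.

```python
import json

def max_words(limit):
    """Checker: the response has at most `limit` whitespace-separated words."""
    return lambda text: len(text.split()) <= limit

def contains_keyword(word):
    """Checker: the response mentions `word` (case-insensitive)."""
    return lambda text: word.lower() in text.lower()

def is_valid_json(text):
    """Checker: the whole response parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

def instruction_accuracy(response, checks):
    """Fraction of the given instruction checks that the response satisfies."""
    passed = [check(response) for check in checks]
    return sum(passed) / len(passed)
```

Because each check is deterministic, the score carries no judge bias, at the price of covering only instructions that can be phrased as a programmatic rule.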

Long context

  • Needle in a Haystack: retrieval from long context.
  • RULER: multi-task long-context evaluation.
  • InfiniteBench.
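
A needle-in-a-haystack test hides one fact inside long filler text and asks the model to retrieve it. Below is a rough sketch of how such a prompt could be assembled and scored. The helper names, the sentence-level insertion, and the substring match are all simplifying assumptions rather than the procedure of any specific benchmark.

```python
def build_haystack(filler_sentences, needle, depth):
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end) of the filler."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    pos = int(depth * len(filler_sentences))
    sentences = filler_sentences[:pos] + [needle] + filler_sentences[pos:]
    return " ".join(sentences)

def retrieval_hit(model_answer, expected):
    """Lenient scoring: the expected string appears anywhere in the answer."""
    return expected.lower() in model_answer.lower()
```

Sweeping `depth` and the number of filler sentences gives the usual grid of retrieval accuracy by position and context length.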

Agentic / tool use

  • WebArena / GAIA: real-world tasks.
  • OSWorld: desktop GUI tasks.
  • τ-bench (tau-bench): customer-service scenarios.

Safety / alignment

  • TruthfulQA: truthfulness.
  • BBQ (bias QA).
  • HarmBench / AdvBench: jailbreak resistance.
  • MACHIAVELLI: power-seeking behavior.

Vision benchmarks

  • ImageNet: classification.
  • COCO: detection / segmentation.
  • VQAv2: visual QA.
  • MMMU: a multimodal counterpart to MMLU.

Problems

Goodhart's Law

  • "When a measure becomes a target, it ceases to be a good measure."
  • A saturated benchmark usually means models have learned to game it.

Data contamination

  • Test sets leak into pretraining data.
  • The model then posts inflated scores that reflect memorization rather than ability.
  • → LiveCodeBench and MMLU-Pro are designed to mitigate this.

Construct validity

  • What is measured ≠ what we want to know.
  • MMLU is multiple choice; real-world use is not.

Distribution shift

  • Academic benchmarks ≠ real-world usage.

Evaluation cost

  • Evaluating with GPT-4 is expensive.
  • LLM-as-judge introduces its own bias.

Modern best practices

  1. Multiple benchmarks: detects gaming of any single one.
  2. Held-out test set: fresh, unseen items.
  3. Contamination check: n-gram matching.
  4. LLM-as-judge audit: check for self-bias.
  5. Human preference (Arena): ground truth.
  6. HELM (Stanford): holistic, multi-axis.
  7. Task-specific eval: an internal benchmark.

💻 Patterns

lm-evaluation-harness (EleutherAI)

pip install lm-eval

lm_eval --model hf \
  --model_args pretrained=meta-llama/Meta-Llama-3-8B \
  --tasks mmlu,gsm8k,arc_challenge,truthfulqa \
  --device cuda \
  --batch_size 8

→ Standardized and reproducible.

HELM (Stanford)

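# Holistic evaluation. HELM is driven from the command line (pip install crfm-helm),
# not from a Python run() function. The run entry below follows the HELM quick-start
# pattern; swap in the model you want to evaluate.
helm-run --run-entries mmlu:subject=philosophy,model=openai/gpt2 \
  --suite my-suite --max-eval-instances 10

# Aggregate the results of the suite
helm-summarize --suite my-suite

# The other scenarios of interest (TruthfulQA, BBQ, RealToxicityPrompts,
# CivilComments) are added as further run entries; exact entry names and
# model identifiers depend on the installed HELM version.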

Custom internal benchmark

from collections import defaultdict

def evaluate_custom(model, test_cases):
    """Score every test case and print the mean score per category."""
    results = []
    for case in test_cases:
        response = model.generate(case.prompt)
        score = case.judge(response)  # task-specific scorer
        results.append({
            'case_id': case.id,
            'category': case.category,  # assumes each case carries a category
            'score': score,
            'response': response,
            'expected': case.expected,
        })

    # Per-category metric breakdown
    by_category = defaultdict(list)
    for r in results:
        by_category[r['category']].append(r['score'])
    for cat, scores in by_category.items():
        print(f'{cat}: {sum(scores) / len(scores):.3f}')

    return results

LLM-as-judge (with calibration)

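import re

def parse_score(text):
    """Return the first digit 1-5 in the judge output, or None if absent."""
    match = re.search(r'[1-5]', text)
    return int(match.group()) if match else None

def llm_judge(judge_model, prompt, response, reference, n_samples=5):
    judge_prompt = f"""Compare the response against the reference.
Score 1-5 (5 = matches reference, 1 = wrong).

Prompt: {prompt}
Reference: {reference}
Response: {response}

Score: """

    # Average n_samples judgments to reduce variance (this only helps if the
    # judge samples with temperature > 0)
    scores = [parse_score(judge_model.generate(judge_prompt)) for _ in range(n_samples)]
    scores = [s for s in scores if s is not None]
    return sum(scores) / len(scores) if scores else None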
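
One way to read "calibration" here is to compare the judge's scores with human scores on a shared sample before trusting the judge at scale. The sketch below assumes two equal-length lists of 1-5 scores for the same responses. `judge_calibration` and `pearson` are hypothetical helpers, and offset plus correlation is only one of several reasonable summaries.

```python
def pearson(xs, ys):
    """Pearson correlation; returns NaN if either list has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / (vx * vy) ** 0.5

def judge_calibration(judge_scores, human_scores):
    """Return (mean signed offset, correlation) of judge scores vs. human scores."""
    if not judge_scores or len(judge_scores) != len(human_scores):
        raise ValueError("need two non-empty lists of equal length")
    offset = sum(j - h for j, h in zip(judge_scores, human_scores)) / len(judge_scores)
    return offset, pearson(judge_scores, human_scores)
```

A positive offset with high correlation suggests the judge ranks responses like humans do but scores them too generously. A low correlation suggests the judge should not replace human review for that task.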

Contamination check (n-gram)

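def get_ngrams(text, n):
    tokens = text.split()  # whitespace tokens, joined back into strings
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def contamination_check(test_examples, pretrain_corpus, n=13):
    # pretrain_corpus.search() stands for a hypothetical index lookup that
    # returns candidate document texts sharing any of the given n-grams
    if not test_examples:
        return 0.0
    contaminated = 0
    for ex in test_examples:
        ngrams = set(get_ngrams(ex.text, n))
        for doc in pretrain_corpus.search(ngrams):
            if any(ng in doc for ng in ngrams):
                contaminated += 1
                break
    return contaminated / len(test_examples)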
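
The check above depends on a corpus search interface that is not defined in this note. As a self-contained variant for small experiments, the sketch below indexes an in-memory list of documents as a set of n-grams and flags any test item that shares at least one n-gram with it. The function names, the lowercased whitespace tokenization, and the in-memory set are assumptions; a real pretraining corpus would need a scalable index.

```python
def ngrams(text, n):
    """Set of lowercased whitespace-token n-grams in `text`."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_index(corpus_docs, n):
    """Union of all n-grams that occur in the corpus documents."""
    index = set()
    for doc in corpus_docs:
        index |= ngrams(doc, n)
    return index

def contamination_rate(test_texts, index, n):
    """Fraction of test items sharing at least one n-gram with the corpus."""
    if not test_texts:
        return 0.0
    hits = sum(1 for text in test_texts if ngrams(text, n) & index)
    return hits / len(test_texts)

# Toy usage: n=5 only because the example sentences are short
corpus = ["the quick brown fox jumps over the lazy dog near the river"]
tests = ["Q: the quick brown fox jumps over what?", "what is two plus two equal to"]
rate = contamination_rate(tests, build_index(corpus, 5), 5)
```

Exact n-gram overlap misses paraphrased or translated leaks, so a low rate is evidence of cleanliness rather than proof.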

Pairwise human eval (Arena-style)

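import random
from collections import Counter

def majority(votes):
    """Most common vote; an even split between the top two counts as a tie."""
    ranked = Counter(votes).most_common(2)
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return 'tie'
    return ranked[0][0]

def pairwise_eval(model_a, model_b, prompts, human_judge, n_judges=10):
    # human_judge(prompt, r1, r2) is assumed to return '1', '2', or 'tie'
    wins = {'a': 0, 'b': 0, 'tie': 0}
    for prompt in prompts:
        ra, rb = model_a.gen(prompt), model_b.gen(prompt)
        # Randomize presentation order to cancel position bias;
        # label records which model is shown first
        if random.random() < 0.5:
            r1, r2, label = ra, rb, 'a'
        else:
            r1, r2, label = rb, ra, 'b'

        votes = [human_judge(prompt, r1, r2) for _ in range(n_judges)]
        winner = majority(votes)
        if winner == 'tie':
            wins['tie'] += 1
        elif winner == '1':
            wins[label] += 1
        else:
            wins['a' if label == 'b' else 'b'] += 1
    return wins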
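
Raw win counts from a pairwise evaluation are easier to interpret with an uncertainty range attached. The following sketch turns a `wins` dictionary with keys 'a', 'b', and 'tie' into a win rate for model A with a Wilson score interval. Dropping ties and using z = 1.96 (roughly 95%) are assumptions of this example, and `win_rate_interval` is a hypothetical helper.

```python
from math import sqrt

def win_rate_interval(wins, z=1.96):
    """Win rate of 'a' over decisive (non-tie) prompts, with a Wilson interval."""
    n = wins['a'] + wins['b']
    if n == 0:
        raise ValueError("no decisive comparisons")
    p = wins['a'] / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, center - half, center + half
```

If the interval contains 0.5, the evaluation has not shown that either model is preferred, which usually means more prompts are needed.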

Bradley-Terry (Elo) for LMSYS Arena

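import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_elo(matches, models):
    # matches: [(winner_idx, loser_idx), ...]
    # Each match is entered twice with opposite signs so that both classes
    # (y=1 and y=0) are present; a target of all ones cannot be fit.
    n = len(matches)
    X = np.zeros((2 * n, len(models)))
    y = np.concatenate([np.ones(n), np.zeros(n)])
    for i, (w, l) in enumerate(matches):
        X[i, w], X[i, l] = 1, -1          # winner listed first -> y = 1
        X[n + i, w], X[n + i, l] = -1, 1  # mirrored row -> y = 0

    # Large C keeps the fit close to unregularized maximum likelihood
    clf = LogisticRegression(fit_intercept=False, C=1e6).fit(X, y)
    # Elo-style rating = scaled Bradley-Terry coefficient, offset to 1000
    return 400 / np.log(10) * clf.coef_[0] + 1000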
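
As a dependency-free cross-check on the logistic-regression fit, the Bradley-Terry strengths can also be estimated with the classic minorization-maximization (MM) iteration and then mapped to the Elo scale. This is a sketch under stated assumptions: every model must have at least one win, the iteration count is fixed rather than convergence-tested, and `bradley_terry_elo` is a hypothetical name.

```python
from math import exp, log, log10

def bradley_terry_elo(matches, n_models, iters=200):
    """matches: list of (winner_idx, loser_idx). Returns Elo-scaled ratings."""
    wins = [0] * n_models
    games = [[0] * n_models for _ in range(n_models)]
    for w, l in matches:
        wins[w] += 1
        games[w][l] += 1
        games[l][w] += 1
    if any(count == 0 for count in wins):
        raise ValueError("every model needs at least one win for a finite rating")

    p = [1.0] * n_models  # Bradley-Terry strengths
    for _ in range(iters):
        new_p = []
        for i in range(n_models):
            denom = sum(games[i][j] / (p[i] + p[j])
                        for j in range(n_models) if j != i)
            new_p.append(wins[i] / denom)
        # Fix the scale: geometric mean of strengths = 1, so mean rating = 1000
        g = exp(sum(log(x) for x in new_p) / n_models)
        p = [x / g for x in new_p]
    return [1000 + 400 * log10(x) for x in p]
```

With two models where one wins 3 of 4 games, the strength ratio is 3, which corresponds to a rating gap of 400·log10(3) ≈ 191 points.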

🤔 Decision criteria

| Purpose | Benchmark |
| --- | --- |
| General LLM | MMLU-Pro + GPQA + Arena |
| Math | MATH + AIME |
| Code | SWE-bench + LiveCodeBench |
| Instruction following | IFEval + AlpacaEval |
| Safety | TruthfulQA + HarmBench |
| Long context | RULER + Needle in a Haystack |
| Agentic | GAIA + WebArena |
| Multimodal | MMMU |
| Internal | Custom (task-specific) |

Default: multiple benchmarks plus an internal eval, cross-checked against Arena.

🔗 Graph

🤖 LLM usage

When: model selection, measuring the effect of fine-tuning, identifying capability gaps.
When not: relying on a single benchmark as the deciding factor, or reporting scores without a contamination check.

Anti-patterns

  • Single benchmark: vulnerable to gaming.
  • Training on a public test set: contamination.
  • No Arena / human eval: academic scores ≠ real-world quality.
  • Stale (saturated) benchmark: carries no information.
  • LLM-as-judge only: self-bias (GPT-4 favors GPT-4 outputs).
  • No internal eval: task-specific gaps go unnoticed.

🧪 Verification / duplicates

🕓 Changelog

| Date | Change |
| --- | --- |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: benchmark catalog, contamination, lm-eval / HELM code |