Antigravity Agent 504fd5fb42 [G1-Sync] Manual knowledge update

2026-05-10 22:08:15 +09:00

id, title, category, status, canonical_id, aliases, duplicate_of, source_trust_level, confidence_score, verification_status, tags, raw_sources, last_reinforced, github_commit, tech_stack

title

Benchmarks

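📌 One-line insight

"A tape measure for intelligence." Standardized tests allow like-for-like comparison. They serve as both milestones and marketing. Goodhart's Law applies: once a metric becomes a target, it saturates. In the modern era, contamination is the main worry.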

📖 Core

NLP / LLM benchmarks

General reasoning

MMLU (57 subjects, multiple choice): the standard throughout the GPT era.
MMLU-Pro (2024): harder; intended to mitigate contamination.
GPQA (graduate-level science): hard.
BIG-Bench Hard: targets known LLM weak points.
AGIEval: SAT, GRE, LSAT, and similar exams.

Math

GSM8K (grade school math): saturated.
MATH (competition): hard.
AIME / IMO: frontier.

Code

HumanEval (OpenAI): saturated.
MBPP: basic Python.
SWE-bench (Princeton): real GitHub issues.
LiveCodeBench: contamination-aware.

Instruction following

AlpacaEval / MT-Bench: LLM-as-judge.
Arena (LMSYS): human pairwise comparison.
IFEval: verifiable instructions.

Long context

Needle in Haystack: retrieval.
RULER: multi-task.
InfiniteBench.

Agentic / tool use

WebArena / GAIA: real tasks.
OSWorld: desktop GUI.
τ-bench (tau-bench): customer service.

Safety / alignment

TruthfulQA: honesty.
BBQ (bias QA).
HarmBench / AdvBench: jailbreaks.
MACHIAVELLI: power-seeking.

Vision benchmarks

ImageNet: classification.
COCO: detection / segmentation.
VQAv2: visual QA.
MMMU: a multi-modal counterpart to MMLU.

Problems

Goodhart's Law

"When a measure becomes a target, it ceases to be a good measure."
A saturated benchmark is one that models have learned to game.

Data contamination

Test sets leak into pretraining data.
LLMs then post inflated scores.
→ LiveCodeBench and MMLU-Pro aim to mitigate this.

Construct validity

What is measured ≠ what is wanted.
MMLU is multiple choice; real-world use is not.

Distribution shift

Academic benchmarks ≠ real-world use.

Evaluation cost

Evaluating with GPT-4 is expensive.
LLM-as-judge introduces bias.

Modern best practices

Multiple benchmarks: detects gaming of any single one.
Held-out test: fresh, unseen data.
Contamination check: n-gram matching.
LLM-as-judge audit: check for self-bias.
Human preference (Arena): human ground truth.
HELM (Stanford): holistic, multi-axis.
Specific task eval: internal benchmark.
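
One way to act on the "multiple benchmarks" practice is to look for a model whose rank on one benchmark is far better than its rank everywhere else. The sketch below is only an illustration under assumptions: the function names, the `min_gap` threshold, and the toy scores are hypothetical, and a flagged entry is a prompt for a closer look (contamination, overfitting), not proof of gaming.

```python
from statistics import median

def rank_models(scores):
    """Map model -> rank (1 = best) for one benchmark's {model: score} dict."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: i + 1 for i, model in enumerate(ordered)}

def flag_rank_outliers(results, min_gap=2):
    """results: {benchmark: {model: score}}.

    Returns (model, benchmark, rank, median_rank) tuples where the model ranks
    at least `min_gap` places better on that benchmark than its median rank.
    """
    ranks = {bench: rank_models(scores) for bench, scores in results.items()}
    # Only compare models that were scored on every benchmark.
    models = set.intersection(*(set(r) for r in ranks.values()))
    flagged = []
    for model in sorted(models):
        per_bench = {bench: ranks[bench][model] for bench in ranks}
        med = median(per_bench.values())
        for bench, rank in per_bench.items():
            if med - rank >= min_gap:
                flagged.append((model, bench, rank, med))
    return flagged

# Toy scores (made up): m4 is last on two benchmarks but first on the third.
results = {
    "bench_a": {"m1": 0.90, "m2": 0.80, "m3": 0.70, "m4": 0.60},
    "bench_b": {"m1": 0.85, "m2": 0.75, "m3": 0.65, "m4": 0.55},
    "bench_c": {"m1": 0.70, "m2": 0.60, "m3": 0.50, "m4": 0.95},
}
print(flag_rank_outliers(results))
```

Ranks are used instead of raw scores so that benchmarks with different scales can be compared directly.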

💻 Patterns

lm-evaluation-harness (EleutherAI)

pip install lm-eval

# Model ID as published on the Hugging Face Hub (gated; requires access approval)
lm_eval --model hf \
  --model_args pretrained=meta-llama/Meta-Llama-3-8B \
  --tasks mmlu,gsm8k,arc_challenge,truthfulqa \
  --device cuda \
  --batch_size 8

→ Standard and reproducible.

HELM (Stanford)

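# Holistic evaluation. HELM is normally driven through its command-line tools.
pip install crfm-helm

# This run entry follows the HELM quick start. Other scenarios of interest here:
# truthfulqa, bbq, real_toxicity_prompts, civil_comments.
# Check the HELM docs for each scenario's run-entry arguments and for model names.
helm-run --run-entries mmlu:subject=philosophy,model=openai/gpt2 \
  --suite my-suite --max-eval-instances 10
helm-summarize --suite my-suite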

Custom internal benchmark

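from collections import defaultdict

def evaluate_custom(model, test_cases):
    # Assumes each case exposes: id, category, prompt, expected, judge(response).
    results = []
    for case in test_cases:
        response = model.generate(case.prompt)
        score = case.judge(response)  # task-specific scoring
        results.append({
            'case_id': case.id,
            'category': case.category,  # needed for the breakdown below
            'score': score,
            'response': response,
            'expected': case.expected,
        })

    # Per-category metric breakdown
    by_category = defaultdict(list)
    for r in results:
        by_category[r['category']].append(r['score'])
    for cat, scores in by_category.items():
        print(f'{cat}: {sum(scores) / len(scores):.3f}')

    return results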

LLM-as-judge (with calibration)

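import re

def parse_score(text):
    # First digit 1-5 in the judge output; None if the judge gave no score.
    match = re.search(r'[1-5]', text)
    return int(match.group()) if match else None

def llm_judge(judge_model, prompt, response, reference, n_samples=5):
    judge_prompt = f"""Compare the response against the reference.
Score 1-5 (5 = matches reference, 1 = wrong).

Prompt: {prompt}
Reference: {reference}
Response: {response}

Score: """

    # Average N samples to reduce variance (only helps if the judge samples with temperature > 0).
    scores = [parse_score(judge_model.generate(judge_prompt)) for _ in range(n_samples)]
    scores = [s for s in scores if s is not None]
    return sum(scores) / len(scores) if scores else None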
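
The self-bias concern raised for LLM-as-judge can be checked against human labels. The following is a minimal sketch, assuming a set of responses already scored by both the judge and human raters on the same scale; the record keys (`judge_family`, `model_family`, `judge_score`, `human_score`) and the toy data are hypothetical.

```python
from statistics import mean

def self_preference_gap(records):
    """Return (own_gap, other_gap): mean of (judge_score - human_score) when the
    judge scores its own model family versus other families."""
    own = [r["judge_score"] - r["human_score"]
           for r in records if r["judge_family"] == r["model_family"]]
    other = [r["judge_score"] - r["human_score"]
             for r in records if r["judge_family"] != r["model_family"]]
    if not own or not other:
        raise ValueError("need records for both own-family and other-family responses")
    return mean(own), mean(other)

# Toy records (made up): the judge over-scores its own family by one point.
records = [
    {"judge_family": "X", "model_family": "X", "judge_score": 5, "human_score": 4},
    {"judge_family": "X", "model_family": "X", "judge_score": 4, "human_score": 3},
    {"judge_family": "X", "model_family": "Y", "judge_score": 3, "human_score": 3},
    {"judge_family": "X", "model_family": "Y", "judge_score": 4, "human_score": 4},
]
own_gap, other_gap = self_preference_gap(records)
print(own_gap, other_gap)
```

An own-family gap that is clearly larger than the other-family gap suggests the judge favors its own outputs; with small samples the difference may just be noise.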

Contamination check (n-gram)

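def get_ngrams(text, n):
    # Whitespace-token n-grams, joined back into strings for set membership.
    tokens = text.split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_check(test_examples, pretrain_docs, n=13):
    # Fraction of test examples sharing at least one n-gram with the corpus.
    # In-memory sketch: for a real corpus, swap the set for a search index or Bloom filter.
    corpus_ngrams = set()
    for doc in pretrain_docs:
        corpus_ngrams |= get_ngrams(doc, n)
    contaminated = sum(
        1 for ex in test_examples if get_ngrams(ex.text, n) & corpus_ngrams
    )
    return contaminated / len(test_examples) if test_examples else 0.0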

Pairwise human eval (Arena-style)

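import random
from collections import Counter

def majority(votes):
    # Most common vote; a split between the top two choices counts as a tie.
    counts = Counter(votes).most_common(2)
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return 'tie'
    return counts[0][0]

def pairwise_eval(model_a, model_b, prompts, human_judge, n_judges=10):
    # human_judge(prompt, r1, r2) returns '1', '2', or 'tie' from one rater.
    wins = {'a': 0, 'b': 0, 'tie': 0}
    for prompt in prompts:
        ra, rb = model_a.gen(prompt), model_b.gen(prompt)
        # Randomize display order so position does not reveal the model.
        if random.random() < 0.5:
            r1, r2, first = ra, rb, 'a'
        else:
            r1, r2, first = rb, ra, 'b'

        votes = [human_judge(prompt, r1, r2) for _ in range(n_judges)]
        winner = majority(votes)
        if winner == 'tie':
            wins['tie'] += 1
        elif winner == '1':
            wins[first] += 1
        else:
            wins['b' if first == 'a' else 'a'] += 1
    return wins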
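
A raw win count says little without an uncertainty estimate. As a rough sketch, assuming a tally with `'a'`, `'b'`, and `'tie'` keys like the one returned above, a percentile bootstrap gives an interval for model A's win rate among decisive comparisons. The function name and its defaults are hypothetical.

```python
import random

def win_rate_ci(wins, n_boot=2000, alpha=0.05, seed=0):
    """wins: {'a': int, 'b': int, 'tie': int}.

    Returns (rate, low, high): model A's win rate among non-tied prompts and a
    percentile-bootstrap interval at level 1 - alpha.
    """
    outcomes = [1] * wins["a"] + [0] * wins["b"]  # ties are excluded
    if not outcomes:
        raise ValueError("no decisive comparisons")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(outcomes)
    # Resample the outcomes with replacement and record each resampled win rate.
    rates = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_boot))
    low = rates[int(alpha / 2 * n_boot)]
    high = rates[int((1 - alpha / 2) * n_boot) - 1]
    return sum(outcomes) / n, low, high

rate, low, high = win_rate_ci({"a": 60, "b": 40, "tie": 15})
print(f"win rate {rate:.2f}, interval [{low:.2f}, {high:.2f}]")
```

If the interval includes 0.5, the data does not clearly separate the two models and more prompts are probably needed.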

Bradley-Terry (Elo) for LMSYS Arena

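import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_elo(matches, models):
    # matches: [(winner_idx, loser_idx), ...]
    # Each match is encoded twice (as a win and, mirrored, as a loss) so that y
    # contains both classes; an all-ones y makes LogisticRegression raise.
    n = len(matches)
    X = np.zeros((2 * n, len(models)))
    y = np.zeros(2 * n)
    for i, (w, l) in enumerate(matches):
        X[i, w], X[i, l], y[i] = 1, -1, 1
        X[n + i, w], X[n + i, l], y[n + i] = -1, 1, 0

    # Large C means weak regularization, closer to the Bradley-Terry maximum likelihood fit.
    clf = LogisticRegression(fit_intercept=False, C=1e6).fit(X, y)
    # Elo = scaled coefficient, centered at 1000
    return 400 / np.log(10) * clf.coef_[0] + 1000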
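
The scaling factor in the return statement follows from matching the Bradley-Terry win probability to the Elo one. A short derivation, where \beta_i denotes the fitted logistic coefficient of model i and R_i its Elo rating (the offset 1000 is a convention, not something the fit determines):

```latex
\begin{align*}
P(i \text{ beats } j) &= \frac{1}{1 + e^{-(\beta_i - \beta_j)}}
  && \text{(Bradley-Terry, logistic form)} \\
P(i \text{ beats } j) &= \frac{1}{1 + 10^{-(R_i - R_j)/400}}
  && \text{(Elo)} \\
e^{-(\beta_i - \beta_j)} = 10^{-(R_i - R_j)/400}
  &\;\Longrightarrow\; \beta_i - \beta_j = \frac{\ln 10}{400}\,(R_i - R_j) \\
R_i &= \frac{400}{\ln 10}\,\beta_i + 1000
\end{align*}
```

Only rating differences are identified by the data, so any constant could replace 1000 without changing predicted win probabilities.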

🤔 Decision criteria

Purpose	Benchmark
LLM general	MMLU-Pro + GPQA + Arena
Math	MATH + AIME
Code	SWE-bench + LiveCodeBench
Instruction	IFEval + AlpacaEval
Safety	TruthfulQA + HarmBench
Long context	RULER + Needle
Agentic	GAIA + WebArena
Multi-modal	MMMU
Internal	Custom (task-specific)

Default: multiple benchmarks + an internal eval, cross-checked against Arena.

🔗 Graph

Parent: Evaluation · ML-Metrics
Variants: MMLU · HumanEval · SWE-bench · GLUE · ImageNet
Applications: lm-evaluation-harness · HELM · OpenCompass · LMSYS-Arena
Adjacent: Goodharts-Law · Data-Contamination · LLM-as-Judge · Construct-Validity

🤖 LLM usage

When to use: model selection; measuring the effect of fine-tuning; identifying capability gaps. When not to use: relying on a single benchmark as the deciding factor; skipping the contamination check.

❌ Anti-patterns

Single benchmark: vulnerable to gaming.
Training on a public test set: contamination.
No Arena / human eval: academic ≠ real-world.
Stale (saturated) benchmark: carries no information.
LLM-as-judge only: self-bias (GPT-4 favors GPT-4).
No internal eval: misses task-specific gaps.
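
Whether a benchmark is stale can be estimated roughly by comparing the spread among top models with the sampling noise of the test set. The sketch below is a heuristic under assumptions, not a formal test: the function name, the binomial standard error, and the `z` multiplier are illustrative choices, and the scores are made up.

```python
import math

def is_saturated(scores, n_items, z=1.96):
    """scores: accuracies in [0, 1] for the models being compared.

    Returns True when the gap between the best and worst score is no larger
    than z binomial standard errors of the best score on n_items questions.
    """
    best, worst = max(scores), min(scores)
    se = math.sqrt(best * (1 - best) / n_items)
    return (best - worst) <= z * se

# Made-up scores: three models within one point of each other on 1000 items.
print(is_saturated([0.95, 0.94, 0.945], n_items=1000))
# Made-up scores: a 15-point gap on the same number of items.
print(is_saturated([0.60, 0.45], n_items=1000))
```

When the check returns True, the benchmark no longer separates the compared models, which is the "carries no information" case.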

🧪 Verification / duplicates

Verified (Stanford HELM, EleutherAI harness, LMSYS).
Confidence: A.
Related: MMLU · Goodharts-Law · Data-Contamination · LLM-as-Judge.

🕓 Changelog

Date	Change
2026-05-08	Phase 1
2026-05-10	Manual cleanup: benchmark catalog + contamination + lm-eval / HELM code
