[G1-Sync] Manual knowledge update

2026-05-10 22:08:15 +09:00
parent 21ac3ed255
commit 504fd5fb42
3011 changed files with 380280 additions and 206977 deletions
@@ -1,94 +1,268 @@
 ---
 id: wiki-2026-0508-benchmarks
-title: Benchmarks
+title: Benchmarks (AI Evaluation)
 category: 10_Wiki/Topics
-status: needs_review
+status: verified
 canonical_id: self
-aliases: [P-Reinforce-AUTO-BENC-001]
+aliases: [벤치마크, AI benchmarks, MMLU, HumanEval, MATH, GLUE, SuperGLUE, evaluation, leaderboard, Goodharts Law]
 duplicate_of: none
 source_trust_level: A
-confidence_score: 0.98
-tags: [auto-reinforced, benchmarks, evaluation, performance-metrics, standardization, comparative-Analysis]
+confidence_score: 0.93
+verification_status: applied
+tags: [benchmark, evaluation, mmlu, humaneval, math, swe-bench, contamination, leaderboard, helm]
 raw_sources: []
-last_reinforced: 2026-04-20
+last_reinforced: 2026-05-10
 github_commit: pending
-inferred_by: Claude Opus 4.7 (auto-normalize 2026-05-08)
 tech_stack:
-  language: unspecified
-  framework: unspecified
+  language: Python
+  framework: lm-evaluation-harness / HELM / OpenCompass
 ---

-# [[Benchmarks|Benchmarks]]
+# Benchmarks

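-## 📌 One-Line Insight (The Karpathy Summary)
-> "The measuring tape of intelligence: a standardized problem set designed to compare the performance of different systems or algorithms by the same yardstick, and a competitive arena that sets the milestones of technical innovation."
+## 📌 One-Line Insight
+> **"The measuring tape of intelligence."** A standardized yardstick for like-for-like comparison; part milestone, part marketing. Goodhart's Law applies: once a metric becomes the target, scores saturate. In the modern era, contamination is the main worry.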

-## 📖 Structured Knowledge (Synthesized Content)
-Benchmarks are collections of metrics and test tools for measuring and comparing performance in a specific field.
+## 📖 Core

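-1.  **Major benchmarks in AI**:
-    *   **ImageNet**: The dataset that drove the leap in image recognition performance.
-    *   **GLUE/SuperGLUE**: Standards that evaluate natural language understanding from multiple angles.
-    *   **MMLU**: Comprehensively evaluates broad domain knowledge and reasoning (the main battleground of the recent large-model race).
-2.  **Why it matters**:
-    *   Objective numbers make the limits of the technology explicit and define the next challenge researchers should focus on.
-3.  **Risk (Goodhart's Law)**:
-    *   Once a metric becomes the target, systems can fixate on chasing test scores (benchmarking hacks) rather than on real performance gains.
+### NLP / LLM benchmarks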

-## ⚠️ Contradictions & Updates
 **Conflict with past data**: Earlier practice centered on static test sets; current practice is shifting to dynamically changing benchmarks in response to the data contamination risk, where a model sees the test questions in its training data (RL Update).
 **Policy change (RL Update)**: Safety benchmarks, which evaluate ethical safety and harmfulness rather than technical performance alone, have become a mandatory gate for model deployment.
+#### General reasoning
+- **MMLU** (57 subjects, multiple choice): the standard of the GPT era.
+- **MMLU-Pro** (2024): harder, and intended to address contamination.
+- **GPQA** (graduate-level science): hard.
+- **BIG-Bench Hard**: tasks where LLMs are weak.
+- **AGIEval**: standardized exams such as SAT, GRE, LSAT.

-## 🔗 Knowledge Links (Graph)
- [[Assessment|Assessment]], [[Algorithmic Fairness|Algorithmic Fairness]], Foundational Models, [[Ps-Reinforce|Ps-Reinforce]], [[Safety & Reliability|Safety & Reliability]]
- **Modern Tech/Tools**: Hugging Face Open LLM Leaderboard, HELM (Holistic Evaluation of Language Models).
---
+#### Math
+- **GSM8K** (grade school math): saturated.
+- **MATH** (competition): hard.
+- **AIME** / **IMO**: frontier.

-## 🤖 LLM Usage Hints (How to Use This Knowledge)
+#### Code
+- **HumanEval** (OpenAI): saturated.
+- **MBPP**: basic Python.
+- **SWE-bench** (Princeton): real GitHub issues.
+- **LiveCodeBench**: contamination-aware.
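
Code benchmarks in the HumanEval family are usually reported as pass@k: the probability that at least one of k sampled completions passes the unit tests. A minimal sketch of the commonly used unbiased estimator, assuming n completions were sampled per problem and c of them passed; the function name is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled completions, c of which passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass.
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # 20 samples for one problem, 5 of them passing.
    print(pass_at_k(20, 5, 1))
    print(pass_at_k(20, 5, 10))
```

The benchmark score is the mean of this value over all problems. Sampling more completions than k (n > k) lowers the variance of the estimate.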

-**When to use this knowledge:**
 *(TODO)*
+#### Instruction following
+- **AlpacaEval** / **MT-Bench**: LLM-as-judge.
+- **Arena (LMSYS)**: human pairwise comparison.
+- **IFEval**: verifiable instructions.
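
A verifiable instruction is one a program can check without a judge model. The sketch below illustrates the idea only; it is not IFEval's implementation, and the three checks and their names are hypothetical.

```python
def max_words(limit):
    """Check: the response has at most `limit` whitespace-separated words."""
    return lambda text: len(text.split()) <= limit


def contains_keyword(keyword):
    """Check: the response mentions `keyword` (case-insensitive)."""
    return lambda text: keyword.lower() in text.lower()


def no_commas():
    """Check: the response contains no comma."""
    return lambda text: "," not in text


def instruction_accuracy(response, checks):
    """Fraction of the given checks that the response satisfies."""
    if not checks:
        raise ValueError("need at least one check")
    passed = sum(1 for check in checks if check(response))
    return passed / len(checks)


if __name__ == "__main__":
    checks = [max_words(5), contains_keyword("benchmark"), no_commas()]
    print(instruction_accuracy("Benchmarks measure model progress", checks))
```

Because every check is deterministic, scores are reproducible and carry no judge bias, at the cost of covering only instructions that can be expressed as code.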

-**When not to use it:**
 *(TODO)*
+#### Long context
+- **Needle in Haystack**: retrieval.
+- **RULER**: multi-task.
+- **InfiniteBench**.
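
A needle-in-a-haystack test hides one fact at a chosen depth in long filler text and asks the model to retrieve it. A minimal sketch of the prompt construction and scoring, assuming sentence-level filler and substring matching; both helper names are hypothetical.

```python
def build_haystack(filler_sentences, needle, depth):
    """Insert `needle` at relative `depth` (0.0 = start, 1.0 = end)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = round(depth * len(filler_sentences))
    sentences = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(sentences)


def retrieved(answer, expected):
    """Score one trial: does the model's answer contain the hidden fact?"""
    return expected.lower() in answer.lower()


if __name__ == "__main__":
    filler = ["The sky was grey."] * 4
    needle = "The secret code is 7421."
    print(build_haystack(filler, needle, depth=0.5))
```

Sweeping `depth` and the filler length gives the usual grid of retrieval accuracy by position and context size.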

-## 🧪 Validation Status
+#### Agentic / tool use
+- **WebArena** / **GAIA**: real tasks.
+- **OSWorld**: desktop GUI.
+- **τ-bench** (tau-bench): customer service.

 **Information status:** needs_review
 **Source trust level:** A
 **Review reason:** *(P-Reinforce Phase 1 automatic normalization. Body needs verification.)*
+#### Safety / alignment
+- **TruthfulQA**: honesty.
+- **BBQ** (bias QA).
+- **HarmBench** / **AdvBench**: jailbreaks.
+- **MACHIAVELLI**: power-seeking.

-## 🧬 Duplicate Check
+### Vision benchmarks
+- **ImageNet**: classification.
+- **COCO**: detection / segmentation.
+- **VQAv2**: visual QA.
+- **MMMU**: a multi-modal counterpart to MMLU.

 **Existing similar documents:** *(TODO: see the indexer cluster report)*
 **Handling:** UPDATE (automatic normalization)
 **Reason:** Phase 1 normalization: backfilled the old template and missing fields.
+### Problems

-## 🕓 Changelog
+#### Goodhart's Law
+- "When a measure becomes a target, it ceases to be a good measure."
+- A saturated benchmark is a sign that models are gaming it.

-| Date | Change | Handling | Trust |
-|------|-----------|-----------|--------|
-| 2026-05-08 | P-Reinforce Phase 1 normalization (frontmatter + header standardization) | UPDATE | A |
+#### Data contamination
+- Test sets leak into pretraining data.
+- LLMs then post artificially high scores.
+- → Mitigated by LiveCodeBench and MMLU-Pro.

-## 💻 Code Patterns
+#### Construct validity
+- What is measured ≠ what is wanted.
+- MMLU is multiple choice; real use is not.

-**Pattern 1:** *(TODO: structural skeleton reflecting this project's conventions)*
+#### Distribution shift
+- Academic ≠ real-world.

-```text
-# TODO
+#### Evaluation cost
+- Evaluation involving GPT-4-class models is expensive.
+- LLM-as-judge is biased.
+
+### Modern best practices
+1. **Multiple benchmarks**: detects gaming of any single one.
+2. **Held-out test**: fresh, unseen data.
+3. **Contamination check**: n-gram matching.
+4. **LLM-as-judge audit**: check for self-bias.
+5. **Human preference** (Arena): treated as ground truth.
+6. **HELM** (Stanford): holistic, multi-axis.
+7. **Task-specific eval**: an internal benchmark.
+
+## 💻 Patterns
+
+### lm-evaluation-harness (EleutherAI)
+```bash
+pip install lm-eval
+
+lm_eval --model hf \
+  --model_args pretrained=meta-llama/Meta-Llama-3-8B \
+  --tasks mmlu,gsm8k,arc_challenge,truthfulqa \
+  --device cuda \
+  --batch_size 8
 ```

-## 🤔 Decision Criteria
+→ Standard and reproducible.

-**When to use option A:**
- *(TODO)*
+### HELM (Stanford)
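+```bash
+# Holistic evaluation via the HELM command-line tools (pip install crfm-helm).
+# Run-entry names and arguments vary by HELM version; check the installed docs.

-**When to use option B:**
 *(TODO)*
+# The suite name below is an arbitrary label for this run.
+helm-run \
+  --run-entries mmlu:subject=philosophy,model=openai/gpt-4 \
+  --suite benchmarks-wiki \
+  --max-eval-instances 100
+# truthfulqa, bbq, real_toxicity_prompts and civil_comments each need
+# their own run entry in the same format.
+helm-summarize --suite benchmarks-wiki
+```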

-**Default:**
-> *(TODO)*
+### Custom internal benchmark
+```python
+from collections import defaultdict
+
+def evaluate_custom(model, test_cases):
+    """Score each case with its own judge and print per-category means."""
+    results = []
+    for case in test_cases:
+        response = model.generate(case.prompt)
+        score = case.judge(response)  # task-specific scoring function
+        results.append({
+            'case_id': case.id,
+            'category': case.category,  # assumed attribute; needed below
+            'score': score,
+            'response': response,
+            'expected': case.expected,
+        })
+
+    # Per-category metric breakdown
+    by_category = defaultdict(list)
+    for r in results:
+        by_category[r['category']].append(r['score'])
+    for cat, scores in by_category.items():
+        print(f'{cat}: {sum(scores) / len(scores):.3f}')
+
+    return results
+```

-## ❌ Anti-Patterns
+### LLM-as-judge (with calibration)
+```python
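+import re
+
+def parse_score(text):
+    """Return the first digit 1-5 in the judge output, or None."""
+    match = re.search(r'[1-5]', text)
+    return int(match.group()) if match else None
+
+def llm_judge(judge_model, prompt, response, reference, n_samples=5):
+    judge_prompt = f"""Compare the response against the reference.
+Score 1-5 (5 = matches reference, 1 = wrong).

 **[Anti-pattern]:** *(TODO: what not to do + why + what to do instead)*
+Prompt: {prompt}
+Reference: {reference}
+Response: {response}
+
+Score: """
+
+    # Average several samples to reduce variance (only useful when the judge
+    # samples with temperature > 0); unparseable outputs are skipped.
+    raw = [parse_score(judge_model.generate(judge_prompt)) for _ in range(n_samples)]
+    scores = [s for s in raw if s is not None]
+    return sum(scores) / len(scores) if scores else None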
+```
+
+### Contamination check (n-gram)
+```python
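+def get_ngrams(text, n):
+    """Whitespace-tokenized n-grams, as a set of tuples."""
+    tokens = text.split()
+    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
+
+def contamination_check(test_examples, pretrain_docs, n=13):
+    """Fraction of test examples sharing at least one n-gram with the corpus."""
+    if not test_examples:
+        return 0.0
+    # In-memory set: fine for a corpus sample; a full corpus needs an index.
+    corpus_ngrams = set()
+    for doc in pretrain_docs:  # iterable of document strings
+        corpus_ngrams |= get_ngrams(doc, n)
+    contaminated = sum(
+        1 for ex in test_examples if get_ngrams(ex.text, n) & corpus_ngrams
+    )
+    return contaminated / len(test_examples)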
+```
+
+### Pairwise human eval (Arena-style)
+```python
+import random
+from collections import Counter
+
+def pairwise_eval(model_a, model_b, prompts, human_judge, n_judges=10):
+    """human_judge(prompt, r1, r2) must return '1', '2', or 'tie'."""
+    wins = {'a': 0, 'b': 0, 'tie': 0}
+    for prompt in prompts:
+        ra, rb = model_a.gen(prompt), model_b.gen(prompt)
+        # Randomize display order to cancel position bias
+        if random.random() < 0.5:
+            r1, r2, first = ra, rb, 'a'
+        else:
+            r1, r2, first = rb, ra, 'b'
+
+        votes = [human_judge(prompt, r1, r2) for _ in range(n_judges)]
+        winner = Counter(votes).most_common(1)[0][0]  # plurality vote
+        if winner == 'tie':
+            wins['tie'] += 1
+        elif winner == '1':
+            wins[first] += 1
+        else:
+            wins['b' if first == 'a' else 'a'] += 1
+    return wins
+```
+
+### Bradley-Terry (Elo) for LMSYS Arena
+```python
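+import numpy as np
+from sklearn.linear_model import LogisticRegression
+
+def fit_elo(matches, models):
+    # matches: [(winner_idx, loser_idx), ...]; needs at least two matches
+    X = np.zeros((len(matches), len(models)))
+    y = np.zeros(len(matches), dtype=int)
+    for i, (w, l) in enumerate(matches):
+        # Alternate which model is listed first so both labels occur:
+        # LogisticRegression cannot fit a target with a single class.
+        first, second = (w, l) if i % 2 == 0 else (l, w)
+        X[i, first] = 1
+        X[i, second] = -1
+        y[i] = 1 if first == w else 0
+
+    # Default L2 regularization keeps ratings finite for unbeaten models
+    clf = LogisticRegression(fit_intercept=False).fit(X, y)
+    # Elo = Bradley-Terry coefficient rescaled to the base-10 / 400-point scale
+    return 400 / np.log(10) * clf.coef_[0] + 1000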
+```
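
Ratings on this scale are easiest to read as win probabilities. With the 400 / ln(10) scaling used in `fit_elo`, a rating gap maps to an expected win rate as sketched below; the function name is illustrative.

```python
def elo_win_probability(rating_a: float, rating_b: float) -> float:
    """Expected probability that A beats B under the Elo / Bradley-Terry model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


if __name__ == "__main__":
    # A 100-point gap versus a 400-point gap.
    print(elo_win_probability(1100, 1000))
    print(elo_win_probability(1400, 1000))
```

A 400-point gap corresponds to 10:1 odds, so small rating differences between leaderboard neighbours translate to win rates close to 50%.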
+
+## 🤔 Decision criteria
+| Purpose | Benchmark |
+|---|---|
+| LLM general | MMLU-Pro + GPQA + Arena |
+| Math | MATH + AIME |
+| Code | SWE-bench + LiveCodeBench |
+| Instruction | IFEval + AlpacaEval |
+| Safety | TruthfulQA + HarmBench |
+| Long context | RULER + Needle |
+| Agentic | GAIA + WebArena |
+| Multi-modal | MMMU |
+| Internal | Custom (task-specific) |
+
+**Default**: multiple benchmarks + an internal eval, cross-checked against Arena.
+
+## 🔗 Graph
+- Parent: [[Evaluation]] · [[ML-Metrics]]
+- Variants: [[MMLU]] · [[HumanEval]] · [[SWE-bench]] · [[GLUE]] · [[ImageNet]]
+- Applications: [[lm-evaluation-harness]] · [[HELM]] · [[OpenCompass]] · [[LMSYS-Arena]]
+- Adjacent: [[Goodharts-Law]] · [[Data-Contamination]] · [[LLM-as-Judge]] · [[Construct-Validity]]
+
+## 🤖 LLM usage
+**When**: model selection; measuring the effect of fine-tuning; identifying capability gaps.
+**When not**: relying on a single benchmark as the deciding factor; skipping the contamination check.
+
+## ❌ Anti-patterns
+- **Single benchmark**: vulnerable to gaming.
+- **Training on a public test set**: contamination.
+- **No Arena / human eval**: academic scores ≠ real-world quality.
+- **Stale (saturated) benchmark**: carries no information.
+- **LLM-as-judge only**: self-bias (GPT-4 favors GPT-4 outputs).
+- **No internal eval**: misses task-specific gaps.
+
+## 🧪 Validation / Duplicates
+- Verified (Stanford HELM, EleutherAI harness, LMSYS).
+- Trust level: A.
+- Related: [[MMLU]] · [[Goodharts-Law]] · [[Data-Contamination]] · [[LLM-as-Judge]].
+
+## 🕓 Changelog
+| Date | Change |
+|---|---|
+| 2026-05-08 | Phase 1 |
+| 2026-05-10 | Manual cleanup: benchmark catalog + contamination + lm-eval / HELM code |