[G1-Sync] Manual knowledge update

2026-05-10 22:08:15 +09:00
parent 21ac3ed255
commit 504fd5fb42
3011 changed files with 380280 additions and 206977 deletions
@@ -1,73 +1,342 @@
 ---
 id: wiki-2026-0508-ai-evaluation-benchmarks
-title: "AI Evaluation & Benchmarks"
+title: AI Evaluation & Benchmarks
 category: 10_Wiki/Topics
-status: needs_review
+status: verified
 canonical_id: self
-aliases: [P-Reinforce-AUTO-EVBM-001]
+aliases: [LLM eval, model benchmark, MMLU, HumanEval, SWE-bench, Chatbot Arena, NIAH, RULER]
 duplicate_of: none
-source_trust_level: A
-confidence_score: 1.0
-tags: [auto-reinforced, ai-evaluation, benchmarks, niah, ruler, mmlu, lmsys, evaluation-metrics]
+source_trust_level: B
+confidence_score: 0.9
+verification_status: conceptual
+tags: [llm-eval, benchmark, mmlu, humaneval, swe-bench, chatbot-arena, niah, contamination, ai-quality]
 raw_sources: []
-last_reinforced: 2026-05-04
+last_reinforced: 2026-05-09
 github_commit: pending
-inferred_by: Claude Opus 4.7 (auto-normalize 2026-05-08)
+inferred_by: Claude Opus 4.7 (manual cleanup 2026-05-09)
+tech_stack:
+  language: Python / TS
+  framework: Promptfoo / LangSmith / Inspect / lm-eval-harness
 ---

-# [[AI Evaluation & Benchmarks|AI Evaluation & Benchmarks]]
+# AI Evaluation & Benchmarks

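 ## 📌 One-Line Insight (The Karpathy Summary)
-> "A yardstick for intelligence: rather than simply calling a model 'good', a standardized exam that measures its real weight class through quantitative metrics such as math, coding, common sense, and recall across a million tokens."
+> **"Good" vs. "measured"**. Benchmarks are standardized tests for each capability (math, code, reasoning, long context, tool use). Weaknesses: contamination, Goodhart's law, and the gap between evals and real-world use. A modern stack combines LMSYS Chatbot Arena (human preference), SWE-bench (real tasks), and custom domain evals.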

 ## 📖 Structured Knowledge (Synthesized Content)
-Standardized evaluation metrics for objectively comparing AI model capabilities and identifying their limits.

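-1.  **Traditional benchmarks**:
-    *   **MMLU (Massive Multitask Language Understanding)**: A standard exam measuring knowledge across 57 subjects, including humanities, social sciences, and math.
-    *   **HumanEval / MBPP**: Evaluate a model's Python code generation.
-    *   **GSM8K**: Measures multi-step solving of grade-school math word problems.
-2.  **Long-context benchmarks**:
-    *   **Needle In A Haystack (NIAH)**: Checks retrieval of a specific fact from a huge context, visualized as a chart.
-    *   **RULER**: Goes beyond simple retrieval to evaluate complex long-context use such as summarization and reasoning.
-3.  **Real-world and agent evaluation**:
-    *   **LMSYS Chatbot Arena**: An Elo rating system built on blind tests by real users.
-    *   **MCP-Atlas**: Measures tool integration and orchestration performance using [[Model Context Protocol (MCP)|MCP]].
-    *   **SWE-bench**: Measures whether a model can resolve real open-source GitHub issues on its own.
+### Benchmark families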
+
+#### 1. Knowledge / reasoning
+| Benchmark | Measures | Note |
+|---|---|---|
+| **MMLU** (57 subjects) | Multi-domain knowledge | Most widely cited. Saturated at 90%+. |
+| **MMLU-Pro** | Harder extension of MMLU | Frontier scores remain well below MMLU's. |
+| **GPQA** | PhD-level science | Not yet saturated. |
+| **HellaSwag** | Commonsense reasoning | Old, saturated. |
+| **ARC-AGI** | Pattern reasoning | OpenAI o3 reported about 75% (human baseline about 85%). |
+
+#### 2. Math
+| Benchmark | Measures | Note |
+|---|---|---|
+| **GSM8K** | Grade-school multi-step word problems | Saturated (95%+). |
+| **MATH** | Competition problems | Frontier 70-90%. |
+| **AIME** | American Invitational Mathematics Examination | Hard. o1 / R1 do well. |
+| **FrontierMath** | Research-level math | Far from saturated (frontier <5%). |
+
+#### 3. Code
+| Benchmark | Measures | Note |
+|---|---|---|
+| **HumanEval** | Python function generation | Saturated (95%+). |
+| **MBPP** | Python coding | Saturated. |
+| **SWE-bench** | Real GitHub issues | Frontier ~50-60%. |
+| **SWE-bench Verified** | Curated subset | More reliable. |
+| **BigCodeBench** | Complex Python | Frontier ~30-50%. |
+| **LiveCodeBench** | Recent contest problems (e.g. LeetCode) | Refreshed regularly to limit contamination. |
+
+#### 4. Long context
+| Benchmark | Measures | Note |
+|---|---|---|
+| **NIAH (Needle in a Haystack)** | Retrieval of a "needle" sentence | Now trivial; too easy for current models. |
+| **RULER** | Multi-needle, summarization, multi-hop | More realistic. |
+| **LongBench** | Long-document QA |  |
+| **Loong** | Multi-document reasoning |  |
+
+#### 5. Agent / tool
+| Benchmark | Measures | Note |
+|---|---|---|
+| **GAIA** | Real-world tasks (web, file) | Frontier ~30%. |
+| **SWE-bench** | Code agents | Cited as a benchmark for Devin / Cursor-style agents. |
+| **WebArena / VisualWebArena** | Browser agents | Far from saturated (frontier <30%). |
+| **MCP-Atlas** | Tool composition |  |
+| **τ-bench** | Customer service simulation |  |
+
+#### 6. Real-world / human preference
+| Benchmark | Measures | Note |
+|---|---|---|
+| **LMSYS Chatbot Arena** | Blind A/B votes + Elo | Most trusted real-world signal. |
+| **MT-Bench** | Multi-turn quality (LLM judge) |  |
+| **AlpacaEval** | LLM judge |  |
+| **Vibes** | Subjective preference (community) |  |
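
A minimal sketch of how blind A/B votes could be turned into Elo ratings, as in the Chatbot Arena row above. The K-factor of 32, the starting rating of 1000, and the model names are illustrative assumptions; the Arena's actual rating procedure may differ.

```python
from collections import defaultdict

K = 32          # assumed update step size
START = 1000.0  # assumed initial rating for every model

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update_elo(ratings, model_a, model_b, outcome):
    """Apply one vote. outcome: 1.0 = A wins, 0.0 = B wins, 0.5 = tie."""
    e_a = expected_score(ratings[model_a], ratings[model_b])
    ratings[model_a] += K * (outcome - e_a)
    ratings[model_b] += K * ((1.0 - outcome) - (1.0 - e_a))

ratings = defaultdict(lambda: START)
# Hypothetical votes: (model shown as A, model shown as B, outcome for A)
votes = [
    ("model-x", "model-y", 1.0),
    ("model-x", "model-y", 1.0),
    ("model-y", "model-x", 0.5),
]
for a, b, outcome in votes:
    update_elo(ratings, a, b, outcome)

for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.1f}")
```

Each update is zero-sum, so the total rating mass stays constant and only the ordering and gaps between models carry information.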
+
+#### 7. Safety / alignment
+| Benchmark | Measures | Note |
+|---|---|---|
+| **TruthfulQA** | Avoids stating falsehoods |  |
+| **HarmBench** | Refuses harmful requests |  |
+| **Anthropic Persuasion** |  |  |
+| **Constitutional AI eval** |  |  |
+
+### Pitfalls (Goodhart's Law in AI)
+1. **Contamination**: benchmark data leaks into training data, inflating scores. Every frontier model is under suspicion.
+2. **Overfitting**: each release is tuned toward specific benchmarks.
+3. **"Solution lookup"**: GSM8K questions appear in training data, so the model retrieves answers instead of reasoning.
+4. **Synthetic-data saturation**: questions written by an LLM get solved by the same LLM.
+5. **Real-world ≠ benchmark**: high scores paired with poor UX are common.
+6. **Subjectivity**: chatbot quality is hard to measure.
+
+→ Benchmark lifecycle: new → meaningful → saturated → no longer meaningful → retired.
+
+### Trends in new benchmarks
+- **Live / dynamic** (LiveCodeBench, ARC-AGI): refreshed on a rolling basis (e.g. monthly).
+- **Verified** (SWE-bench Verified): human-curated.
+- **Real task** (GAIA, τ-bench): actual work.
+- **Human preference** (Arena): hard to game.
+- **Domain-specific**: medical (MedQA), legal (LegalBench), scientific.
+
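+## 💻 Code Patterns
+
+### lm-eval-harness (EleutherAI standard)
+```bash
+pip install lm-eval
+
+# Run benchmarks (humaneval executes generated code, so recent versions may ask you to opt in explicitly)
+lm_eval --model hf --model_args pretrained=meta-llama/Meta-Llama-3-8B \
+    --tasks mmlu,gsm8k,humaneval \
+    --batch_size 8
+
+# Results print as a table; add --output_path <dir> to also write JSON
+```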
+
+### Promptfoo (custom eval)
+```yaml
+# promptfooconfig.yaml
+prompts:
+  - 'Solve this math problem: {{problem}}'
+
+providers:
+  - openai:gpt-4o-mini
+  - anthropic:messages:claude-haiku-4-5
+
+tests:
+  - vars:
+      problem: 'If a train travels 60 mph for 2 hours, how far?'
+    assert:
+      - type: contains
+        value: '120'
+```
+
+```bash
+promptfoo eval
+```
+
+### LangSmith eval
+```python
+from langsmith import Client
+from langchain.smith import RunEvalConfig
+
+# Legacy LangChain-based runner; `chain` is assumed to be a chain built elsewhere.
+client = Client()
+results = client.run_on_dataset(
+    dataset_name='math-questions',
+    llm_or_chain_factory=chain,
+    evaluation=RunEvalConfig(evaluators=['qa', 'context_qa']),
+)
+```
+
+### LLM-as-judge
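+```python
+import json
+
+# `judge_llm` is assumed to be a client whose complete() returns the raw JSON string.
+def judge(question, answer, expected):
+    prompt = f'''
+Score the answer on a 1-10 scale.
+
+Question: {question}
+Expected: {expected}
+Answer: {answer}
+
+Output JSON: {{"score": N, "reason": "..."}}
+'''
+    # Parse the judge's reply into a dict with "score" and "reason".
+    return json.loads(judge_llm.complete(prompt))
+```
+
+→ Cheap and scalable. Bias risk: a model judging its own outputs tends to favor them.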
+
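+### Writing a custom benchmark
+```python
+# Golden set
+test_cases = [
+    {'input': 'What is 2+2?', 'expected': '4'},
+    {'input': 'Capital of France?', 'expected': 'Paris'},
+    # ... 100+
+]
+
+def match(answer, expected):
+    """Lenient check: the expected string appears in the answer, ignoring case."""
+    return expected.strip().lower() in answer.strip().lower()
+
+def evaluate(model):
+    """Accuracy on the golden set; `model` is assumed to expose complete(str) -> str."""
+    correct = 0
+    for case in test_cases:
+        answer = model.complete(case['input'])
+        if match(answer, case['expected']):
+            correct += 1
+    return correct / len(test_cases)
+```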
+
+### Inspect (UK AISI)
+```python
+from inspect_ai import Task, task, eval
+from inspect_ai.dataset import Sample
+from inspect_ai.scorer import match
+from inspect_ai.solver import generate
+
+@task
+def my_task():
+    return Task(
+        dataset=[
+            Sample(input='Capital of France?', target='Paris'),
+            Sample(input='What is 2+2?', target='4'),
+        ],
+        solver=generate(),
+        scorer=match(),
+    )
+
+eval(my_task(), model='openai/gpt-4o-mini')
+```
+
+→ AISI / safety-focused.
+
+### Contamination check
+```python
+# Fraction of test items that share at least one n-gram with the training set (lower is better)
+def check_contamination(test_set, train_set, n=8):
+    train_ngrams = set()
+    for doc in train_set:
+        tokens = doc.split()
+        for i in range(len(tokens) - n + 1):
+            train_ngrams.add(tuple(tokens[i:i+n]))
+    
+    overlapping = 0
+    for q in test_set:
+        tokens = q.split()
+        for i in range(len(tokens) - n + 1):
+            if tuple(tokens[i:i+n]) in train_ngrams:
+                overlapping += 1
+                break
+    
+    return overlapping / len(test_set)
+```
+
+→ An overlap of 5% or more is suspicious.
+
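+### Domain-specific eval (e.g. medical)
+```python
+# MedQA-style item; more than one answer can be acceptable depending on context.
+test = [
+    {
+        'q': 'Patient has fever, cough, fatigue. Most likely?',
+        'options': ['flu', 'covid', 'allergies', 'cancer'],
+        'correct': {'flu', 'covid'},  # context-dependent
+    },
+]
+
+def top_k_hit(ranked_options, correct, k=2):
+    """True if any of the model's top-k ranked options is an accepted answer."""
+    return any(option in correct for option in ranked_options[:k])
+
+# Score = top-1 or top-2 accuracy.
+```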
+
+### Continuous eval (production)
+```python
+# `trace`, `llm`, and `log` are assumed to come from your observability / client stack.
+@trace
+def chat(query):
+    response = llm.complete(query)
+    log({'query': query, 'response': response, 'tokens': ...})
+    return response
+
+# Daily:
+# 1. Sample 100 production queries.
+# 2. Score them with an LLM judge.
+# 3. Track the trend over time.
+```
+
+→ Detects drift.
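
One way the daily trend could be turned into a drift alert is to compare a recent window of mean judge scores against the window just before it. This is only a sketch: the window sizes, the `max_drop` threshold, and the function name `detect_drift` are assumptions for illustration, not part of any tool named in this note.

```python
from statistics import mean

def detect_drift(daily_scores, baseline_days=7, recent_days=3, max_drop=0.5):
    """Return True if the recent mean score fell more than max_drop below the baseline mean.

    daily_scores: one mean LLM-judge score per day, oldest first.
    """
    if len(daily_scores) < baseline_days + recent_days:
        return False  # not enough history to compare two windows
    # Baseline = the window immediately before the recent window.
    baseline = mean(daily_scores[-(baseline_days + recent_days):-recent_days])
    recent = mean(daily_scores[-recent_days:])
    return (baseline - recent) > max_drop

# Hypothetical score histories on a 1-10 judge scale.
stable = [8.0] * 10
dropped = [8.0] * 7 + [7.0, 6.9, 7.1]
print(detect_drift(stable), detect_drift(dropped))
```

A fixed threshold is the simplest choice; a noisier score series would likely need a variance-aware test instead.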
+
+## 🤔 Decision Criteria
+
+| Task | Benchmark |
+|---|---|
+| Generic capability | MMLU + GSM8K + HumanEval |
+| Long context | RULER (NIAH is too easy) |
+| Real-world coding | SWE-bench Verified |
+| Real-world agent | GAIA / τ-bench |
+| Human-perceived quality | LMSYS Chatbot Arena Elo |
+| Math reasoning | AIME / FrontierMath |
+| Domain (medical, legal) | Domain-specific (MedQA, LegalBench) |
+| Production app | Custom golden set + LLM judge |
+| Safety | TruthfulQA + HarmBench |
+
+**Default**: a custom domain eval (built from production traffic) plus a Promptfoo CI gate. Run regression checks on every release.
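
The regression check behind such a release gate can be sketched independently of any tool. The metric names, scores, and `tolerance` below are hypothetical, and this is not Promptfoo's own mechanism, just the comparison a CI step might run on stored eval results.

```python
def regression_gate(baseline: dict, candidate: dict, tolerance: float = 0.01) -> list:
    """Return the metrics where the candidate is missing or dropped more than tolerance."""
    failures = []
    for name, base_score in baseline.items():
        new_score = candidate.get(name)
        if new_score is None or new_score < base_score - tolerance:
            failures.append(name)
    return failures

# Hypothetical eval results for the current release and the release candidate.
baseline = {"golden_set_accuracy": 0.91, "judge_score": 0.84}
candidate = {"golden_set_accuracy": 0.92, "judge_score": 0.80}

failed = regression_gate(baseline, candidate)
exit_code = 1 if failed else 0  # a CI job would pass this to sys.exit()
print("Regression in:", ", ".join(failed) if failed else "none")
```

The tolerance absorbs run-to-run noise; setting it to zero would block releases on fluctuations that are not real regressions.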

 ## ⚠️ Contradictions & Updates
-*   **Data contamination**: evaluation data ends up in the model's training data, producing "memorized" scores higher than its real ability. This is a serious problem.
-*   **Goodhart's Law**: once a metric becomes a target, it stops being a good metric. (Shortcut training aimed only at raising scores is widespread.)
+- **Fast saturation**: MMLU is saturated at 90%. A new benchmark is needed roughly every 6 months.
+- **Real-world gap**: high benchmark scores paired with poor UX are common. Production eval matters more.
+- **Contamination epidemic**: every frontier model is suspect. Live benchmarks (LiveCodeBench) are the answer.
+- **Bench shopping**: vendors publish only their best benchmarks, so every reported result is cherry-picked.
+- **Multi-modal**: not text only. Image (MMMU), video (Video-MME), audio.
+- **Evaluating reasoning traces**: measuring the quality of o1 / R1 chain-of-thought is a new challenge.

 ## 🔗 Knowledge Links (Graph)
-*   **Performance-related**: [[LLM Capabilities|LLM Capabilities]], [[Reasoning Models|Reasoning Models]]
-*   **Technology-related**: [[Context Window & Long-Context LLMs|Context Window]], [[Tool Use & Function Calling|Tool Use]]
-
---
-*Last updated: 2026-05-04*
+- Parents: [[LLM-Capabilities]] · [[Model-Quality]] · [[ML-Eval-Methodology]]
+- Variants: [[Static-Benchmark]] · [[Live-Benchmark]] · [[Human-Pref-Eval]] · [[LLM-as-Judge]]
+- Applications: [[Continuous-Learning-System]] · [[Production-Drift-Detection]] · [[Domain-Specific-Eval]]
+- Adjacent: [[Contamination-Detection]] · [[Goodhart-Law-AI]] · [[Reasoning-Trace-Eval]]
+- Tools: lm-eval-harness · Promptfoo · LangSmith · Inspect (AISI) · Braintrust · Helicone · Langfuse
+- Related: [[Continuous-Learning-System]] · [[AI-Code-Agent-Patterns]] · [[Multi-Modal-Vision-Production]]

 ## 🤖 LLM Usage Hints (How to Use This Knowledge)

 **When to use this knowledge:**
- *(TODO)*
+- Comparing the quality of new LLMs (deciding which model to use).
+- Designing the eval behind a production system's release gate.
+- Regression checks whenever a prompt changes.
+- Measuring quality in a domain-specific application.
+- Reality-checking vendor marketing claims.

 **When not to use it:**
- *(TODO)*
+- Relying on benchmarks alone (without real user feedback).
+- Deciding from a single benchmark (overfit risk).
+- Trusting a contaminated benchmark.
+- Using an expensive frontier model for a small task (overkill).
+- Running only generic evals with no domain eval (fails in production).
+
+## ❌ Anti-Patterns
+- **Single benchmark + claiming "best"**: cherry-picking. Use multiple benchmarks.
+- **No contamination check**: inflated scores.
+- **Same static benchmark year after year**: once saturated, it carries no signal.
+- **No human eval**: an LLM judge alone is biased.
+- **No production eval**: the benchmark-vs-reality gap goes unmeasured.
+- **Benchmark in the training data**: the model's scores are dishonest.
+- **Ignoring eval cost**: GPT-4 judge × 10k cases = $$.
+- **Estimating a model's ceiling from a saturated benchmark**: misjudges every model's ceiling.
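
To make the eval-cost item concrete, here is a rough back-of-the-envelope estimator. The token counts and per-million-token prices are placeholder assumptions, not any vendor's actual pricing.

```python
def judge_cost_usd(n_cases, input_tokens, output_tokens,
                   usd_per_m_input, usd_per_m_output):
    """Estimated total cost of one judge call per test case."""
    input_cost = n_cases * input_tokens / 1_000_000 * usd_per_m_input
    output_cost = n_cases * output_tokens / 1_000_000 * usd_per_m_output
    return input_cost + output_cost

# Placeholder numbers: 10k cases, 800 prompt tokens and 100 completion tokens each.
estimate = judge_cost_usd(10_000, 800, 100,
                          usd_per_m_input=10.0, usd_per_m_output=30.0)
print(f"~${estimate:.2f} per full eval run")
```

Multiplying this by the number of runs per release shows quickly whether sampling a subset or using a cheaper judge is worth considering.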

 ## 🧪 Validation Status (Validation)
-
- **Information status:** needs_review
- **Source trust level:** A
- **Review reason:** *(P-Reinforce Phase 1 auto-normalization. Body needs verification.)*
+- **Information status:** verified (concept-level).
+- **Source trust level:** B (Hugging Face leaderboard, Stanford HAI report, Papers With Code).
+- **Review reason:** Manual cleanup. Specific benchmark numbers change monthly. A review every 6 months is recommended.

 ## 🧬 Duplicate Check
-
- **Existing similar documents:** *(TODO: see indexer cluster report)*
- **Action:** UPDATE (auto-normalization)
- **Reason:** Phase 1 normalization: fill in old template / missing fields.
+- **Existing similar documents:** [[LLM-Capabilities]] (related), [[Continuous-Learning-System]] (production eval), [[AI_Eval_Framework_Modern]] (tools).
+- **Action:** KEEP (overview of benchmarks).
+- **Reason:** Kept separate from the tool / framework document. Covers the details of each benchmark.

 ## 🕓 Changelog
-
 | Date | Change | Action | Trust level |
 |------|-----------|-----------|--------|
-| 2026-05-08 | P-Reinforce Phase 1 normalization (frontmatter + header standardization) | UPDATE | A |
+| 2026-05-08 | P-Reinforce Phase 1 normalization | UPDATE | A |
+| 2026-05-09 | Manual cleanup: added code patterns, benchmark families, decision criteria, and anti-patterns | UPDATE | B |