Antigravity Agent 504fd5fb42 [G1-Sync] Manual knowledge update

2026-05-10 22:08:15 +09:00

id, title, category, status, canonical_id, aliases, duplicate_of, source_trust_level, confidence_score, verification_status, tags, raw_sources, last_reinforced, github_commit, inferred_by, tech_stack

title

AI Evaluation & Benchmarks

📌 One-Line Insight (The Karpathy Summary)

"Feels good" vs. "measured": every capability (math, code, reasoning, long context, tool use) gets a standardized test. Weaknesses: contamination, Goodhart's law, and the fact that eval ≠ real world. The modern stack = LMSYS Chatbot Arena (human preference) + SWE-bench (real tasks) + custom domain evals.

📖 Structured Knowledge (Synthesized Content)

Benchmark families

1. Knowledge / reasoning

Benchmark	Measures	Note
MMLU (57 subjects)	Broad multi-domain knowledge	Most widely reported. Saturated at 90%+.
MMLU-Pro	Harder extension of MMLU	Frontier scores sit well below MMLU levels.
GPQA	PhD-level science	Not yet saturated.
HellaSwag	Commonsense reasoning	Old; saturated.
ARC-AGI	Pattern reasoning	OpenAI o3 reached about 75% (human baseline ≈ 85%).

2. Math

Benchmark	Measures
GSM8K	Grade-school multi-step word problems
MATH	Competition problems
AIME	American Invitational Mathematics Examination (olympiad qualifier)
FrontierMath	Research-level

3. Code

Benchmark	Measures
HumanEval	Python function generation
MBPP	Python coding
SWE-bench	Real GitHub issues
SWE-bench Verified	Human-curated subset
BigCodeBench	Complex Python
LiveCodeBench	Recent contest problems (LeetCode and similar sites)

4. Long context

Benchmark	Measures
NIAH (Needle in a Haystack)	Retrieval of a single "needle" sentence
RULER	Multi-needle, summarize, multi-hop
LongBench	Long doc QA
Loong	Multi-doc reasoning
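The needle-in-a-haystack setup can be sketched roughly as follows. This is a minimal illustration, not the reference NIAH implementation: `build_haystack`, `needle_found`, the filler text, and the needle sentence are all hypothetical names and values made up for the example.

```python
def build_haystack(filler_sentences, needle, depth):
    """Insert `needle` at a relative `depth` (0.0 = start, 1.0 = end)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = round(depth * len(filler_sentences))
    sentences = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(sentences)


def needle_found(model_answer, expected_fact):
    """Score one retrieval with a case-insensitive substring check."""
    return expected_fact.lower() in model_answer.lower()


# Hypothetical example data: ten filler sentences and one needle in the middle.
filler = [f"Filler sentence number {i}." for i in range(10)]
needle = "The secret code is 7421."
prompt = build_haystack(filler, needle, depth=0.5)
```

Sweeping `depth` and the amount of filler shows where retrieval starts to degrade, which is the usual way such results are plotted.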

5. Agent / tool

Benchmark	Measures
GAIA	Real-world tasks (web, file)
SWE-bench	Code agent
WebArena / VisualWebArena	Browser agent
MCP-Atlas	Tool composition
τ-bench	Customer service simulation

6. Real-world / human pref

Benchmark	Measures
LMSYS Chatbot Arena	Blind A/B + Elo
MT-Bench	Multi-turn quality (LLM-judge)
AlpacaEval	LLM-judge
Vibes	Subjective pref (community)
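"Blind A/B + Elo" can be illustrated with the textbook Elo update. This is a sketch under assumptions: the K-factor of 32, the starting rating of 1000, and the model names are placeholders, and the Arena's production rating method may differ from this simple online update.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a, rating_b, score_a, k=32):
    """Update both ratings after one vote.

    score_a is 1.0 if A won, 0.5 for a tie, 0.0 if B won.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Hypothetical models, both starting at 1000; one blind vote goes to model_x.
ratings = {"model_x": 1000.0, "model_y": 1000.0}
ratings["model_x"], ratings["model_y"] = update_elo(
    ratings["model_x"], ratings["model_y"], score_a=1.0
)
```

The update is zero-sum: whatever one model gains, the other loses, so ratings only carry meaning relative to each other.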

7. Safety / alignment

Benchmark	Measures
TruthfulQA	Avoids stating falsehoods
HarmBench	Refuse harmful
Anthropic Persuasion
Constitutional AI eval

Pitfalls (Goodhart's Law in AI)

Contamination: benchmark data leaks into training data → inflated scores. Every frontier model is under suspicion.
Overfitting: each release is tuned toward specific benchmarks.
"Solution lookup": GSM8K questions appear in the training data, so the model retrieves answers instead of reasoning.
Synthetic-data saturation: questions generated by an LLM get solved by the same LLM.
Real-world ≠ benchmark: a high score paired with bad UX is common.
Subjectivity: chatbot quality is tricky to measure.

→ Benchmark lifecycle: new → meaningful → saturated → no longer meaningful → retired.

New benchmark trends

Live / dynamic (LiveCodeBench, ARC-AGI): updated regularly (e.g., monthly).
Verified (SWE-bench Verified): human-curated.
Real tasks (GAIA, τ-bench): actual work.
Human preference (Arena): hard to game.
Domain-specific: medical (MedQA), legal (LegalBench), scientific.
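The idea behind live benchmarks can be sketched as a date filter: only score a model on problems released after its training cutoff. The field name `released`, the problem IDs, and the cutoff date are hypothetical values for illustration, not any benchmark's actual schema.

```python
from datetime import date


def filter_after_cutoff(problems, training_cutoff):
    """Keep only problems published strictly after the model's training cutoff."""
    return [p for p in problems if p["released"] > training_cutoff]


# Hypothetical problem pool with release dates.
problems = [
    {"id": "p1", "released": date(2023, 6, 1)},
    {"id": "p2", "released": date(2024, 9, 15)},
]

# Assumed training cutoff for the model under test.
fresh = filter_after_cutoff(problems, training_cutoff=date(2024, 1, 1))
```

Problems in `fresh` cannot have appeared in the training data, provided the release dates and the stated cutoff are accurate.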

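💻 Code Patterns

lm-eval-harness (EleutherAI standard)

pip install lm-eval

# Run benchmarks
# (humaneval executes model-generated code; recent lm-eval releases may require an explicit opt-in)
lm_eval --model hf --model_args pretrained=meta-llama/Meta-Llama-3-8B \
    --tasks mmlu,gsm8k,humaneval \
    --batch_size 8 \
    --output_path results/

# Results are written as JSON under results/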

Promptfoo (custom eval)

# promptfooconfig.yaml
prompts:
  - 'Solve this math problem: {{problem}}'

providers:
  - openai:gpt-4o-mini
  - anthropic:messages:claude-haiku-4-5

tests:
  - vars:
      problem: 'If a train travels 60 mph for 2 hours, how far?'
    assert:
      - type: contains
        value: '120'

promptfoo eval

LangSmith eval

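# Legacy LangChain evaluation API; `chain` is the LangChain runnable under test, defined elsewhere.
from langchain.smith import RunEvalConfig, run_on_dataset
from langsmith import Client

client = Client()
results = run_on_dataset(
    client=client,
    dataset_name='math-questions',
    llm_or_chain_factory=lambda: chain,
    evaluation=RunEvalConfig(evaluators=['qa', 'context_qa']),
)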

LLM-as-judge

import json

# `judge_llm` is a placeholder for any client whose complete() returns the model's text.
def judge(question, answer, expected):
    prompt = f'''
Score the answer on a 1-10 scale.

Question: {question}
Expected: {expected}
Answer: {answer}

Output JSON: {{"score": N, "reason": "..."}}
'''
    return json.loads(judge_llm.complete(prompt))

→ Cheap and scalable. Bias risk: a model grading its own outputs tends to favor them.
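Judge models do not always return clean JSON, so a direct `json.loads` on the raw reply can fail. A defensive parser might look like the sketch below; `parse_judge_output` is a hypothetical helper, and the 1-10 range is taken from the prompt format used in this note.

```python
import json
import re


def parse_judge_output(raw, lo=1, hi=10):
    """Extract {"score", "reason"} from judge text; return None if unusable."""
    # Grab the outermost {...} span in case the judge adds chatter around it.
    found = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if found is None:
        return None
    try:
        data = json.loads(found.group(0))
    except json.JSONDecodeError:
        return None
    score = data.get("score")
    # Reject missing, non-numeric, boolean, or out-of-range scores.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return None
    if not lo <= score <= hi:
        return None
    return {"score": score, "reason": str(data.get("reason", ""))}


parsed = parse_judge_output('Sure! {"score": 8, "reason": "matches expected"}')
```

Returning `None` instead of raising lets the caller count unparseable judgments separately rather than silently dropping or crashing on them.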

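Writing a custom benchmark

# Golden set
test_cases = [
    {'input': 'What is 2+2?', 'expected': '4'},
    {'input': 'Capital of France?', 'expected': 'Paris'},
    # ... 100+
]

def match(answer, expected):
    # Lenient check: the expected string appears in the answer, ignoring case.
    return expected.strip().lower() in answer.strip().lower()

def evaluate(model):
    # `model` is any object whose complete(prompt) returns text.
    correct = 0
    for case in test_cases:
        answer = model.complete(case['input'])
        if match(answer, case['expected']):
            correct += 1
    return correct / len(test_cases)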

Inspect (UK AISI)

from inspect_ai import Task, task, eval
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate

@task
def my_task():
    return Task(
        dataset=[
            Sample(input='Capital of France?', target='Paris'),
            Sample(input='What is 2+2?', target='4'),
        ],
        solver=[generate()],
        scorer=match(),
    )

eval(my_task(), model='openai/gpt-4o-mini')

→ From the UK AISI; safety-focused.

Contamination check

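# n-gram overlap (lower is better)
def check_contamination(test_set, train_set, n=8):
    # Collect every n-gram of whitespace tokens seen in the training documents.
    train_ngrams = set()
    for doc in train_set:
        tokens = doc.split()
        for i in range(len(tokens) - n + 1):
            train_ngrams.add(tuple(tokens[i:i+n]))

    # Count test questions that share at least one n-gram with the training data.
    overlapping = 0
    for q in test_set:
        tokens = q.split()
        for i in range(len(tokens) - n + 1):
            if tuple(tokens[i:i+n]) in train_ngrams:
                overlapping += 1
                break

    # Fraction of test questions flagged; 0.0 for an empty test set.
    return overlapping / len(test_set) if test_set else 0.0

→ 5%+ overlap = suspicious.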

Domain-specific eval (e.g., medical)

# MedQA-style
test = [
    {
        'q': 'Patient has fever, cough, fatigue. Most likely?',
        'options': ['flu', 'covid', 'allergies', 'cancer'],
        'correct': ['flu', 'covid'],  # either may be right, depending on context
    },
]

# Score = top-1 or top-2 accuracy.
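Top-1 and top-2 accuracy can be computed as in the sketch below. It assumes the model returns its options ranked from most to least likely; `top_k_accuracy` and the sample rankings are hypothetical and not part of MedQA itself.

```python
def top_k_accuracy(ranked_predictions, gold_answers, k=1):
    """Fraction of cases where the gold answer is among the top-k ranked options."""
    if len(ranked_predictions) != len(gold_answers):
        raise ValueError("predictions and gold answers must have the same length")
    if not gold_answers:
        return 0.0
    hits = sum(
        1
        for ranked, gold in zip(ranked_predictions, gold_answers)
        if gold in ranked[:k]
    )
    return hits / len(gold_answers)


# Hypothetical model rankings for two cases, most likely option first.
ranked = [
    ["covid", "flu", "allergies", "cancer"],
    ["flu", "covid", "allergies", "cancer"],
]
gold = ["flu", "flu"]

top1 = top_k_accuracy(ranked, gold, k=1)
top2 = top_k_accuracy(ranked, gold, k=2)
```

Reporting both numbers is useful when several answers are clinically plausible: a large gap between top-1 and top-2 suggests the model is close but poorly calibrated on the first choice.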

Continuous eval (production)

# `trace`, `llm`, and `log` are placeholders for your tracing decorator, model client, and logger.
@trace
def chat(query):
    response = llm.complete(query)
    log({'query': query, 'response': response, 'tokens': ...})
    return response

# Daily:
# 1. Sample 100 production queries.
# 2. Score them with an LLM judge.
# 3. Track the trend over time.

→ Detects drift.
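The "trend over time" step could be checked with something as simple as comparing a recent window of daily mean judge scores against a baseline window. The window sizes and the 0.5-point threshold below are assumptions picked for illustration, and `detect_drift` is a hypothetical helper.

```python
from statistics import mean


def detect_drift(daily_scores, baseline_days=7, recent_days=3, max_drop=0.5):
    """Flag drift when the recent mean falls more than `max_drop` below baseline.

    daily_scores: chronological list of daily mean judge scores (1-10 scale).
    """
    if len(daily_scores) < baseline_days + recent_days:
        return False  # not enough history to compare
    baseline = mean(daily_scores[:baseline_days])
    recent = mean(daily_scores[-recent_days:])
    return baseline - recent > max_drop


# Hypothetical history: a stable week followed by three weaker days.
history = [8.0] * 7 + [7.9, 7.2, 7.0]
drifted = detect_drift(history)
```

A fixed threshold is crude; with noisy judge scores a statistical test or control chart may be a better fit, but this shows the shape of the check.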

🤔 Decision Criteria

Task	Benchmark
Generic capability	MMLU + GSM8K + HumanEval
Long context	RULER (NIAH is too easy)
Real-world coding	SWE-bench Verified
Real-world agent	GAIA / τ-bench
Human-perceived quality	LMSYS Chatbot Arena Elo
Math reasoning	AIME / FrontierMath
Domain (medical, legal)	Domain-specific (MedQA, LegalBench)
Production app	Custom golden set + LLM-judge
Safety	TruthfulQA + HarmBench

Default: custom domain eval (on production traffic) + Promptfoo CI gate. Run regression checks on every release.
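A release gate of this kind reduces to comparing the candidate's golden-set score with the current baseline. The sketch below is framework-agnostic rather than Promptfoo-specific; the function names and the 2-point tolerance are assumptions for illustration.

```python
def regression_gate(baseline_accuracy, candidate_accuracy, tolerance=0.02):
    """Pass unless the candidate falls more than `tolerance` below the baseline."""
    return candidate_accuracy >= baseline_accuracy - tolerance


def gate_exit_code(baseline_accuracy, candidate_accuracy, tolerance=0.02):
    """Map the gate result to a process exit code (0 = pass, 1 = fail)."""
    passed = regression_gate(baseline_accuracy, candidate_accuracy, tolerance)
    return 0 if passed else 1


# Hypothetical scores from two golden-set runs.
ok_code = gate_exit_code(baseline_accuracy=0.90, candidate_accuracy=0.89)
fail_code = gate_exit_code(baseline_accuracy=0.90, candidate_accuracy=0.85)
```

In a CI job the script would end with `sys.exit(gate_exit_code(...))`, so a regression beyond the tolerance blocks the release.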

⚠️ Contradictions & Updates

Fast saturation: MMLU is saturated at 90%+. New benchmarks are needed roughly every 6 months.
Real-world gap: high benchmark scores with bad UX are common. Production eval matters more.
Contamination epidemic: every frontier model is under suspicion. Live benchmarks (LiveCodeBench) are the answer.
Bench shopping: vendors publish only the benchmarks where they look best. Treat every such claim as cherry-picked.
Multi-modal: text alone is not enough. Image (MMMU), video (Video-MME), audio.
Reasoning-trace eval: measuring the quality of o1 / R1 chain-of-thought is a new challenge.

🔗 Knowledge Links (Graph)

Parents: LLM-Capabilities · Model-Quality · ML-Eval-Methodology
Variants: Static-Benchmark · Live-Benchmark · Human-Pref-Eval · LLM-as-Judge
Applications: Continuous-Learning-System · Production-Drift-Detection · Domain-Specific-Eval
Adjacent: Contamination-Detection · Goodhart-Law-AI · Reasoning-Trace-Eval
Tools: lm-eval-harness · Promptfoo · LangSmith · Inspect (AISI) · Braintrust · Helicone · Langfuse
Related: Continuous-Learning-System · AI-Code-Agent-Patterns · Multi-Modal-Vision-Production

🤖 LLM Usage Hints (How to Use This Knowledge)

When to use this knowledge:

Comparing the quality of new LLMs (deciding which model to use).
Designing the evals behind a production system's release gate.
Regression-checking every prompt change.
Measuring quality in a domain-specific application.
Reality-checking vendor marketing claims.

When not to use it:

Relying on benchmarks alone (no real user feedback).
Deciding from a single benchmark (overfit risk).
Trusting a contaminated benchmark.
Using an expensive frontier model for a small task (overkill).
Running only generic evals with no domain eval (fails in production).

❌ Anti-Patterns

Claiming "best" from a single benchmark: that is cherry-picking. Use multiple benchmarks.
Skipping the contamination check: inflated scores.
Reusing a static benchmark year after year: once saturated, it is meaningless.
No human eval: LLM-judge alone is biased.
No production eval: leaves the benchmark-vs-reality gap unmeasured.
Benchmark in the training data: the model's score is dishonest.
Ignoring eval cost: GPT-4 judge × 10k cases = $$.
Estimating a model's ceiling from a saturated benchmark: misjudges every model's ceiling.
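The eval-cost point is easy to quantify before a run. The arithmetic below is a sketch; the token counts and per-million-token prices are placeholder assumptions, not any vendor's actual pricing.

```python
def judge_cost_usd(num_cases, input_tokens, output_tokens,
                   price_in_per_m, price_out_per_m):
    """Estimated cost of judging `num_cases` items with an LLM judge.

    Token counts are per case; prices are USD per one million tokens.
    """
    per_case = (
        input_tokens * price_in_per_m + output_tokens * price_out_per_m
    ) / 1_000_000
    return num_cases * per_case


# Placeholder numbers: 10k cases, 800 input and 150 output tokens per case,
# at assumed prices of $10 (input) and $30 (output) per million tokens.
cost = judge_cost_usd(10_000, 800, 150, price_in_per_m=10.0, price_out_per_m=30.0)
```

Under these assumed numbers one full judging pass comes to about $125, and the figure multiplies with every rerun, so sampling a subset for routine checks is often worthwhile.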

🧪 Validation

Information status: verified (concept-level).
Source trust level: B (Hugging Face leaderboard, Stanford HAI report, Papers With Code).
Review reason: Manual cleanup. Specific benchmark numbers change monthly. Review every 6 months.

🧬 Duplicate Check

Existing similar documents: LLM-Capabilities (related), Continuous-Learning-System (production eval), AI_Eval_Framework_Modern (tools).
Action: KEEP (overview of benchmarks).
Reason: Kept separate from the tool / framework notes; this one covers per-benchmark detail.

🕓 Changelog

Date	Change	Action	Trust
2026-05-08	P-Reinforce Phase 1 normalization	UPDATE	A
2026-05-09	Manual cleanup: added code patterns, benchmark families, decision criteria, and anti-patterns	UPDATE	B
