| id | title | category | status | canonical_id | aliases | duplicate_of | source_trust_level | confidence_score | verification_status | tags | raw_sources | last_reinforced | github_commit | tech_stack |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| wiki-2026-0508-llm-as-a-judge-laaj | LLM-as-a-Judge (LaaJ) | 10_Wiki/Topics | verified | self | | none | A | 0.93 | applied | | | 2026-05-10 | pending | |
LLM-as-a-Judge (LaaJ)
One-line summary
"An LLM acts as the evaluator that scores or compares the output of another LLM." It is a cheaper alternative to human evaluation. Well-known examples: MT-Bench (Zheng 2023), AlpacaEval, G-Eval. Caveat: judges are biased (length, position, similar style).
Key points
Use cases
- Model A vs. B comparison.
- Quality score (0-10).
- Specific criteria checks (helpful, harmless, factual).
- RLHF preference data generation.
- Production monitoring.
Known biases
- Position: the first answer is favored.
- Length: longer is rated as better (often falsely).
- Style match: answers in a style similar to the judge's own are favored.
- Self-preference: output from a model in the judge's own family is favored.
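Position bias can be measured before a judge is trusted. The sketch below assumes that, for each pair, one verdict was collected in the original order and one with the answers swapped; `position_flip_rate` is a hypothetical helper name, and the labels `'A'`, `'B'`, `'tie'` follow the convention used in this note.

```python
def position_flip_rate(verdicts):
    """Fraction of pairs whose verdict depends on presentation order.

    `verdicts` is a list of (original_order, swapped_order) labels, each one
    of 'A', 'B', or 'tie'. Labels refer to positions, so a consistent judge
    that says 'A' in the first run should say 'B' after the swap.
    """
    if not verdicts:
        raise ValueError("need at least one verdict pair")
    expected_after_swap = {"A": "B", "B": "A", "tie": "tie"}
    flips = sum(
        1 for first, second in verdicts
        if expected_after_swap[first] != second
    )
    return flips / len(verdicts)


# Two of these four pairs change their winner when the order changes.
sample = [("A", "B"), ("A", "A"), ("tie", "tie"), ("B", "B")]
print(position_flip_rate(sample))
```

A high flip rate suggests the judge prompt or the judge model needs attention before its verdicts are aggregated.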
Applications
- Evaluating LLMs in production.
- Iterative prompt refinement.
- RLHF preference data.
- Benchmarking.
💻 Patterns
Pairwise judge (MT-Bench style)
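import re
from types import SimpleNamespace

def pairwise_judge(question, response_a, response_b, judge_llm):
    prompt = f"""Compare two AI responses.
Question: {question}
Response A: {response_a}
Response B: {response_b}
Output:
- winner: A | B | tie
- reason: 1 sentence"""
    raw = judge_llm.generate(prompt)  # assumed to return the judge's reply as text
    # Parse the verdict so callers can read .winner; unparseable output counts as a tie.
    m = re.search(r"winner:\s*(A|B|tie)\b", raw, re.IGNORECASE)
    winner = m.group(1).lower() if m else "tie"
    return SimpleNamespace(winner=winner.upper() if winner != "tie" else "tie", raw=raw)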
Position bias mitigation (swap)
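def fair_pairwise(q, a, b, judge):
    """Judge twice with the answer order swapped to cancel position bias."""
    r1 = pairwise_judge(q, a, b, judge)
    r2 = pairwise_judge(q, b, a, judge)  # swapped: 'A' now refers to b
    if r1.winner == 'A' and r2.winner == 'B':
        return 'A'  # a wins in both orders
    if r1.winner == 'B' and r2.winner == 'A':
        return 'B'  # b wins in both orders
    return 'tie'  # genuine tie or a position-dependent verdict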
Single-answer score (rubric)
import json

def rubric_score(response, judge):
    prompt = f"""Score 1-10 on:
- helpfulness
- correctness
- clarity
- safety
Response: {response}
Output JSON: {{"helpfulness": ..., "correctness": ..., "clarity": ..., "safety": ..., "overall": ...}}"""
    # Assumes the judge replies with a bare JSON object and nothing else.
    return json.loads(judge.generate(prompt))
G-Eval (chain-of-thought judge, Liu 2023)
def g_eval(text, criterion, judge):
    """Ask the judge to reason step by step before giving a score."""
    prompt = f"""Evaluate: {criterion}
Text: {text}
Reasoning step-by-step:
1. ...
2. ...
Final score (1-5): N"""
    return judge.generate(prompt)
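G-Eval (Liu 2023) also weights each candidate score by the probability the judge assigns to it, instead of taking a single sampled score. A minimal sketch, assuming the score probabilities have already been obtained (from token log-probabilities or by sampling the judge repeatedly); `weighted_score` is a hypothetical helper, not a library function.

```python
def weighted_score(score_probs):
    """Expected score from a {score: probability} mapping.

    Probabilities are renormalized, so partial mass (for example, when only
    the top-k tokens are available) still yields a valid weighted average.
    """
    total = sum(score_probs.values())
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    return sum(score * p for score, p in score_probs.items()) / total


# Hypothetical distribution over the 1-5 scale used in the prompt above.
probs = {1: 0.0, 2: 0.1, 3: 0.2, 4: 0.5, 5: 0.2}
print(round(weighted_score(probs), 2))
```

The weighted form produces finer-grained scores than an integer verdict, which helps when many texts would otherwise tie.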
MT-Bench style
MT_BENCH_CATEGORIES = ['writing', 'roleplay', 'reasoning', 'math', 'coding', 'extraction', 'STEM', 'humanities']

def mt_bench_eval(model_a, model_b, judge):
    questions = load_mt_bench()  # loader is not defined in this note
    scores = {'A': 0, 'B': 0, 'tie': 0}
    for q in questions:
        r_a = model_a.generate(q.prompt)
        r_b = model_b.generate(q.prompt)
        winner = fair_pairwise(q.prompt, r_a, r_b, judge)
        # .get() tolerates labels other than 'A', 'B', 'tie' instead of raising KeyError.
        scores[winner] = scores.get(winner, 0) + 1
    return scores
AlpacaEval (vs reference)
def alpaca_eval(model, reference_model, judge, dataset):
    wins = 0
    for q in dataset:
        ours = model.generate(q)
        ref = reference_model.generate(q)
        # Our model is always shown as Response A, the reference as Response B.
        verdict = pairwise_judge(q, ours, ref, judge)
        if verdict.winner == 'A':
            wins += 1
    return wins / len(dataset)  # win rate against the reference
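A win rate computed on a small dataset is noisy, so it helps to report an interval alongside it. The sketch below uses the Wilson score interval, which is an addition here and not part of AlpacaEval itself; `win_rate_interval` is a hypothetical helper name.

```python
import math


def win_rate_interval(wins, total, z=1.96):
    """Wilson score interval for a win rate (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = wins / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# Example: 60 wins out of 100 comparisons.
low, high = win_rate_interval(wins=60, total=100)
print(f"win rate 0.60, interval [{low:.3f}, {high:.3f}]")
```

If the interval contains 0.5, the comparison does not yet separate the model from the reference.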
Length-controlled (mitigate length bias)
def length_normalize(score, response_length):
    """Crude heuristic: discount high scores that may be inflated by length."""
    if response_length > 1000 and score > 8:
        return score - 0.5  # conservative adjustment
    return score
Cross-judge (multiple LLMs)
from collections import Counter

def cross_judge(q, a, b, judges):
    """Majority vote across different judge LLMs to reduce self-preference."""
    votes = []
    for judge in judges:
        v = pairwise_judge(q, a, b, judge)
        votes.append(v.winner)
    return Counter(votes).most_common(1)[0][0]
Calibrate against human
def calibrate_judge(human_pairs, judge):
    """Fraction of pairs where the judge agrees with the human label."""
    agreement = 0
    for pair, human_winner in human_pairs:
        # Compare the parsed winner label, not the whole verdict object.
        judge_winner = pairwise_judge(pair.q, pair.a, pair.b, judge).winner
        if judge_winner == human_winner:
            agreement += 1
    return agreement / len(human_pairs)
# Rule of thumb: agreement above 0.8 is good.
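Raw agreement can look high by chance when one label dominates the human sample. As a complement, and as an addition to this note, Cohen's kappa corrects agreement for chance. A minimal sketch over two equal-length label lists; `cohens_kappa` is a hypothetical helper written here, not a call into any library.

```python
from collections import Counter


def cohens_kappa(human_labels, judge_labels):
    """Chance-corrected agreement between two lists of labels."""
    if not human_labels or len(human_labels) != len(judge_labels):
        raise ValueError("label lists must be non-empty and equal in length")
    n = len(human_labels)
    observed = sum(h == j for h, j in zip(human_labels, judge_labels)) / n
    human_counts = Counter(human_labels)
    judge_counts = Counter(judge_labels)
    # Agreement expected if both raters labeled independently at their own rates.
    expected = sum(human_counts[k] * judge_counts[k] for k in human_counts) / (n * n)
    if expected == 1:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


human = ["A", "A", "B", "B"]
print(cohens_kappa(human, ["A", "A", "B", "B"]))  # perfect agreement
print(cohens_kappa(human, ["A", "B", "A", "B"]))  # agreement at chance level
```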
Constitutional principles judge
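def constitutional_check(response, principles, judge):
    """Return the principles that the judge says the response violates."""
    violations = []
    for p in principles:
        verdict = judge.generate(f'Does this violate "{p}"? Yes/No.\n{response}')
        # Match only a leading "yes" so words such as "eyes" do not count.
        if verdict.strip().lower().startswith('yes'):
            violations.append(p)
    return violations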
LLM-judge for RLHF data
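def generate_preference_data(prompts, model, judge):
    pairs = []
    for p in prompts:
        # Sample two candidates from the same model.
        a = model.generate(p, temperature=0.7)
        b = model.generate(p, temperature=0.7)
        winner = pairwise_judge(p, a, b, judge).winner
        if winner not in ('A', 'B'):
            continue  # skip ties: they carry no preference signal
        chosen, rejected = (a, b) if winner == 'A' else (b, a)
        pairs.append({'prompt': p, 'chosen': chosen, 'rejected': rejected})
    return pairs  # preference pairs, e.g. for DPO training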
Cost tracking
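def cost_aware_eval(items, judge, max_cost=10):
    cost = 0
    scores = []
    for item in items:
        if cost > max_cost:
            break  # stop once the budget is exhausted
        cost += judge_cost(item, judge)  # judge_cost is not defined in this note
        scores.append(judge.generate(...))  # prompt elided in the original
    return scores, cost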
Prompt template
JUDGE_PROMPT_TEMPLATE: |
  You are an impartial judge.
  Evaluate the response on:
  - Accuracy
  - Helpfulness
  - Safety
  - Clarity
  DO NOT be influenced by:
  - Length (don't favor longer)
  - Style (don't favor similar to your own)
  - Position (treat A and B equally)
  Question: {question}
  Response A: {response_a}
  Response B: {response_b}
  Output JSON: { winner, reason, scores: { A: {...}, B: {...} } }
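Judges often wrap the requested JSON in extra prose, so the reply usually needs defensive parsing. The sketch below scans for the first JSON object that carries a valid `winner` field; `parse_judge_json` is a hypothetical helper, and the accepted labels (`A`, `B`, `tie`) are an assumption carried over from the pairwise prompt earlier in this note.

```python
import json


def parse_judge_json(raw):
    """Return the first JSON object with a valid 'winner', or None."""
    decoder = json.JSONDecoder()
    start = raw.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at the given index.
            obj, _ = decoder.raw_decode(raw, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict) and obj.get("winner") in ("A", "B", "tie"):
            return obj
        start = raw.find("{", start + 1)
    return None


reply = 'Here is my verdict: {"winner": "A", "reason": "more accurate"} Thanks.'
print(parse_judge_json(reply))
```

Returning `None` instead of raising lets the caller decide whether to retry the judge or record the item as unparseable.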
Decision criteria
| Situation | Approach |
|---|---|
| Quick eval | Pairwise + swap |
| Detailed | Rubric (G-Eval) |
| Production monitor | Single-answer score |
| RLHF data | Pairwise preferences |
| Cross-validate | Multiple judges |
Default: pairwise + swap + length normalization; cross-judge for important evaluations; calibrate against a human-labeled sample; set a cost cap.
🔗 Graph
- Variants: MT-Bench
- Applications: RLHF · DPO · Hallucination-in-LLMs
- Adjacent: Foundation-Models · Iterative Prompting · Best-of-N_Sampling
🤖 LLM usage
When to use: LLM evaluation, RLHF data generation, monitoring. When not to use: when ground truth is available (use exact match instead).
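When a reference answer exists, a deterministic check is cheaper and more reproducible than a judge call. A minimal sketch of normalized exact match; the normalization steps (lowercasing, stripping punctuation, collapsing whitespace) are assumptions to adapt per task, and both function names are hypothetical.

```python
import re
import string


def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def exact_match(prediction, reference):
    """True when prediction and reference agree after normalization."""
    return normalize(prediction) == normalize(reference)


print(exact_match("  Paris. ", "paris"))
print(exact_match("Paris", "London"))
```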
❌ Anti-patterns
- No swap: position bias goes uncorrected.
- Same-family judge: invites self-preference.
- No human calibration: the judge is trusted blindly.
- Single-shot judge: verdicts are noisy.
- Ignoring the length effect: length bias goes unchecked.
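The single-shot anti-pattern can be softened by asking the judge several times and taking the majority. The sketch below assumes a zero-argument callable that returns one label per call (for example, a closure around a pairwise judge sampled at non-zero temperature); `majority_verdict` and `judge_once` are hypothetical names.

```python
from collections import Counter


def majority_verdict(judge_once, n=5):
    """Call judge_once() n times; return (majority label, share of votes)."""
    if n <= 0:
        raise ValueError("n must be positive")
    votes = Counter(judge_once() for _ in range(n))
    label, count = votes.most_common(1)[0]
    return label, count / n


# Stand-in for a noisy judge: replays a fixed sequence of verdicts.
replayed = iter(["A", "B", "A", "A", "tie"])
print(majority_verdict(lambda: next(replayed), n=5))
```

A low vote share signals an unstable verdict that may deserve human review instead of automatic acceptance.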
🧪 Verification / duplicates
- Verified (Zheng MT-Bench 2023, Liu G-Eval 2023, Dubois AlpacaEval).
- Trust level: A.
🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: biases + pairwise / G-Eval / MT-Bench / cross-judge code |