---
id: wiki-2026-0508-llm-as-a-judge-laaj
title: LLM-as-a-Judge (LaaJ)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [LLM judge, LaaJ, AI eval, automated eval, MT-Bench, AlpacaEval]
duplicate_of: none
source_trust_level: A
confidence_score: 0.93
verification_status: applied
tags: [llm, evaluation, judge, automation, alpacaeval, mt-bench]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: Anthropic / OpenAI / G-Eval
---

# LLM-as-a-Judge (LaaJ)

## In one line
> **"Use an LLM as the evaluator that scores or compares the output of another LLM."** Cheaper than human evaluation. Well-known examples: MT-Bench (Zheng 2023), AlpacaEval, G-Eval. Caveat: judges are biased (length, position, similar style).

## Core

### Use cases
- Model A vs B comparison.
- Quality score (0-10).
- Specific criteria check (helpful, harmless, factual).
- RLHF preference data generation.
- Production monitoring.

### Known biases
- **Position**: the first answer is favored.
- **Length**: longer is rated as better (often falsely).
- **Style match**: answers in a style similar to the judge's own are favored.
- **Self-preference**: output from a model in the judge's own family is favored.

### Applications
1. Evaluating an LLM in production.
2. Iterative prompt refinement.
3. RLHF preference data.
4. Benchmarking.

## 💻 Patterns

### Pairwise judge (MT-Bench style)
```python
from types import SimpleNamespace

def pairwise_judge(question, response_a, response_b, judge_llm):
    prompt = f"""Compare two AI responses.

Question: {question}
Response A: {response_a}
Response B: {response_b}

Output:
- winner: A | B | tie
- reason: 1 sentence"""
    raw = judge_llm.generate(prompt)  # assumed to return the judge's reply as plain text
    # Read the token after "winner:"; a missing or unparseable verdict counts as a tie.
    token = raw.lower().partition("winner:")[2].split()[:1]
    winner = {"a": "A", "b": "B"}.get(token[0].strip("*.,") if token else "", "tie")
    return SimpleNamespace(winner=winner, raw=raw)
```

### Position bias mitigation (swap)
```python
def fair_pairwise(q, a, b, judge):
    r1 = pairwise_judge(q, a, b, judge)
    r2 = pairwise_judge(q, b, a, judge)  # same pair with positions swapped
    # After the swap, r2's 'B' refers to the original response a.
    if r1.winner == 'A' and r2.winner == 'B':
        return 'A wins both'
    if r1.winner == 'B' and r2.winner == 'A':
        return 'B wins both'
    return 'tie or position-biased'
```

### Single-answer score (rubric)
```python
import json

def rubric_score(response, judge):
    prompt = f"""Score 1-10 on:
- helpfulness
- correctness
- clarity
- safety

Response: {response}

Output JSON: {{"helpfulness": ..., "correctness": ..., "clarity": ..., "safety": ..., "overall": ...}}"""
    # Assumes the judge replies with bare JSON; json.loads raises ValueError otherwise.
    return json.loads(judge.generate(prompt))
```

### G-Eval (chain-of-thought judge, Liu 2023)
```python
def g_eval(text, criterion, judge):
    """Ask the judge to reason step by step before giving a score."""
    prompt = f"""Evaluate: {criterion}
Text: {text}

Reasoning step-by-step:
1. ...
2. ...

Final score (1-5): N"""
    return judge.generate(prompt)  # raw reply: the reasoning followed by the final score
```

### MT-Bench style
```python
MT_BENCH_CATEGORIES = ['writing', 'roleplay', 'reasoning', 'math',
                       'coding', 'extraction', 'STEM', 'humanities']

def mt_bench_eval(model_a, model_b, judge):
    questions = load_mt_bench()  # loader not defined in this note; items expose .prompt
    scores = {'A': 0, 'B': 0, 'tie': 0}
    for q in questions:
        r_a = model_a.generate(q.prompt)
        r_b = model_b.generate(q.prompt)
        outcome = fair_pairwise(q.prompt, r_a, r_b, judge)
        # fair_pairwise returns a description string; map it onto the score keys.
        winner = {'A wins both': 'A', 'B wins both': 'B'}.get(outcome, 'tie')
        scores[winner] += 1
    return scores
```

### AlpacaEval (vs reference)
```python
def alpaca_eval(model, reference_model, judge, dataset):
    wins = 0
    for q in dataset:
        ours = model.generate(q)
        ref = reference_model.generate(q)
        # Our model is always Response A, so position bias is not controlled here.
        verdict = pairwise_judge(q, ours, ref, judge)
        if verdict.winner == 'A':
            wins += 1
    return wins / len(dataset)  # win rate against the reference
```

### Length-controlled (mitigate length bias)
```python
def length_normalize(score, response_length):
    """Dampen high scores that response length may have inflated."""
    # Heuristic thresholds: long responses with very high scores get a small penalty.
    if response_length > 1000 and score > 8:
        return score - 0.5  # conservative adjustment
    return score
```

### Cross-judge (multiple LLMs)
```python
from collections import Counter

def cross_judge(q, a, b, judges):
    """Poll several different judge LLMs to reduce self-preference."""
    votes = []
    for judge in judges:
        v = pairwise_judge(q, a, b, judge)
        votes.append(v.winner)
    # Majority vote; on equal counts the verdict seen first wins.
    return Counter(votes).most_common(1)[0][0]
```

### Calibrate against human
```python
def calibrate_judge(human_pairs, judge):
    """Measure how often the judge agrees with human labels."""
    agreement = 0
    for pair, human_winner in human_pairs:
        judge_winner = pairwise_judge(pair.q, pair.a, pair.b, judge).winner
        if judge_winner == human_winner:
            agreement += 1
    return agreement / len(human_pairs)  # above 0.8 is treated as good here
```

### Constitutional principles judge
```python
def constitutional_check(response, principles, judge):
    violations = []
    for p in principles:
        verdict = judge.generate(f'Does this violate "{p}"? Yes/No.\n{response}')
        # Match only a leading "yes" so words like "eyes" do not count as a violation.
        if verdict.strip().lower().startswith('yes'):
            violations.append(p)
    return violations
```

### LLM-judge for RLHF data
```python
def generate_preference_data(prompts, model, judge):
    pairs = []
    for p in prompts:
        a = model.generate(p, temperature=0.7)
        b = model.generate(p, temperature=0.7)
        winner = pairwise_judge(p, a, b, judge).winner
        if winner == 'tie':
            continue  # a tie carries no preference signal
        pairs.append({'prompt': p,
                      'chosen': a if winner == 'A' else b,
                      'rejected': b if winner == 'A' else a})
    return pairs  # feed into DPO training
```

### Cost tracking
```python
def cost_aware_eval(items, judge, max_cost=10):
    cost = 0
    scores = []
    for item in items:
        if cost > max_cost:  # stop once the budget has been exceeded
            break
        cost += judge_cost(item, judge)  # judge_cost: cost estimator, not defined in this note
        scores.append(judge.generate(...))  # prompt construction elided in the source
    return scores, cost
```

### Prompt template
```yaml
JUDGE_PROMPT_TEMPLATE: |
  You are an impartial judge. Evaluate the responses on:
  - Accuracy
  - Helpfulness
  - Safety
  - Clarity

  DO NOT be influenced by:
  - Length (don't favor longer)
  - Style (don't favor similar to your own)
  - Position (treat A and B equally)

  Question: {question}
  Response A: {response_a}
  Response B: {response_b}

  Output JSON: { winner, reason, scores: { A: {...}, B: {...} } }
```

## Decision criteria

| Situation | Approach |
|---|---|
| Quick eval | Pairwise + swap |
| Detailed | Rubric (G-Eval) |
| Production monitor | Single-answer score |
| RLHF data | Pairwise preferences |
| Cross-validate | Multiple judges |

**Default**: pairwise + swap + length-normalize, plus cross-judge for important evals, calibration against a human-labeled sample, and a cost cap.

## 🔗 Graph
- Variants: [[MT-Bench]]
- Applications: [[RLHF]] · [[DPO]] · [[Hallucination-in-LLMs]]
- Adjacent: [[Foundation-Models]] · [[Iterative Prompting]] · [[Best-of-N_Sampling]]

## 🤖 LLM usage
**When**: LLM evaluation, RLHF data, monitoring.
**When not**: ground truth is available (use exact match).

## ❌ Anti-patterns
- **No swap**: position bias goes unchecked.
- **Same family judge**: self-preference.
- **No human calibration**: trusting the judge blindly.
- **Single-shot judge**: noisy verdicts.
- **Ignore length effect**: length bias.

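
The "Single-shot judge" anti-pattern has no matching pattern in this note. One possible mitigation is to sample the judge several times and take the majority verdict. The sketch below is illustrative only: `repeated_verdict` and `judge_once` are hypothetical names, and it assumes the judge is sampled with a nonzero temperature so that repeated calls can actually differ. A canned iterator stands in for the judge so the example runs without an API.

```python
from collections import Counter
from typing import Callable


def repeated_verdict(judge_once: Callable[[], str], n: int = 5) -> tuple[str, float]:
    """Call a noisy judge n times; return the majority verdict and its vote share."""
    if n < 1:
        raise ValueError("n must be at least 1")
    votes = Counter(judge_once() for _ in range(n))
    # most_common(1) yields the top verdict; equal counts keep first-seen order.
    winner, count = votes.most_common(1)[0]
    return winner, count / n


# Stand-in for a real judge call (hypothetical): five canned verdicts.
canned = iter(["A", "B", "A", "A", "tie"])
verdict, share = repeated_verdict(lambda: next(canned), n=5)
print(verdict, share)  # A 0.6
```

A low vote share signals an unstable verdict; such items may be better treated as ties or routed to human review than counted as wins.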
## 🧪 Verification / duplicates
- Verified (Zheng MT-Bench 2023, Liu G-Eval 2023, Dubois AlpacaEval).
- Trust level: A.

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: biases + pairwise / G-Eval / MT-Bench / cross-judge code |