2nd/10_Wiki/Topics/AI_and_ML/LLM-as-a-Judge_LaaJ.md
Antigravity Agent f8b21af4be Wiki cleanup: error-doc removal, dedup merge, link normalization
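Large-scale cleanup of 10_Wiki/Topics:
- Removed 227 error-capture / unfinished stub documents
- Merged 43 cross-folder duplicate clusters (63 files → redirect)
- Link name normalization: ~2,400 broken-link fixes, direct redirect links, and concept mappings
- Created 6 new category MOCs
- Removed 10,058 unresolved related-keyword links from Graph sections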

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 23:52:15 +09:00

id: wiki-2026-0508-llm-as-a-judge-laaj
title: LLM-as-a-Judge (LaaJ)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: LLM judge, LaaJ, AI eval, automated eval, MT-Bench, AlpacaEval
duplicate_of: none
source_trust_level: A
confidence_score: 0.93
verification_status: applied
tags: llm, evaluation, judge, automation, alpacaeval, mt-bench
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: Anthropic / OpenAI / G-Eval

LLM-as-a-Judge (LaaJ)

One-line summary

"Use an LLM as the evaluator to score or compare the outputs of other LLMs." A cheaper alternative to human evaluation. Well-known examples: MT-Bench (Zheng 2023), AlpacaEval, G-Eval. Caveat: the judge is biased (length, position, similar style).

Key points

Use cases

  • Model A vs. B comparison.
  • Quality score (0-10).
  • Specific criteria checks (helpful, harmless, factual).
  • RLHF preference data generation.
  • Production monitoring.

Known biases

  • Position: the first answer is favored.
  • Length: longer answers are rated as better (often wrongly).
  • Style match: answers written in a style similar to the judge's own are favored.
  • Self-preference: outputs from the judge's own model family are favored.

Applications

  1. Eval LLM in production.
  2. Iterative prompt refinement.
  3. RLHF preference data.
  4. Benchmark.

💻 Patterns

Pairwise judge (MT-Bench style)

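import json
from collections import namedtuple

Verdict = namedtuple('Verdict', ['winner', 'reason'])

def pairwise_judge(question, response_a, response_b, judge_llm):
    # judge_llm is any client exposing generate(prompt) -> str.
    prompt = f"""Compare two AI responses.

Question: {question}

Response A: {response_a}
Response B: {response_b}

Output JSON only: {{"winner": "A" | "B" | "tie", "reason": "<1 sentence>"}}"""
    # Parse the reply so callers can read .winner and .reason.
    data = json.loads(judge_llm.generate(prompt))
    return Verdict(data['winner'], data.get('reason', ''))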

Position bias mitigation (swap)

def fair_pairwise(q, a, b, judge):
    r1 = pairwise_judge(q, a, b, judge)
    r2 = pairwise_judge(q, b, a, judge)  # swapped order: 'A' now refers to b
    # Count a win only if the same response wins in both orders.
    if r1.winner == 'A' and r2.winner == 'B': return 'A'
    if r1.winner == 'B' and r2.winner == 'A': return 'B'
    return 'tie'  # genuine tie, or a verdict that flipped with position
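
Before relying on the swap, it can help to measure how often a judge's verdict actually changes with answer order. The following is a minimal sketch, not part of the original note: `position_flip_rate` and the `judge_fn(question, first, second)` callable are hypothetical names, and the toy judges exist only to illustrate the two extremes.

```python
def position_flip_rate(pairs, judge_fn):
    """Fraction of pairs whose verdict is inconsistent across the two orders.

    judge_fn(question, first, second) -> 'A' | 'B' | 'tie' is a hypothetical
    callable wrapping whatever judge is in use.
    """
    if not pairs:
        return 0.0
    flips = 0
    for question, a, b in pairs:
        forward = judge_fn(question, a, b)
        backward = judge_fn(question, b, a)
        # Map the swapped verdict back onto the original labelling.
        unswapped = {'A': 'B', 'B': 'A'}.get(backward, backward)
        if forward != unswapped:
            flips += 1
    return flips / len(pairs)


# Toy judge that always prefers whichever answer it sees first.
def always_first(question, first, second):
    return 'A'


# Toy judge that prefers the longer answer regardless of position.
def prefers_longer(question, first, second):
    return 'A' if len(first) > len(second) else 'B'
```

A high flip rate suggests that many single-order verdicts are driven by position rather than content.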

Single-answer score (rubric)

import json

def rubric_score(response, judge):
    prompt = f"""Score 1-10 on:
- helpfulness
- correctness
- clarity
- safety

Response: {response}

Output JSON: {{ "helpfulness": ..., "correctness": ..., "clarity": ..., "safety": ..., "overall": ... }}"""
    # json.loads raises ValueError if the judge returns anything but valid JSON.
    return json.loads(judge.generate(prompt))
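
Judge output is not guaranteed to follow the rubric, so the parsed dict is worth checking before it is stored. A minimal sketch, assuming the five keys and the 1-10 range from the prompt above; `validate_rubric` and `RUBRIC_KEYS` are illustrative names, not part of the original note.

```python
RUBRIC_KEYS = ('helpfulness', 'correctness', 'clarity', 'safety', 'overall')


def validate_rubric(scores, low=1, high=10):
    """Return a cleaned copy of the rubric dict, or raise ValueError."""
    cleaned = {}
    for key in RUBRIC_KEYS:
        if key not in scores:
            raise ValueError(f'missing rubric key: {key}')
        value = scores[key]
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f'non-numeric score for {key}: {value!r}')
        if not low <= value <= high:
            raise ValueError(f'{key}={value} is outside {low}-{high}')
        cleaned[key] = value
    return cleaned
```

Rejecting malformed scores outright, rather than clamping them, keeps silent judge failures out of aggregate metrics.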

G-Eval (chain-of-thought judge, Liu 2023)

def g_eval(text, criterion, judge):
    """Ask the judge to reason step by step before giving a score."""
    prompt = f"""Evaluate: {criterion}

Text: {text}

Reasoning step-by-step:
1. ...
2. ...

Final score (1-5): N"""
    return judge.generate(prompt)
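
The function above returns raw text, so the score still has to be extracted. The G-Eval paper also weights candidate scores by their probability. A sampling-based approximation is to call the judge several times and take the frequency-weighted mean. The sketch below assumes the judge repeats the `Final score (1-5):` line from the prompt in its reply; `extract_score` and `expected_score` are hypothetical helper names.

```python
import re
from collections import Counter

# Assumes the reply contains the "Final score (1-5): N" line from the prompt.
SCORE_RE = re.compile(r'Final score \(1-5\):\s*([1-5])')


def extract_score(judge_output):
    """Return the integer score found in one judge reply, or None."""
    match = SCORE_RE.search(judge_output)
    return int(match.group(1)) if match else None


def expected_score(judge_outputs):
    """Frequency-weighted mean score over several sampled judge replies."""
    scores = [s for s in map(extract_score, judge_outputs) if s is not None]
    if not scores:
        return None
    counts = Counter(scores)
    # Each distinct score is weighted by its empirical probability.
    return sum(score * n / len(scores) for score, n in counts.items())
```

Replies without a parseable score are dropped rather than counted as zero, so a high drop rate should be monitored separately.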

MT-Bench style

MT_BENCH_CATEGORIES = ['writing', 'roleplay', 'reasoning', 'math', 'coding', 'extraction', 'STEM', 'humanities']

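def mt_bench_eval(model_a, model_b, judge):
    # load_mt_bench() is assumed to return question objects with a .prompt attribute.
    questions = load_mt_bench()
    scores = {'A': 0, 'B': 0, 'tie': 0}
    for q in questions:
        r_a = model_a.generate(q.prompt)
        r_b = model_b.generate(q.prompt)
        # fair_pairwise must return one of the keys above: 'A', 'B', or 'tie'.
        winner = fair_pairwise(q.prompt, r_a, r_b, judge)
        scores[winner] += 1
    return scores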

AlpacaEval (vs reference)

def alpaca_eval(model, reference_model, judge, dataset):
    wins = 0
    for q in dataset:
        ours = model.generate(q)
        ref = reference_model.generate(q)
        verdict = pairwise_judge(q, ours, ref, judge)
        if verdict.winner == 'A': wins += 1
    return wins / len(dataset)  # win rate of the model against the reference
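
A win rate on a small dataset can move by several points from noise alone, so reporting it with a standard error is a cheap safeguard. A minimal sketch, assuming verdicts have already been collected as labels and that ties count as non-wins, as in the function above; `win_rate_with_se` is an illustrative name.

```python
import math


def win_rate_with_se(verdicts):
    """Return (win_rate, standard_error) for a list of 'A' / 'B' / 'tie' labels.

    'A' means the evaluated model won; anything else counts as a non-win.
    """
    n = len(verdicts)
    if n == 0:
        raise ValueError('no verdicts to aggregate')
    rate = sum(1 for v in verdicts if v == 'A') / n
    # Binomial standard error of a proportion.
    se = math.sqrt(rate * (1 - rate) / n)
    return rate, se
```

Two models whose win rates differ by less than about two standard errors are hard to separate on that dataset.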

Length-controlled (mitigate length bias)

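def length_normalize(score, response_length):
    """Dampen high scores that may have been inflated by response length."""
    # Heuristic: only very long responses with very high scores are adjusted.
    if response_length > 1000 and score > 8:
        return score - 0.5  # conservative adjustment
    return score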
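
Whether such an adjustment is needed at all can be checked by correlating response length with judge score on a sample. A minimal sketch using a hand-rolled Pearson correlation; `length_score_correlation` is a hypothetical helper, and any threshold for "too correlated" is left to the reader.

```python
import math


def length_score_correlation(lengths, scores):
    """Pearson correlation between response lengths and judge scores."""
    n = len(lengths)
    if n != len(scores) or n < 2:
        raise ValueError('need two equal-length samples with at least 2 items')
    mean_l = sum(lengths) / n
    mean_s = sum(scores) / n
    cov = sum((l - mean_l) * (s - mean_s) for l, s in zip(lengths, scores))
    var_l = sum((l - mean_l) ** 2 for l in lengths)
    var_s = sum((s - mean_s) ** 2 for s in scores)
    if var_l == 0 or var_s == 0:
        return 0.0  # no variation, so no measurable relationship
    return cov / math.sqrt(var_l * var_s)
```

A strong positive value does not prove bias, since longer answers may really be better, but it marks the judge for a closer look.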

Cross-judge (multiple LLMs)

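from collections import Counter

def cross_judge(q, a, b, judges):
    """Poll several different judge LLMs to reduce self-preference bias."""
    votes = []
    for judge in judges:
        v = pairwise_judge(q, a, b, judge)
        votes.append(v.winner)
    # Majority vote; on a tied count, most_common keeps the first-seen label.
    return Counter(votes).most_common(1)[0][0]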
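
With an even number of judges, or when judges disagree three ways, a plain plurality vote can hide a split decision. One possible variant, sketched here as an assumption rather than as the note's method, reports a tie unless one label holds a strict majority; `majority_verdict` is an illustrative name.

```python
from collections import Counter


def majority_verdict(votes):
    """Return the label with a strict majority of votes, else 'tie'."""
    if not votes:
        return 'tie'
    label, top = Counter(votes).most_common(1)[0]
    # Require more than half of all votes; otherwise treat it as undecided.
    return label if top > len(votes) / 2 else 'tie'
```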

Calibrate against human

def calibrate_judge(human_pairs, judge):
    """Fraction of pairs on which the judge agrees with the human label."""
    agreement = 0
    for pair, human_winner in human_pairs:
        verdict = pairwise_judge(pair.q, pair.a, pair.b, judge)
        if verdict.winner == human_winner: agreement += 1
    return agreement / len(human_pairs)
# Rule of thumb: agreement > 0.8 is good.
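
Raw agreement can look high simply because one label dominates. Cohen's kappa corrects for agreement expected by chance and is a common companion metric. A minimal sketch, assuming the human and judge labels are already collected as two parallel lists; `cohens_kappa` is written out by hand here rather than taken from a library.

```python
from collections import Counter


def cohens_kappa(human_labels, judge_labels):
    """Chance-corrected agreement between two parallel label lists."""
    n = len(human_labels)
    if n == 0 or n != len(judge_labels):
        raise ValueError('label lists must be non-empty and of equal length')
    observed = sum(h == j for h, j in zip(human_labels, judge_labels)) / n
    human_freq = Counter(human_labels)
    judge_freq = Counter(judge_labels)
    # Probability that both raters pick the same label by chance.
    expected = sum(human_freq[k] * judge_freq[k] for k in human_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

A kappa near 0 means the judge agrees with humans no more often than chance would predict, even if raw agreement looks acceptable.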

Constitutional principles judge

def constitutional_check(response, principles, judge):
    violations = []
    for p in principles:
        verdict = judge.generate(f'Does this violate "{p}"? Yes/No.\n{response}')
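        # Match only a leading "yes" so that words such as "eyes" do not count as a violation.
        if verdict.strip().lower().startswith('yes'): violations.append(p)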
    return violations

LLM-judge for RLHF data

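def generate_preference_data(prompts, model, judge):
    pairs = []
    for p in prompts:
        # Sample two candidates at non-zero temperature so that they differ.
        a = model.generate(p, temperature=0.7)
        b = model.generate(p, temperature=0.7)
        winner = pairwise_judge(p, a, b, judge).winner
        if winner not in ('A', 'B'): continue  # skip ties: no preference signal
        pairs.append({'prompt': p, 'chosen': a if winner == 'A' else b, 'rejected': b if winner == 'A' else a})
    return pairs  # preference pairs, e.g. for DPO training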

Cost tracking

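def cost_aware_eval(items, judge, max_cost=10):
    # judge_cost(item, judge) is assumed to return the estimated cost of one call.
    cost = 0
    scores = []
    for item in items:
        item_cost = judge_cost(item, judge)
        if cost + item_cost > max_cost: break  # stop before exceeding the budget
        cost += item_cost
        scores.append(judge.generate(...))  # prompt left elided, as in the source
    return scores, cost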
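
The note does not define `judge_cost`. One way to approximate it, sketched here under loud assumptions, is to estimate tokens from character count and multiply by per-token prices. `estimate_judge_cost` is a hypothetical helper, the 4-characters-per-token ratio is only a rough heuristic for English text, and the prices must be supplied by the caller because they vary by provider and model.

```python
def estimate_judge_cost(prompt, *, price_per_1k_input, price_per_1k_output,
                        expected_output_tokens=200, chars_per_token=4):
    """Rough cost estimate for one judge call, in the currency of the prices.

    All parameters are assumptions: token counts are approximated from
    character length, and no real provider pricing is built in.
    """
    input_tokens = len(prompt) / chars_per_token
    return (input_tokens * price_per_1k_input
            + expected_output_tokens * price_per_1k_output) / 1000
```

For budgeting that has to be accurate, the token counts reported by the provider's API response are a better basis than this estimate.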

Prompt template

JUDGE_PROMPT_TEMPLATE: |
  You are an impartial judge.
  Evaluate the response on:
  - Accuracy
  - Helpfulness
  - Safety
  - Clarity
  
  DO NOT be influenced by:
  - Length (don't favor longer)
  - Style (don't favor similar to your own)
  - Position (treat A and B equally)
  
  Question: {question}
  Response A: {response_a}
  Response B: {response_b}
  
  Output JSON: { winner, reason, scores: { A: {...}, B: {...} } }
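
When this template is filled from Python, `str.format` is a poor fit because the JSON hint on the last line contains literal braces. A minimal sketch of a single-pass substitution that touches only the named placeholders; `fill_template` is a hypothetical helper, and the template below is an abbreviated copy used only for the demonstration.

```python
import re

# Abbreviated copy of the template above, for demonstration only.
JUDGE_PROMPT_TEMPLATE = """You are an impartial judge.
Question: {question}
Response A: {response_a}
Response B: {response_b}

Output JSON: { winner, reason, scores: { A: {...}, B: {...} } }"""


def fill_template(template, **fields):
    """Replace {name} placeholders for the given fields in a single pass."""
    if not fields:
        return template
    names = '|'.join(re.escape(name) for name in fields)
    pattern = re.compile(r'\{(' + names + r')\}')
    # Substituted text is not rescanned, so braces inside values are safe.
    return pattern.sub(lambda m: str(fields[m.group(1)]), template)
```

Because the substitution is single-pass, a response that happens to contain the text `{response_b}` is inserted verbatim rather than expanded again.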

Decision criteria

| Situation | Approach |
| --- | --- |
| Quick eval | Pairwise + swap |
| Detailed | Rubric (G-Eval) |
| Production monitor | Single-answer score |
| RLHF data | Pairwise preferences |
| Cross-validate | Multiple judges |

Default: pairwise + swap + length normalization, with cross-judging for important evaluations, calibration against a human-labeled sample, and a cost cap.

🔗 Graph

🤖 LLM usage

When to use: LLM evaluation, RLHF data generation, monitoring. When not to use: when ground truth is available (use exact match instead).

Anti-patterns

  • No swap: position bias goes uncorrected.
  • Same-family judge: self-preference.
  • No human calibration: the judge is trusted blindly.
  • Single-shot judge: noisy verdicts.
  • Ignoring the length effect: length bias.
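
For the single-shot anti-pattern, one simple mitigation is to query the judge several times and aggregate. A minimal sketch, assuming a hypothetical zero-argument `score_fn` that performs one judge call and returns a number; `repeated_score` is an illustrative name.

```python
import statistics


def repeated_score(score_fn, n=5):
    """Call score_fn n times; return (median, spread) of the sampled scores.

    score_fn is a hypothetical zero-argument callable wrapping one judge call.
    """
    if n < 1:
        raise ValueError('n must be at least 1')
    samples = [score_fn() for _ in range(n)]
    # The median is robust to an occasional outlier verdict.
    return statistics.median(samples), max(samples) - min(samples)
```

A large spread is itself a useful signal: it flags items on which the judge is unstable and where a human check may be worthwhile.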

🧪 Verification / duplicates

  • Verified (Zheng MT-Bench 2023, Liu G-Eval 2023, Dubois AlpacaEval).
  • Trust level: A.

🕓 Changelog

| Date | Change |
| --- | --- |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: biases + pairwise / G-Eval / MT-Bench / cross-judge code |