---
id: wiki-2026-0508-llm-as-a-judge-laaj
title: LLM-as-a-Judge (LaaJ)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [LLM judge, LaaJ, AI eval, automated eval, MT-Bench, AlpacaEval]
duplicate_of: none
source_trust_level: A
confidence_score: 0.93
verification_status: applied
tags: [llm, evaluation, judge, automation, alpacaeval, mt-bench]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: Anthropic / OpenAI / G-Eval
---

# LLM-as-a-Judge (LaaJ)

## One-line summary
> **"Use an LLM as the evaluator that scores or compares another LLM's output."** It is cheaper than human evaluation. Well-known examples: MT-Bench (Zheng et al., 2023), AlpacaEval, G-Eval. Caveat: judges are biased (length, position, similar style).

## Core concepts

### Use cases
- Model A vs. B comparison.
- Quality scoring (0-10).
- Checking specific criteria (helpful, harmless, factual).
- RLHF preference data generation.
- Production monitoring.

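### Known biases
- **Position**: the answer shown first tends to be favored.
- **Length**: longer answers are rated higher, often without being better.
- **Style match**: answers written in a style similar to the judge's own are favored.
- **Self-preference**: output from the judge's own model family is favored.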

### Applications
1. Eval LLM in production.
2. Iterative prompt refinement.
3. RLHF preference data.
4. Benchmark.

## 💻 Patterns

### Pairwise judge (MT-Bench style)
```python
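import re
from collections import namedtuple

Verdict = namedtuple("Verdict", ["winner", "raw"])

def pairwise_judge(question, response_a, response_b, judge_llm):
    prompt = f"""Compare two AI responses.

Question: {question}

Response A: {response_a}
Response B: {response_b}

Output:
- winner: A | B | tie
- reason: 1 sentence"""
    raw = judge_llm.generate(prompt)  # assumed to return the reply as text
    # Parse the "winner:" line; an unparseable reply counts as a tie.
    m = re.search(r"winner:\s*(A|B|tie)\b", raw, re.IGNORECASE)
    label = m.group(1) if m else "tie"
    winner = "tie" if label.lower() == "tie" else label.upper()
    return Verdict(winner, raw)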
```

### Position bias mitigation (swap)
```python
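def fair_pairwise(q, a, b, judge):
    """Judge twice with the order swapped; a win counts only if it survives the swap."""
    r1 = pairwise_judge(q, a, b, judge)
    r2 = pairwise_judge(q, b, a, judge)  # swapped: position A now holds b
    # a wins both rounds when it is labeled 'A' first and 'B' after the swap.
    if r1.winner == 'A' and r2.winner == 'B': return 'A wins both'
    if r1.winner == 'B' and r2.winner == 'A': return 'B wins both'
    return 'tie or position-biased'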
```
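
To see how much position bias a judge actually shows, one option is to measure how often its verdict survives the swap. The following is a minimal sketch, assuming the verdicts have already been collected as `(first_order, swapped_order)` pairs of `'A'` / `'B'` / `'tie'` labels. The name `position_consistency` is illustrative and does not come from MT-Bench.

```python
def position_consistency(verdict_pairs):
    """Fraction of pairs whose verdict names the same response in both orders.

    Each item is (r1, r2): r1 is the winner label with (a, b) shown in that
    order, r2 is the winner label after swapping to (b, a).
    """
    # After the swap, the same underlying response carries the opposite label.
    flipped = {"A": "B", "B": "A", "tie": "tie"}
    if not verdict_pairs:
        return 0.0
    consistent = sum(1 for r1, r2 in verdict_pairs if flipped[r1] == r2)
    return consistent / len(verdict_pairs)


pairs = [("A", "B"), ("A", "A"), ("tie", "tie"), ("B", "A")]
print(position_consistency(pairs))  # 3 of the 4 pairs are consistent
```

A low value suggests the judge is reacting to position rather than content, in which case single-order verdicts from that judge should not be trusted.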

### Single-answer score (rubric)
```python
import json

def rubric_score(response, judge):
    prompt = f"""Score 1-10 on:
- helpfulness
- correctness
- clarity
- safety

Response: {response}

Output JSON only: {{ "helpfulness": ..., "correctness": ..., "clarity": ..., "safety": ..., "overall": ... }}"""
    # json.loads raises json.JSONDecodeError if the judge adds text around the JSON.
    return json.loads(judge.generate(prompt))
```
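
Judges often wrap the JSON in extra prose or return out-of-range numbers, so the raw reply usually needs validation before it is used. The sketch below assumes the four criteria from the rubric above; `parse_rubric_reply` and `RUBRIC_KEYS` are illustrative names, and recomputing `overall` as a plain mean is one possible choice, not a fixed rule.

```python
import json
import re

RUBRIC_KEYS = ("helpfulness", "correctness", "clarity", "safety")

def parse_rubric_reply(reply, lo=1, hi=10):
    """Extract the JSON object from a judge reply and validate the scores.

    Returns a dict of clamped scores, or None if the reply cannot be parsed.
    """
    # Grab everything from the first "{" to the last "}" in the reply.
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if match is None:
        return None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    scores = {}
    for key in RUBRIC_KEYS:
        value = data.get(key)
        if not isinstance(value, (int, float)):
            return None  # missing or non-numeric criterion
        scores[key] = min(hi, max(lo, value))
    # Recompute the overall score rather than trusting the judge's arithmetic.
    scores["overall"] = sum(scores[k] for k in RUBRIC_KEYS) / len(RUBRIC_KEYS)
    return scores
```

Returning `None` instead of raising lets the caller decide whether to retry the judge call or drop the item.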

### G-Eval (chain-of-thought judge, Liu 2023)
```python
def g_eval(text, criterion, judge):
    """Ask the judge to reason step by step before giving a score."""
    prompt = f"""Evaluate: {criterion}

Text: {text}

Reasoning step-by-step:
1. ...
2. ...

Final score (1-5): N"""
    return judge.generate(prompt)
```
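
Besides chain-of-thought prompting, the G-Eval paper weights each candidate score by the probability the judge assigns to it instead of taking a single sampled integer, which gives finer-grained scores with fewer ties. A minimal sketch of that weighting step follows, assuming the per-score probabilities were obtained elsewhere (from token log-probabilities or from repeated sampling). The function name and input format are illustrative.

```python
def weighted_score(score_probs):
    """Expected score: sum of score * probability, after renormalizing.

    score_probs maps each candidate score (e.g. 1..5) to the probability
    mass the judge put on it. The mass may not sum to 1 if some probability
    went to non-score tokens, so it is renormalized here.
    """
    total = sum(score_probs.values())
    if total <= 0:
        raise ValueError("no probability mass on any score")
    return sum(score * p for score, p in score_probs.items()) / total


probs = {1: 0.0, 2: 0.1, 3: 0.2, 4: 0.5, 5: 0.2}
print(weighted_score(probs))
```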

### MT-Bench style
```python
MT_BENCH_CATEGORIES = ['writing', 'roleplay', 'reasoning', 'math', 'coding', 'extraction', 'STEM', 'humanities']

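def mt_bench_eval(model_a, model_b, judge):
    questions = load_mt_bench()  # hypothetical loader; items expose a .prompt attribute
    scores = {'A': 0, 'B': 0, 'tie': 0}
    for q in questions:
        r_a = model_a.generate(q.prompt)
        r_b = model_b.generate(q.prompt)
        verdict = fair_pairwise(q.prompt, r_a, r_b, judge)
        # Map the verdict string onto a score key; anything else counts as a tie.
        key = 'A' if verdict.startswith('A') else 'B' if verdict.startswith('B') else 'tie'
        scores[key] += 1
    return scores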
```

### AlpacaEval (vs reference)
```python
def alpaca_eval(model, reference_model, judge, dataset):
    wins = 0
    for q in dataset:
        ours = model.generate(q)
        ref = reference_model.generate(q)
        verdict = pairwise_judge(q, ours, ref, judge)  # position A is our model
        if verdict.winner == 'A': wins += 1
    return wins / len(dataset)  # win rate against the reference
```
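
A win rate from a few hundred prompts carries real sampling noise, so it can help to report an interval next to the point estimate. The sketch below uses the Wilson score interval, which is a standard choice for binomial proportions. It is not part of AlpacaEval itself, and `win_rate_interval` is an illustrative name.

```python
import math

def win_rate_interval(wins, n, z=1.96):
    """Wilson score interval for a win rate of wins / n (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


low, high = win_rate_interval(wins=58, n=100)
print(f"win rate 0.58, interval [{low:.3f}, {high:.3f}]")
```

If the interval for model A overlaps 0.5, the comparison against the reference has not yet shown a clear winner at that sample size.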

### Length-controlled (mitigate length bias)
```python
def length_normalize(score, response_length):
    """Crude heuristic: shave the score when a very long response also scores very high."""
    if response_length > 1000 and score > 8:
        return score - 0.5  # conservative adjustment
    return score
```

### Cross-judge (multiple LLMs)
```python
from collections import Counter

def cross_judge(q, a, b, judges):
    """Majority vote across different judge LLMs to reduce self-preference."""
    votes = []
    for judge in judges:
        v = pairwise_judge(q, a, b, judge)
        votes.append(v.winner)
    return Counter(votes).most_common(1)[0][0]
```

### Calibrate against human
```python
def calibrate_judge(human_pairs, judge):
    """How often does the judge agree with the human labels?"""
    agreement = 0
    for pair, human_winner in human_pairs:
        judge_winner = pairwise_judge(pair.q, pair.a, pair.b, judge).winner
        if judge_winner == human_winner: agreement += 1
    return agreement / len(human_pairs)
# Rule of thumb: agreement above 0.8 is good
```
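
Raw agreement can look high simply because one label dominates. Cohen's kappa corrects for the agreement expected by chance. The sketch below assumes the judge and human labels have already been collected as two parallel lists; `cohens_kappa` is written out by hand here to keep the example dependency-free.

```python
from collections import Counter

def cohens_kappa(judge_labels, human_labels):
    """Chance-corrected agreement between two equally long label lists."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    judge_freq = Counter(judge_labels)
    human_freq = Counter(human_labels)
    # Agreement expected if both raters labeled independently at their own rates.
    expected = sum(judge_freq[c] * human_freq[c] for c in judge_freq) / (n * n)
    if expected == 1:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


judge = ["A", "A", "B", "tie", "B", "A"]
human = ["A", "B", "B", "tie", "B", "A"]
print(cohens_kappa(judge, human))
```

A kappa near 0 means the judge agrees with humans no more than chance would, even if the raw agreement rate looks respectable.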

### Constitutional principles judge
```python
def constitutional_check(response, principles, judge):
    violations = []
    for p in principles:
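        verdict = judge.generate(f'Does this violate "{p}"? Answer Yes or No.\n{response}')
        # Check only the start of the reply so a stray "yes" later on is not counted.
        if verdict.strip().lower().startswith('yes'): violations.append(p)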
    return violations
```

### LLM-judge for RLHF data
```python
def generate_preference_data(prompts, model, judge):
    pairs = []
    for p in prompts:
        a = model.generate(p, temperature=0.7)
        b = model.generate(p, temperature=0.7)
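        winner = pairwise_judge(p, a, b, judge).winner
        if winner == 'tie': continue  # no clear preference, so skip the pair
        chosen, rejected = (a, b) if winner == 'A' else (b, a)
        pairs.append({'prompt': p, 'chosen': chosen, 'rejected': rejected})
    return pairs  # feed into DPO training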
```

### Cost tracking
```python
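def cost_aware_eval(items, judge, max_cost=10):
    """Judge items until the next call would exceed the budget."""
    cost, scores = 0, []
    for item in items:
        item_cost = judge_cost(item, judge)  # hypothetical helper: estimated cost of one call
        if cost + item_cost > max_cost: break
        cost += item_cost
        scores.append(judge.generate(...))  # prompt construction elided
    return scores, cost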
```

### Prompt template
```yaml
JUDGE_PROMPT_TEMPLATE: |
  You are an impartial judge.
  Evaluate the response on:
  - Accuracy
  - Helpfulness
  - Safety
  - Clarity
  
  DO NOT be influenced by:
  - Length (don't favor longer)
  - Style (don't favor similar to your own)
  - Position (treat A and B equally)
  
  Question: {question}
  Response A: {response_a}
  Response B: {response_b}
  
  Output JSON: { winner, reason, scores: { A: {...}, B: {...} } }
```
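
One practical wrinkle with this template: it mixes `{question}`-style placeholders with literal JSON braces, so calling `str.format` on it directly would raise an error on the `{ winner, ... }` part. The sketch below fills only the named placeholders in a single pass and leaves every other brace untouched. `render_judge_prompt` is an illustrative helper, not part of any library.

```python
import re

def render_judge_prompt(template, **fields):
    """Fill only known {name} placeholders in one pass; other braces stay as-is."""
    if not fields:
        return template
    pattern = re.compile(r"\{(" + "|".join(map(re.escape, fields)) + r")\}")
    # A single regex pass means substituted text is never re-scanned, so a
    # response that happens to contain "{question}" is left alone.
    return pattern.sub(lambda m: str(fields[m.group(1)]), template)


template = "Question: {question}\nResponse A: {response_a}\nOutput JSON: { winner, reason }"
print(render_judge_prompt(template, question="What is 2 + 2?", response_a="4"))
```

Doing the substitution in one pass also avoids a subtle injection path where a model response could smuggle a placeholder into the judge prompt.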

## Decision criteria
| Situation | Approach |
|---|---|
| Quick eval | Pairwise + swap |
| Detailed | Rubric (G-Eval) |
| Production monitor | Single-answer score |
| RLHF data | Pairwise preferences |
| Cross-validate | Multiple judges |

**Default**: pairwise + swap + length normalization, cross-judge for important evaluations, calibration against a human-labeled sample, and a cost cap.

## 🔗 Graph
- Variants: [[MT-Bench]]
- Applications: [[RLHF]] · [[DPO]] · [[Hallucination-in-LLMs]]
- Adjacent: [[Foundation-Models]] · [[Iterative Prompting]] · [[Best-of-N_Sampling]]

## 🤖 LLM usage
**When to use**: LLM evaluation, RLHF data generation, monitoring.
**When not to use**: ground truth is available (use exact match instead).

## ❌ Anti-patterns
- **No swap**: position bias goes uncorrected.
- **Same-family judge**: invites self-preference.
- **No human calibration**: the judge is trusted blindly.
- **Single-shot judge**: one sample per item is noisy.
- **Ignoring the length effect**: length bias inflates scores.
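
For the single-shot problem in particular, a common remedy is to sample the judge several times and aggregate. The sketch below is one way to do that, assuming each call yields an `'A'` / `'B'` / `'tie'` label. `majority_verdict` and `repeated_judge` are illustrative names, and `judge_once` stands for any zero-argument callable that runs one judge call.

```python
from collections import Counter

def majority_verdict(votes):
    """Return the label with a strict majority, or 'tie' if there is none."""
    if not votes:
        return "tie"
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else "tie"


def repeated_judge(judge_once, n=5):
    """Call judge_once() n times and aggregate the labels by strict majority."""
    return majority_verdict([judge_once() for _ in range(n)])


print(majority_verdict(["A", "A", "B", "A", "tie"]))
```

Requiring a strict majority means a split vote is reported as a tie rather than being settled by whichever label happened to be counted first.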

## 🧪 Verification / Duplicates
- Verified (Zheng MT-Bench 2023, Liu G-Eval 2023, Dubois AlpacaEval).
- Trust level: A.

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: biases + pairwise / G-Eval / MT-Bench / cross-judge code |