---
id: wiki-2026-0508-rouge-metrics
title: ROUGE Metrics
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [ROUGE, ROUGE-1, ROUGE-2, ROUGE-L, Recall-Oriented Understudy for Gisting Evaluation]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [nlp, evaluation, summarization, metric]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: rouge-score
---

# ROUGE Metrics

## In one line

> **"Summary evaluation by n-gram overlap recall."** The classic summarization metric from Lin (2004). The core trio is R-1 (unigram), R-2 (bigram), and R-L (longest common subsequence). As of 2026 it is still the default reference baseline, supplemented by BERTScore, BARTScore, or an LLM judge for semantic evaluation.

## Core concepts

### ROUGE variants

- **ROUGE-N**: n-gram overlap recall. R-1 uses unigrams, R-2 uses bigrams.
- **ROUGE-L**: Longest Common Subsequence (LCS). Captures sentence-level word order and allows gaps between matched tokens.
- **ROUGE-W**: weighted LCS that favors consecutive matches.
- **ROUGE-S / ROUGE-SU**: skip-bigram overlap (SU adds unigrams). Captures in-order word pairs separated by gaps.
- **ROUGE-Lsum**: summary-level LCS. The text is split into sentences, LCS is computed per sentence, and the results are aggregated over the whole summary. In rouge-score, sentences are delimited by newlines.
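To make the LCS idea behind ROUGE-L concrete, here is a minimal sketch using only the standard library. It assumes lowercased whitespace tokenization with no stemming, so its numbers will not necessarily match rouge-score. The helper names `lcs_length` and `rouge_l` are illustrative, not part of any library.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # prev is the DP row for the tokens of `a` seen so far; cur is the row being built.
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(reference, candidate):
    """Sentence-level ROUGE-L (precision, recall, F1) on whitespace tokens."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return precision, recall, 2 * precision * recall / (precision + recall)


p, r, f = rouge_l("the cat sat on the mat", "the cat lay on the mat")
print(f"P={p:.2f} R={r:.2f} F1={f:.2f}")
```

Because the subsequence may skip tokens, the single substituted word ("sat" vs. "lay") costs one match but does not break the rest of the alignment.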

### Formula

- Recall = matches / |reference n-grams|
- Precision = matches / |candidate n-grams|
- F1 = 2·P·R / (P+R)
- The original paper focuses on recall; modern usage typically reports F1.
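The formulas above can be sketched in a few lines of Python. This is an illustrative implementation, not the rouge-score one: it assumes lowercased whitespace tokens and no stemming, and the names `ngrams` and `rouge_n` are made up for this example. Matches are clipped, so a repeated n-gram counts at most as often as it appears on the other side.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, candidate, n=1):
    """ROUGE-N (precision, recall, F1) on lowercased whitespace tokens."""
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    # Counter intersection keeps the minimum count per n-gram (clipped matches).
    matches = sum((ref & cand).values())
    recall = matches / sum(ref.values()) if ref else 0.0
    precision = matches / sum(cand.values()) if cand else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matches else 0.0
    return precision, recall, f1


reference = "the cat sat on the mat"
candidate = "the cat sat on a mat today"
for n in (1, 2):
    p, r, f = rouge_n(reference, candidate, n)
    print(f"ROUGE-{n}: P={p:.3f} R={r:.3f} F1={f:.3f}")
```

The candidate is longer than the reference here, so precision falls below recall; reporting F1 balances the two.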
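
### ROUGE vs BLEU

- BLEU is precision-oriented and was designed for machine translation. ROUGE is recall-oriented and was designed for summarization.
- BLEU needs a brevity penalty to discourage short outputs. ROUGE has none, because recall already penalizes missing content.
- BLEU takes a geometric mean of n-gram precisions at corpus level. ROUGE is typically computed per example and then averaged.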
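A small sketch of why the brevity penalty exists: a very short candidate can reach perfect precision, which BLEU has to correct explicitly, while recall drops on its own. The brevity penalty below follows the standard BLEU definition (1 when the candidate is longer than the reference, otherwise exp(1 - r/c)). The function names are illustrative, and the inputs are assumed to be non-empty whitespace-tokenized strings.

```python
import math
from collections import Counter


def brevity_penalty(candidate_len, reference_len):
    """BLEU brevity penalty for candidate length c and reference length r."""
    if candidate_len == 0:
        return 0.0
    if candidate_len > reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)


def unigram_precision_recall(reference, candidate):
    """Clipped unigram precision and recall (assumes both strings are non-empty)."""
    ref, cand = Counter(reference.split()), Counter(candidate.split())
    matches = sum((ref & cand).values())
    return matches / sum(cand.values()), matches / sum(ref.values())


reference = "the cat sat on the mat in the morning"
candidate = "the cat"  # very short, but every token matches
precision, recall = unigram_precision_recall(reference, candidate)
bp = brevity_penalty(len(candidate.split()), len(reference.split()))
print(f"precision={precision:.2f} recall={recall:.2f} brevity_penalty={bp:.3f}")
```

Precision alone would rate this two-word output as perfect; both the low recall and the small brevity penalty flag it as too short.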

### Limitations

- Surface-level: synonyms and paraphrases are penalized.
- Does not capture fluency or factuality.
- Sensitive to tokenization (BPE vs. word-level).
- Reference-dependent: with a single reference, scores have high variance.
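The first two limitations can be seen with a toy example. The sketch below uses a plain unigram-overlap F1 as a rough stand-in for ROUGE-1 (whitespace tokens, no stemming); the sentences and the helper name `unigram_f1` are invented for illustration.

```python
from collections import Counter


def unigram_f1(reference, candidate):
    """Unigram-overlap F1 on lowercased whitespace tokens (rough ROUGE-1 stand-in)."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    matches = sum((ref & cand).values())
    if matches == 0:
        return 0.0
    precision = matches / sum(cand.values())
    recall = matches / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "the company reported a profit in the third quarter"
paraphrase = "the firm posted earnings for q3"  # same meaning, different words
contradiction = "the company reported a loss in the third quarter"  # opposite meaning

print(f"paraphrase:    {unigram_f1(reference, paraphrase):.2f}")
print(f"contradiction: {unigram_f1(reference, contradiction):.2f}")
```

The faithful paraphrase scores far below the contradictory near-copy, which is why lexical overlap should not be read as a measure of meaning or factuality.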

### Applications

1. Summarization eval (CNN/DM, XSum, Gigaword).
2. Long-doc summarization (arXiv, GovReport, BookSum 2026).
3. RAG answer eval (vs gold answer).
4. LLM eval reporting (still common alongside BERTScore + LLM-judge).

## 💻 Patterns

### rouge-score basic

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(
    ["rouge1", "rouge2", "rougeL", "rougeLsum"],
    use_stemmer=True,
)
ref = "the cat sat on the mat in the morning"
hyp = "a cat was sitting on the mat"
# score() takes the reference (target) first, then the prediction.
scores = scorer.score(ref, hyp)
# {'rouge1': Score(precision=0.57, recall=0.44, fmeasure=0.50), ...}
print(scores["rougeL"].fmeasure)
```

### HuggingFace evaluate

```python
import evaluate

rouge = evaluate.load("rouge")
results = rouge.compute(
    predictions=["the cat sat on the mat"],
    references=["a cat was sitting on the mat"],
    use_stemmer=True,
    use_aggregator=True,
)
# Values are F-measures, roughly:
# {'rouge1': 0.62, 'rouge2': 0.36, 'rougeL': 0.62, 'rougeLsum': 0.62}
```

### Batch eval on dataset

```python
from datasets import load_dataset
import evaluate

ds = load_dataset("cnn_dailymail", "3.0.0", split="test[:100]")
rouge = evaluate.load("rouge")

# `model` is a placeholder for your own summarizer; it is not defined here.
preds = [model.summarize(x["article"]) for x in ds]
refs = [x["highlights"] for x in ds]

results = rouge.compute(predictions=preds, references=refs, use_stemmer=True)
print(f"R-1: {results['rouge1']:.3f} R-2: {results['rouge2']:.3f} R-L: {results['rougeL']:.3f}")
```

### Tokenization-aware (multilingual)

```python
# The default rouge-score tokenizer and stemmer target English.
# For other languages, plug in a language-specific tokenizer
# (e.g. SentencePiece or a morphological analyzer).
from rouge_score import rouge_scorer
from rouge_score.tokenizers import Tokenizer
import fugashi


class JaTokenizer(Tokenizer):
    def __init__(self):
        # Build the tagger once instead of on every tokenize() call.
        self.tagger = fugashi.Tagger()

    def tokenize(self, text):
        return [w.surface for w in self.tagger(text)]


scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], tokenizer=JaTokenizer())
```

### ROUGE alongside semantic metrics (2026 best practice)

```python
import evaluate

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

# preds and refs are parallel lists of strings, e.g. from a batch evaluation loop.
r = rouge.compute(predictions=preds, references=refs, use_stemmer=True)
b = bertscore.compute(predictions=preds, references=refs, lang="en", model_type="microsoft/deberta-xlarge-mnli")

# BERTScore returns one F1 per example, so average them for a corpus-level number.
print(f"R-L: {r['rougeL']:.3f} BERTScore-F1: {sum(b['f1'])/len(b['f1']):.3f}")
```

### LLM-as-judge supplement

```python
from anthropic import Anthropic

client = Anthropic()


def judge(article, summary):
    msg = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=200,
        messages=[{"role": "user", "content": f"""
Rate this summary 1-5 on faithfulness and coverage.
Article: {article}
Summary: {summary}
Output JSON: {{"faithfulness": int, "coverage": int}}
"""}],
    )
    return msg.content[0].text
```
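
The judge returns free text, so the JSON usually has to be extracted and validated before it can be aggregated. A minimal sketch, assuming the reply contains one flat JSON object with the two keys requested in the prompt; `parse_judge_output` is a hypothetical helper, not part of the Anthropic SDK.

```python
import json
import re


def parse_judge_output(text):
    """Extract {"faithfulness": int, "coverage": int} from a judge reply, or return None."""
    # Take the first flat {...} object; surrounding prose is ignored.
    match = re.search(r"\{.*?\}", text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    scores = {}
    for key in ("faithfulness", "coverage"):
        value = data.get(key)
        # Reject missing keys, non-integers (including bools), and ratings outside 1-5.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            return None
        scores[key] = value
    return scores


reply = 'Here is my rating:\n{"faithfulness": 4, "coverage": 3}'
print(parse_judge_output(reply))
```

Returning `None` for anything malformed keeps bad replies out of the averages; counting how often that happens is itself a useful health signal for the judge prompt.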

## Decision criteria

| Situation | Approach |
|---|---|
| Quick lexical baseline | ROUGE-1 + ROUGE-L F1 |
| Summarization paper | Report R-1, R-2, and R-L |
| Semantic eval needed | BERTScore and ROUGE together |
| Factuality matters | LLM-judge or QAGS, not ROUGE |
| Multilingual | Plug in a language-specific tokenizer |
| Production monitoring | ROUGE-L + BERTScore + sampled LLM-judge |

**Default**: report ROUGE-L F1 and ROUGE-1 F1, add BERTScore for semantic similarity, and use an LLM judge for nuance. ROUGE on its own is a dated practice; the 2026 norm is multi-metric reporting.

## 🔗 Graph

- Parent: [[Summarization]]
- Variant: [[ROUGE-L]]
- Adjacent: [[LLM-as-Judge]]

## 🤖 LLM usage

**When**: writing eval scripts, explaining ROUGE variants, generating reference summaries for testing.
**When not**: as the metric itself. ROUGE is deterministic and needs no LLM; use an LLM judge as a separate, complementary metric.

## ❌ Anti-patterns

- **ROUGE-only eval**: penalizes synonyms and misses semantic equivalence.
- **No stemming**: "running" vs. "runs" counts as a mismatch (false negative).
- **Single reference**: high variance; prefer multiple references where available. CNN/DM, XSum, and GovReport each provide only one reference per document, which is a limitation.
- **ROUGE for QA / dialog**: these tasks are not summarization-shaped; use a task-specific metric.
- **ROUGE for factuality**: ROUGE only sees surface overlap, so a hallucinated summary can still score high if its words overlap with the reference.
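
When several references are available, one common convention is to score the candidate against each reference and keep the best match. The sketch below assumes that convention and uses a simple unigram F1 as the per-reference score; `multi_reference_score` and `unigram_f1` are illustrative names, and any per-pair ROUGE function could be passed as `score_fn` instead.

```python
from collections import Counter


def unigram_f1(reference, candidate):
    """Unigram-overlap F1 on lowercased whitespace tokens."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    matches = sum((ref & cand).values())
    if matches == 0:
        return 0.0
    precision = matches / sum(cand.values())
    recall = matches / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def multi_reference_score(references, candidate, score_fn=unigram_f1):
    """Score the candidate against every reference and keep the best match."""
    return max(score_fn(ref, candidate) for ref in references)


references = [
    "the cat sat on the mat",
    "a cat was sitting on the mat",
]
candidate = "a cat was on the mat"
for ref in references:
    print(f"{unigram_f1(ref, candidate):.3f}  vs  {ref!r}")
print(f"best: {multi_reference_score(references, candidate):.3f}")
```

With only the first reference the candidate would look mediocre; the second reference shows it is close to an acceptable summary, which is the variance that a single reference hides.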

## 🧪 Verification / duplicates

- Verified against Lin (2004), "ROUGE: A Package for Automatic Evaluation of Summaries" (ACL), rouge-score 0.1.x, and HF evaluate (2026).
- Trust level: A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: ROUGE variants, rouge-score patterns, 2026 multi-metric guidance |