---
id: wiki-2026-0508-rouge-metrics
title: ROUGE Metrics
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [ROUGE, ROUGE-1, ROUGE-2, ROUGE-L, Recall-Oriented Understudy for Gisting Evaluation]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [nlp, evaluation, summarization, metric]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: rouge-score
---

# ROUGE Metrics

## One-line summary

> **"N-gram overlap recall for summaries."** Introduced by Lin (2004), ROUGE is the classic summarization evaluation metric, usually reported as the trio R-1 (unigram), R-2 (bigram), and R-L (longest common subsequence). As of 2026 it is still the default reference-based baseline, supplemented by BERTScore, BARTScore, or an LLM judge for semantic evaluation.

## Core concepts

### ROUGE variants

- **ROUGE-N**: n-gram overlap recall. R-1 uses unigrams, R-2 uses bigrams.
- **ROUGE-L**: Longest Common Subsequence (LCS). Captures sentence-level word order while allowing gaps between matched words.
- **ROUGE-W**: weighted LCS that favors consecutive matches.
- **ROUGE-S / ROUGE-SU**: skip-bigram overlap (SU adds unigrams). Captures word pairs that appear in order but not adjacently.
- **ROUGE-Lsum**: summary-level LCS. The summary is split into sentences and LCS matches are aggregated across sentences (in `rouge-score`, sentences are delimited by newlines).

### Formula

- Recall = matches / |reference n-grams|
- Precision = matches / |candidate n-grams|
- F1 = 2·P·R / (P + R)
- The original paper focuses on recall; modern usage typically reports F1.

### vs BLEU

- BLEU is precision-oriented and designed for machine translation; ROUGE is recall-oriented and designed for summarization.
- BLEU applies a brevity penalty; ROUGE has none, because recall already penalizes candidates that are too short.
- BLEU is a corpus-level geometric mean of n-gram precisions; ROUGE is typically computed per example and then averaged.

### Limitations

- Surface-level: synonyms and paraphrases are penalized.
- Does not capture fluency or factuality.
- Sensitive to tokenization (BPE vs. word-level).
- Reference-dependent: scores have high variance when only one reference is available.

### Applications

1. Summarization eval (CNN/DM, XSum, Gigaword).
2. Long-doc summarization (arXiv, GovReport, BookSum).
3. RAG answer eval (vs gold answer).
4. LLM eval reporting (still common alongside BERTScore + LLM-judge).

## 💻 Patterns

### rouge-score basic

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(
    ["rouge1", "rouge2", "rougeL", "rougeLsum"],
    use_stemmer=True,
)

ref = "the cat sat on the mat in the morning"
hyp = "a cat was sitting on the mat"

# score(target, prediction): the reference comes first
scores = scorer.score(ref, hyp)
# rouge1 matches cat/on/the/mat: precision = 4/7 ≈ 0.571, recall = 4/9 ≈ 0.444, F1 = 0.5
print(scores["rougeL"].fmeasure)
```

### HuggingFace evaluate

```python
import evaluate

rouge = evaluate.load("rouge")
results = rouge.compute(
    predictions=["the cat sat on the mat"],
    references=["a cat was sitting on the mat"],
    use_stemmer=True,
    use_aggregator=True,
)
# F1 scores, approximately:
# {'rouge1': 0.615, 'rouge2': 0.364, 'rougeL': 0.615, 'rougeLsum': 0.615}
```

### Batch eval on dataset

```python
from datasets import load_dataset
import evaluate

ds = load_dataset("cnn_dailymail", "3.0.0", split="test[:100]")
rouge = evaluate.load("rouge")

# `model` is a placeholder for your own summarizer exposing .summarize(text)
preds = [model.summarize(x["article"]) for x in ds]
refs = [x["highlights"] for x in ds]

results = rouge.compute(predictions=preds, references=refs, use_stemmer=True)
print(f"R-1: {results['rouge1']:.3f} R-2: {results['rouge2']:.3f} R-L: {results['rougeL']:.3f}")
```

### Tokenization-aware (multilingual)

```python
# The default rouge-score tokenizer and stemmer target English.
# For other languages, plug in a language-specific tokenizer
# (sentencepiece, a morphological analyzer, etc.).
import fugashi
from rouge_score import rouge_scorer
from rouge_score.tokenizers import Tokenizer

class JaTokenizer(Tokenizer):
    def __init__(self):
        # Build the tagger once instead of on every tokenize() call
        self.tagger = fugashi.Tagger()

    def tokenize(self, text):
        return [w.surface for w in self.tagger(text)]

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], tokenizer=JaTokenizer())
```

### ROUGE alongside semantic metrics (2026 best practice)

```python
import evaluate

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

# `preds` and `refs` are lists of strings, as in the batch eval pattern
r = rouge.compute(predictions=preds, references=refs, use_stemmer=True)
b = bertscore.compute(
    predictions=preds,
    references=refs,
    lang="en",
    model_type="microsoft/deberta-xlarge-mnli",
)
# BERTScore returns per-example F1 values, so average them
print(f"R-L: {r['rougeL']:.3f} BERTScore-F1: {sum(b['f1'])/len(b['f1']):.3f}")
```

### LLM-as-judge supplement

```python
from anthropic import Anthropic

client = Anthropic()

def judge(article, summary):
    msg = client.messages.create(
        model="claude-opus-4-7",  # model name as given in this note; substitute one you have access to
        max_tokens=200,
        messages=[{"role": "user", "content": f"""
Rate this summary 1-5 on faithfulness and coverage.
Article: {article}
Summary: {summary}
Output JSON: {{"faithfulness": int, "coverage": int}}
"""}],
    )
    # Raw model text; the caller is expected to parse the JSON
    return msg.content[0].text
```

## Decision criteria

| Situation | Approach |
|---|---|
| Quick lexical baseline | ROUGE-1 + ROUGE-L F1 |
| Summarization paper | Report R-1, R-2, and R-L |
| Semantic eval needed | BERTScore + ROUGE both |
| Factuality matters | LLM-judge or QAGS, not ROUGE |
| Multilingual | Plug in a language-specific tokenizer |
| Production monitoring | ROUGE-L + BERTScore + sampled LLM-judge |

**Default**: report ROUGE-L F1 and ROUGE-1 F1, supplement with BERTScore for semantic similarity, and add an LLM judge for nuance. ROUGE alone is 2004-era practice; 2026 practice is multi-metric.

## 🔗 Graph

- Parent: [[Summarization]]
- Variant: [[ROUGE-L]]
- Adjacent: [[LLM-as-Judge]]

## 🤖 LLM usage

**When**: writing eval scripts, explaining ROUGE variants, generating reference summaries for testing.

**When not**: as the metric itself. ROUGE is deterministic and needs no LLM. Use an LLM judge as a separate, complementary metric.

## ❌ Anti-patterns

- **ROUGE-only eval**: penalizes synonyms and misses semantic equivalence.
- **No stemming**: "running" vs "runs" counts as a false negative.
- **Single reference**: high variance; prefer multiple references where they exist. CNN/DM, XSum, and GovReport each provide only one reference, which is a known limitation.
- **ROUGE for QA / dialog**: these tasks are not summarization-shaped; use a task-specific metric.
- **ROUGE for factuality**: ROUGE is surface-only, so a hallucinated summary can still score high if its words overlap with the reference.

## 🧪 Verification / duplicates

- Verified (Lin 2004 ACL "ROUGE: A Package for Automatic Evaluation of Summaries", rouge-score 0.1.x, HF evaluate 2026).
- Trust level A.

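
As a cross-check on the recall, precision, and F1 definitions listed earlier, the following is a minimal from-scratch sketch of ROUGE-N and ROUGE-L. The helper names (`ngrams`, `prf`, `rouge_n`, `lcs_length`, `rouge_l`) are illustrative and not part of any library, and the sketch assumes lowercase whitespace tokenization with no stemming, so its numbers will not always match `rouge-score`.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def prf(matches, ref_total, cand_total):
    """Precision, recall, and F1 from a match count and the two totals."""
    recall = matches / ref_total if ref_total else 0.0
    precision = matches / cand_total if cand_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def rouge_n(reference, candidate, n=1):
    """ROUGE-N with clipped counts (assumption: whitespace tokens, no stemming)."""
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    matches = sum((ref & cand).values())  # Counter & keeps the per-n-gram minimum
    return prf(matches, sum(ref.values()), sum(cand.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate):
    """Sentence-level ROUGE-L: LCS length over reference and candidate lengths."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    return prf(lcs_length(ref, cand), len(ref), len(cand))


ref = "the cat sat on the mat"
hyp = "the cat lay on a mat"
p1, r1, f1 = rouge_n(ref, hyp, n=1)
p2, r2, f2 = rouge_n(ref, hyp, n=2)
pl, rl, fl = rouge_l(ref, hyp)
print(f"R-1 F1: {f1:.3f}  R-2 F1: {f2:.3f}  R-L F1: {fl:.3f}")
# R-1 F1: 0.667  R-2 F1: 0.200  R-L F1: 0.667
```

Word order is what separates R-L from R-1: scoring the scrambled candidate "on the mat the cat sat" against the same reference gives a perfect R-1 (every unigram matches) but an LCS of only 3 out of 6 tokens, so R-L drops to 0.5.
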
## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: ROUGE variants, rouge-score patterns, 2026 multi-metric guidance |