id: wiki-2026-0508-rouge-metrics
title: ROUGE Metrics
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: ROUGE, ROUGE-1, ROUGE-2, ROUGE-L, Recall-Oriented Understudy for Gisting Evaluation
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: nlp, evaluation, summarization, metric
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language: python; framework: rouge-score

ROUGE Metrics

In one line

"Summary evaluation by n-gram overlap recall." ROUGE (Lin, 2004) is the classic metric for summarization evaluation, usually reported as the trio R-1 (unigram), R-2 (bigram), and R-L (longest common subsequence). As of 2026 it is still the default reference baseline, supplemented by BERTScore, BARTScore, or an LLM judge for semantic evaluation.

Key points

ROUGE variants

  • ROUGE-N: n-gram overlap recall. R-1 uses unigrams, R-2 uses bigrams.
  • ROUGE-L: Longest Common Subsequence (LCS). Captures sentence-level word order while allowing gaps between matched words.
  • ROUGE-W: weighted LCS that favors consecutive matches.
  • ROUGE-S / ROUGE-SU: skip-bigram overlap (SU adds unigrams). Captures in-order word pairs separated by gaps.
  • ROUGE-Lsum: summary-level LCS. The summary is split into sentences (rouge-score splits on newlines) and the LCS matches are computed per reference sentence and combined.
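The LCS idea behind ROUGE-L can be sketched in a few lines of plain Python. This is a minimal illustration, not the rouge-score implementation: it assumes lowercase whitespace tokenization and no stemming, and the helper names `lcs_length` and `rouge_l` are made up for this example.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # dp[i][j] holds the LCS length of a[:i] and b[:j].
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def rouge_l(reference, candidate):
    """Return (precision, recall, f1) for sentence-level ROUGE-L."""
    # Assumption: simple lowercase + whitespace tokenization.
    ref, cand = reference.lower().split(), candidate.lower().split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return precision, recall, 2 * precision * recall / (precision + recall)


# The matched words keep their order but need not be adjacent:
# "the cat sat on ... mat" is a common subsequence of length 5.
p, r, f = rouge_l("the cat sat on the mat", "the cat quietly sat on a mat")
print(f"P={p:.3f} R={r:.3f} F1={f:.3f}")  # P=0.714 R=0.833 F1=0.769
```

Because gaps are allowed, the inserted words "quietly" and "a" lower precision but do not break the match, which is the structural tolerance that ROUGE-N with n > 1 lacks.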

Formulas

  • Recall = matches / |reference n-grams|
  • Precision = matches / |candidate n-grams|
  • F1 = 2·P·R / (P+R)
  • The original paper focuses on recall; modern usage typically reports F1.
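To make the formulas concrete, here is a small from-scratch ROUGE-N sketch. It assumes lowercase whitespace tokenization and clipped (minimum-count) matching; `ngrams` and `rouge_n` are illustrative names, not part of any library.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, candidate, n=1):
    """Return (precision, recall, f1) using clipped n-gram matches."""
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    matches = sum((ref & cand).values())
    precision = matches / max(sum(cand.values()), 1)
    recall = matches / max(sum(ref.values()), 1)
    f1 = 0.0 if matches == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


ref = "the cat sat on the mat"
hyp = "the cat lay on the mat"
print(rouge_n(ref, hyp, n=1))  # 5 of 6 unigrams match on each side
print(rouge_n(ref, hyp, n=2))  # 3 of 5 bigrams match on each side
```

A single substituted word costs one unigram but two bigrams, which is why R-2 drops faster than R-1 under small edits.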

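ROUGE vs BLEU

  • BLEU is precision-oriented and was designed for machine translation; ROUGE is recall-oriented and was designed for summarization.
  • BLEU applies a brevity penalty; ROUGE has none, because recall already penalizes candidates that are too short.
  • BLEU is a corpus-level geometric mean of n-gram precisions; ROUGE is typically computed per example and then averaged.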
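The brevity point can be illustrated with a very short candidate. The sketch below assumes whitespace tokenization and a non-empty candidate; `unigram_pr` and `brevity_penalty` are illustrative helpers, and only the unigram case is shown rather than full BLEU.

```python
import math
from collections import Counter


def unigram_pr(reference, candidate):
    """Clipped unigram precision and recall (candidate must be non-empty)."""
    ref, cand = Counter(reference.split()), Counter(candidate.split())
    matches = sum((ref & cand).values())
    return matches / sum(cand.values()), matches / sum(ref.values())


def brevity_penalty(ref_len, cand_len):
    """BLEU-style penalty: 1 if the candidate is long enough, else exp(1 - r/c)."""
    return 1.0 if cand_len >= ref_len else math.exp(1 - ref_len / cand_len)


reference = "the cat sat on the mat in the morning"
candidate = "the cat sat"

precision, recall = unigram_pr(reference, candidate)
bp = brevity_penalty(len(reference.split()), len(candidate.split()))
print(f"precision={precision:.2f} recall={recall:.2f} BP={bp:.3f}")
# precision=1.00 recall=0.33 BP=0.135
```

Precision alone rewards the truncated output, so a precision-based metric needs an explicit length penalty, while recall drops on its own.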

Limitations

  • Surface-level matching: synonyms and paraphrases are penalized.
  • Does not capture fluency or factuality.
  • Sensitive to tokenization (e.g. BPE vs word-level).
  • Reference-dependent: scores computed against a single reference have high variance.
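The synonym and single-reference limitations show up even on toy inputs. The sketch below assumes whitespace tokenization and takes the maximum over references, which is one common aggregation choice and not the only one; `rouge1_f1` and `multi_ref_rouge1` are illustrative names.

```python
from collections import Counter


def rouge1_f1(reference, candidate):
    """ROUGE-1 F1 with clipped unigram matches."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    matches = sum((ref & cand).values())
    if matches == 0:
        return 0.0
    p, r = matches / sum(cand.values()), matches / sum(ref.values())
    return 2 * p * r / (p + r)


def multi_ref_rouge1(references, candidate):
    """Best ROUGE-1 F1 over several references (assumed aggregation: max)."""
    return max(rouge1_f1(ref, candidate) for ref in references)


candidate = "the film was excellent"
refs = ["the movie was great", "the film was excellent overall"]

# Same meaning, different words: only "the" and "was" match.
print(round(rouge1_f1(refs[0], candidate), 3))        # 0.5
# A second reference that happens to share wording lifts the score.
print(round(multi_ref_rouge1(refs, candidate), 3))    # 0.889
```

The candidate did not change between the two calls; only the choice of reference did, which is the variance the last bullet refers to.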

Applications

  1. Summarization eval (CNN/DM, XSum, Gigaword).
  2. Long-doc summarization (arXiv, GovReport, BookSum 2026).
  3. RAG answer eval (vs gold answer).
  4. LLM eval reporting (still common alongside BERTScore + LLM-judge).

💻 Patterns

rouge-score basic

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(
    ["rouge1", "rouge2", "rougeL", "rougeLsum"],
    use_stemmer=True,
)
ref = "the cat sat on the mat in the morning"
hyp = "a cat was sitting on the mat"
scores = scorer.score(ref, hyp)  # argument order is (target, prediction)
# rouge1: 4 shared unigrams -> precision 4/7 ≈ 0.57, recall 4/9 ≈ 0.44, fmeasure 0.50
print(scores["rougeL"].fmeasure)

HuggingFace evaluate

import evaluate
rouge = evaluate.load("rouge")
results = rouge.compute(
    predictions=["the cat sat on the mat"],
    references=["a cat was sitting on the mat"],
    use_stemmer=True,
    use_aggregator=True,
)
# F1 values, roughly {'rouge1': 0.62, 'rouge2': 0.36, 'rougeL': 0.62, 'rougeLsum': 0.62}

Batch eval on dataset

from datasets import load_dataset
import evaluate

ds = load_dataset("cnn_dailymail", "3.0.0", split="test[:100]")
rouge = evaluate.load("rouge")

# `model` is a placeholder for your own summarizer exposing summarize(text) -> str
preds = [model.summarize(x["article"]) for x in ds]
refs = [x["highlights"] for x in ds]

results = rouge.compute(predictions=preds, references=refs, use_stemmer=True)
print(f"R-1: {results['rouge1']:.3f}  R-2: {results['rouge2']:.3f}  R-L: {results['rougeL']:.3f}")

Tokenization-aware (multilingual)

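# rouge-score's default tokenizer keeps only [a-z0-9] tokens and its stemmer is English-only.
# For other languages, plug in a language-specific tokenizer (sentencepiece, a morphological analyzer, ...).
import fugashi  # Japanese tokenizer; needs a dictionary such as unidic-lite
from rouge_score import rouge_scorer
from rouge_score.tokenizers import Tokenizer

class JaTokenizer(Tokenizer):
    def __init__(self):
        # Build the tagger once instead of on every tokenize() call
        self.tagger = fugashi.Tagger()

    def tokenize(self, text):
        return [w.surface for w in self.tagger(text)]

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], tokenizer=JaTokenizer())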

ROUGE alongside semantic metrics (2026 best practice)

import evaluate
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

# preds / refs are lists of strings, e.g. as built in the batch example above
r = rouge.compute(predictions=preds, references=refs, use_stemmer=True)
b = bertscore.compute(predictions=preds, references=refs, lang="en", model_type="microsoft/deberta-xlarge-mnli")

print(f"R-L: {r['rougeL']:.3f}  BERTScore-F1: {sum(b['f1'])/len(b['f1']):.3f}")

LLM-as-judge supplement

from anthropic import Anthropic
client = Anthropic()

def judge(article, summary):
    msg = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=200,
        messages=[{"role": "user", "content": f"""
Rate this summary 1-5 on faithfulness and coverage.
Article: {article}
Summary: {summary}
Output JSON: {{"faithfulness": int, "coverage": int}}
"""}],
    )
    return msg.content[0].text
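The judge returns free text, so the JSON it was asked for still has to be extracted and validated before it can be aggregated. A minimal sketch follows, assuming the reply contains one flat JSON object; `parse_judge_output` is a hypothetical helper, not part of the Anthropic SDK.

```python
import json
import re


def parse_judge_output(text):
    """Extract {"faithfulness": int, "coverage": int} from a judge reply.

    Returns None when no valid object with both 1-5 integer scores is found.
    """
    # Assumption: the object is flat, so the first {...} span is the whole object.
    match = re.search(r"\{.*?\}", text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    scores = {}
    for key in ("faithfulness", "coverage"):
        value = data.get(key)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            return None
        scores[key] = value
    return scores


reply = 'Here is my rating:\n{"faithfulness": 4, "coverage": 3}'
print(parse_judge_output(reply))  # {'faithfulness': 4, 'coverage': 3}
```

Returning None instead of raising lets a batch job count and skip malformed replies rather than abort the whole evaluation run.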

Decision criteria

| Situation | Approach |
|---|---|
| Quick lexical baseline | ROUGE-1 + ROUGE-L F1 |
| Summarization paper report | R-1, R-2, and R-L together |
| Semantic eval needed | BERTScore and ROUGE together |
| Factuality matters | LLM-judge or QAGS, not ROUGE |
| Multilingual | Plug in a language-specific tokenizer |
| Production monitoring | ROUGE-L + BERTScore + sampled LLM-judge |

Default: report ROUGE-L F1 and ROUGE-1 F1, add BERTScore for semantic similarity, and add an LLM judge for nuance. Relying on ROUGE alone is dated practice; in 2026 evaluation is multi-metric.

🔗 Graph

🤖 Using LLMs

When to use: writing eval scripts, explaining ROUGE variants, generating reference summaries for testing. When not to use: as the metric itself. ROUGE is deterministic and needs no LLM; use an LLM judge as a separate, complementary metric.

Anti-patterns

  • ROUGE-only eval: penalizes synonyms and misses semantic equivalence.
  • No stemming: "running" vs "runs" is counted as a mismatch (a false negative).
  • Single reference: high variance; prefer multiple references where available. CNN/DM, XSum, and GovReport each provide only one reference per document, which is a limitation.
  • ROUGE for QA / dialog: these tasks are not summarization-shaped; use a task-specific metric.
  • ROUGE for factuality: ROUGE is surface-only, so a hallucinated summary can still score high when its word overlap is high.
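The factuality anti-pattern is easy to reproduce: a summary that flips one key word can outscore a faithful paraphrase. The sketch below uses a from-scratch ROUGE-1 F1 with whitespace tokenization; the sentences are invented for illustration and `rouge1_f1` is an illustrative helper.

```python
from collections import Counter


def rouge1_f1(reference, candidate):
    """ROUGE-1 F1 with clipped unigram matches."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    matches = sum((ref & cand).values())
    if matches == 0:
        return 0.0
    p, r = matches / sum(cand.values()), matches / sum(ref.values())
    return 2 * p * r / (p + r)


reference = "the company reported a profit in the third quarter"
faithful = "the firm made money in q3"                             # correct, reworded
hallucinated = "the company reported a loss in the third quarter"  # wrong fact

print(round(rouge1_f1(reference, faithful), 3))      # 0.267
print(round(rouge1_f1(reference, hallucinated), 3))  # 0.889
```

The factually wrong summary wins by a wide margin, which is why a faithfulness check has to come from a different kind of metric.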

🧪 Verification / Duplicates

  • Verified (Lin 2004 ACL "ROUGE: A Package for Automatic Evaluation of Summaries", rouge-score 0.1.x, HF evaluate 2026).
  • Trust level: A.

🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: ROUGE variants, rouge-score patterns, 2026 multi-metric guidance |