"A summary metric built on n-gram overlap recall." ROUGE is the classic summarization evaluation metric from Lin (2004). It is usually reported as a trio: R-1 (unigram overlap), R-2 (bigram overlap), and R-L (longest common subsequence). As of 2026 it is still the default reference baseline, supplemented by BERTScore, BARTScore, or an LLM-judge for semantic evaluation.
Typical use: LLM eval reporting, where ROUGE is still commonly reported alongside BERTScore and an LLM-judge.
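To make the R-1 / R-2 / R-L definitions above concrete, here is a minimal from-scratch sketch. It assumes lowercased whitespace tokenization with no stemming and a balanced F1, so its numbers can differ slightly from a real package; the function names (`ngrams`, `prf`, `rouge_n`, `rouge_l`) are illustrative and not part of any library.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def prf(overlap, hyp_total, ref_total):
    """Turn an overlap count into (precision, recall, F1)."""
    p = overlap / hyp_total if hyp_total else 0.0
    r = overlap / ref_total if ref_total else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def rouge_n(ref, hyp, n=1):
    """ROUGE-N: clipped n-gram overlap between reference and hypothesis."""
    ref_ng = ngrams(ref.lower().split(), n)
    hyp_ng = ngrams(hyp.lower().split(), n)
    overlap = sum((ref_ng & hyp_ng).values())  # min count per shared n-gram
    return prf(overlap, sum(hyp_ng.values()), sum(ref_ng.values()))


def rouge_l(ref, hyp):
    """ROUGE-L: precision/recall/F1 over the longest common subsequence."""
    a, b = ref.lower().split(), hyp.lower().split()
    # Classic O(len(a) * len(b)) LCS dynamic program.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return prf(dp[-1][-1], len(b), len(a))
```

For the reference "the cat sat on the mat in the morning" and the hypothesis "a cat was sitting on the mat", four unigrams overlap (cat, on, the, mat), so R-1 precision is 4/7, recall is 4/9, and F1 is 0.5. The LCS is the same four tokens, so R-L matches R-1 here.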
💻 Patterns
rouge-score basic
    from rouge_score import rouge_scorer

    scorer = rouge_scorer.RougeScorer(
        ["rouge1", "rouge2", "rougeL", "rougeLsum"],
        use_stemmer=True,
    )

    ref = "the cat sat on the mat in the morning"
    hyp = "a cat was sitting on the mat"

    # score() takes (target, prediction): the reference comes first.
    scores = scorer.score(ref, hyp)
    # 4 unigrams overlap (cat, on, the, mat), so rouge1 is roughly
    # Score(precision=0.57, recall=0.44, fmeasure=0.50)
    print(scores["rougeL"].fmeasure)
HuggingFace evaluate
    import evaluate

    rouge = evaluate.load("rouge")
    results = rouge.compute(
        predictions=["the cat sat on the mat"],
        references=["a cat was sitting on the mat"],
        use_stemmer=True,
        use_aggregator=True,
    )
    # Approximately {'rouge1': 0.62, 'rouge2': 0.36, 'rougeL': 0.62, 'rougeLsum': 0.62}
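Multilingual tokenizer
    # The default rouge-score stemmer is English-only.
    # For other languages, pre-tokenize with sentencepiece or plug in a
    # language-specific tokenizer, as below for Japanese.
    import fugashi
    from rouge_score import rouge_scorer
    from rouge_score.tokenizers import Tokenizer


    class JaTokenizer(Tokenizer):
        def __init__(self):
            # Build the tagger once instead of on every tokenize() call.
            self.tagger = fugashi.Tagger()

        def tokenize(self, text):
            return [w.surface for w in self.tagger(text)]


    scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], tokenizer=JaTokenizer())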
ROUGE alongside semantic metrics (2026 best practice)
    from anthropic import Anthropic

    client = Anthropic()


    def judge(article, summary):
        msg = client.messages.create(
            model="claude-opus-4-7",
            max_tokens=200,
            messages=[{"role": "user", "content": f"""
    Rate this summary 1-5 on faithfulness and coverage.
    Article: {article}
    Summary: {summary}
    Output JSON: {{"faithfulness": int, "coverage": int}}"""}],
        )
        # Returns the raw reply text; parsing the JSON is left to the caller.
        return msg.content[0].text
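The judge function above returns raw reply text, and a model may wrap the JSON in extra prose. The following is a hedged sketch of one way to parse it defensively; `parse_judge_output` is a hypothetical helper, not part of the Anthropic SDK, and it assumes the two keys and the 1-5 scale named in the prompt.

```python
import json
import re


def parse_judge_output(text):
    """Extract {"faithfulness": int, "coverage": int} from a judge reply.

    Returns None when no valid JSON object with both 1-5 integer scores
    is found, so callers can retry or skip instead of crashing.
    """
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)  # first '{' to last '}'
    if match is None:
        return None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    scores = {}
    for key in ("faithfulness", "coverage"):
        value = data.get(key)
        # Reject bools (a subclass of int) and anything outside the 1-5 scale.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            return None
        scores[key] = value
    return scores
```

Returning None rather than raising keeps a batch evaluation running when a single reply is malformed; whether to retry or drop such samples is a design choice left to the caller.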
Decision criteria

| Situation | Approach |
| --- | --- |
| Quick lexical baseline | ROUGE-1 + ROUGE-L F1 |
| Summarization paper | Report all of R-1 / R-2 / R-L |
| Semantic eval needed | BERTScore and ROUGE together |
| Factuality matters | LLM-judge or QAGS, not ROUGE |
| Multilingual | Plug in a language-specific tokenizer |
| Production monitoring | ROUGE-L + BERTScore + LLM-judge on a sample |
Default: report ROUGE-L F1 and ROUGE-1 F1, supplement them with BERTScore for semantic similarity, and add an LLM-judge for nuance. ROUGE alone is 2004-era practice; the 2026 norm is multi-metric reporting.
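One possible way to wire up this default is sketched below: cheap metrics run on every example, and the LLM-judge runs only on a random sample. Everything here is an assumption for illustration: `evaluate_batch`, its parameters, and the 10% default sampling rate are hypothetical, and the metric callables are expected to be thin wrappers you write around ROUGE, BERTScore, and so on.

```python
import random


def evaluate_batch(examples, metrics, judge=None, judge_rate=0.1, seed=0):
    """Score (article, reference, summary) triples with several metrics.

    metrics: dict of name -> callable(reference, summary) -> float,
        e.g. wrappers around ROUGE-L F1, ROUGE-1 F1, BERTScore F1.
    judge: optional callable(article, summary), run on a random sample
        only, since LLM calls are slow and cost money.
    """
    rng = random.Random(seed)  # fixed seed keeps the judged sample reproducible
    rows = []
    for article, reference, summary in examples:
        row = {name: fn(reference, summary) for name, fn in metrics.items()}
        if judge is not None and rng.random() < judge_rate:
            row["judge"] = judge(article, summary)
        rows.append(row)
    # Mean of each cheap metric over the whole batch.
    means = {
        name: (sum(r[name] for r in rows) / len(rows) if rows else 0.0)
        for name in metrics
    }
    return rows, means
```

Keeping the judge output in a separate `"judge"` field, rather than averaging it into the lexical scores, reflects the point that the LLM-judge is a complementary signal and not a replacement for the deterministic metrics.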
When to use an LLM: writing eval scripts, explaining ROUGE variants, generating reference summaries for testing.
When not to: as the metric itself. ROUGE is deterministic and needs no LLM; use an LLM-judge as a separate, complementary metric.
❌ Anti-patterns
ROUGE-only eval: penalizes synonyms and misses semantic equivalence.
No stemming: "running" vs "runs" counts as a mismatch (false negative).
Single reference: high variance; prefer multiple references where available. CNN/DM, XSum, and GovReport each provide only 1 reference per document, which is a limitation.
ROUGE for QA / dialog: these tasks are not summarization-shaped; use a task-specific metric.
ROUGE for factuality: ROUGE is surface-only, so a hallucinated summary still scores high if its word overlap is high.
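The synonym and factuality blind spots listed above can be seen in a small example. This is a minimal sketch, assuming a simplified ROUGE-1 F1 on lowercased whitespace tokens without stemming; `rouge1_f1` and the three sentences are made up for illustration.

```python
from collections import Counter


def rouge1_f1(ref, hyp):
    """Unigram-overlap F1 on lowercased whitespace tokens (no stemming)."""
    ref_counts = Counter(ref.lower().split())
    hyp_counts = Counter(hyp.lower().split())
    overlap = sum((ref_counts & hyp_counts).values())  # clipped overlap
    if overlap == 0:
        return 0.0
    p = overlap / sum(hyp_counts.values())
    r = overlap / sum(ref_counts.values())
    return 2 * p * r / (p + r)


reference = "the company reported a profit in the third quarter"
wrong = "the company reported a loss in the third quarter"  # contradicts the reference
paraphrase = "the firm posted earnings for q3"               # faithful, different words

wrong_score = rouge1_f1(reference, wrong)
paraphrase_score = rouge1_f1(reference, paraphrase)
print(f"contradicting summary: {wrong_score:.2f}")
print(f"faithful paraphrase:   {paraphrase_score:.2f}")
```

The contradicting summary shares 8 of 9 tokens with the reference and scores about 0.89, while the faithful paraphrase shares only "the" and scores about 0.13. This is why the table above routes factuality checks to an LLM-judge or QAGS.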
🧪 Verification / duplicates
Verified (Lin 2004 ACL "ROUGE: A Package for Automatic Evaluation of Summaries", rouge-score 0.1.x, HF evaluate 2026).