[G1-Sync] Manual knowledge update

2026-05-10 22:08:15 +09:00
parent 21ac3ed255
commit 504fd5fb42
3011 changed files with 380280 additions and 206977 deletions
@@ -2,62 +2,185 @@
 id: wiki-2026-0508-rouge-metrics
 title: ROUGE Metrics
 category: 10_Wiki/Topics
-status: needs_review
+status: verified
 canonical_id: self
-aliases: [NLP-MET-ROUGE-001]
+aliases: [ROUGE, ROUGE-1, ROUGE-2, ROUGE-L, Recall-Oriented Understudy for Gisting Evaluation]
 duplicate_of: none
 source_trust_level: A
-confidence_score: 1.0
-tags: [ai, nlp, metrics, rouge, summarization, evaluation, text-Analysis]
+confidence_score: 0.9
+verification_status: applied
+tags: [nlp, evaluation, summarization, metric]
 raw_sources: []
-last_reinforced: 2026-04-26
+last_reinforced: 2026-05-10
 github_commit: pending
-inferred_by: Claude Opus 4.7 (auto-normalize 2026-05-08)
+tech_stack:
+  language: python
+  framework: rouge-score
 ---

-# ROUGE Metrics (ROUGE 메트릭)
+# ROUGE Metrics

-## 📌 한 줄 통찰 (The Karpathy Summary)
-> "사람이 쓴 정답 요약문에서 지능(AI)이 얼마나 많은 핵심 단어와 문맥을 '재현'해냈는지를 정량적으로 측정하라" — 텍스트 요약 모델의 성능을 평가하기 위해 모델이 생성한 요약문과 참조 요약문 사이의 n-gram 겹침 정도를 계산하는 지표.
+## One-Line Summary
+> **"Summary quality measured as n-gram overlap recall."** Introduced by Lin (2004), ROUGE is the classic summarization evaluation metric. Its core trio is R-1 (unigram), R-2 (bigram), and R-L (longest common subsequence). As of 2026 it remains the default reference baseline, supplemented by BERTScore, BARTScore, and LLM-judge for semantic evaluation.

-## 📖 구조화된 지식 (Synthesized Content)
- **추출된 패턴:** "Recall-Oriented Overlap Analysis" — 요약의 목적은 '정보를 빠뜨리지 않는 것'에 있다는 관점에서, 참조 요약문의 단어들이 모델 출력에 얼마나 포함되어 있는지를 중심으로 성능을 산출하는 패턴.
- **주요 세부 지표:**
-    - **ROUGE-N:** 연속된 n개의 단어(Unigram, Bigram 등)가 얼마나 겹치는지 측정.
-    - **ROUGE-L:** 가장 긴 공통 부분 수열(LCS)을 기반으로 문장 구조의 유사성 측정.
-    - **ROUGE-W / ROUGE-S:** 가중치 적용 및 건너뛰기 허용 방식의 변형들.
- **의의:** 주관적일 수 있는 '요약의 품질'을 자동화된 수치로 환산하여, 수만 개의 요약 결과를 일관된 기준으로 비교하고 모델을 개선하게 함.
+## Key Points

-## ⚠️ 모순 및 업데이트 (Contradictions & Updates)
- **과거 데이터와의 충돌:** 단순히 단어가 많이 겹친다고 좋은 요약은 아니라는 한계(의미적 유사성 무시)를 극복하기 위해, 최근에는 [[BERT|BERT]]Score와 같은 시맨틱 임베딩 기반 지표나 LLM을 판별자로 쓰는 'LLM-as-a-judge' 방식이 보완적으로 사용됨.
- **정책 변화:** Antigravity 프로젝트는 1,174개 위키 문서의 자동 요약 기능을 검증할 때, 정보의 누락 여부를 확인하기 위해 ROUGE-L 지표를 기본 성능 평가 척도로 활용함.
+### ROUGE variants
+- **ROUGE-N**: n-gram overlap recall. R-1 (unigram), R-2 (bigram).
+- **ROUGE-L**: Longest Common Subsequence (LCS); captures sentence-level word order and allows gaps between matched words.
+- **ROUGE-W**: weighted LCS that favors consecutive matches.
+- **ROUGE-S / ROUGE-SU**: skip-bigram overlap (SU also counts unigrams); captures in-order word pairs with gaps allowed.
+- **ROUGE-Lsum**: summary-level ROUGE-L; the summary is split into sentences and LCS matches are aggregated across sentences.
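
To make the LCS idea behind ROUGE-L concrete, here is a minimal sketch in plain Python. It assumes lowercased whitespace tokenization and no stemming, so its numbers can differ from what a library such as rouge-score reports. The helper names `lcs_length` and `rouge_l` are illustrative, not part of any package.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)  # extend the common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))  # skip a token on one side
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate):
    """Sentence-level ROUGE-L as (precision, recall, F1) on whitespace tokens."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return precision, recall, 2 * precision * recall / (precision + recall)


p, r, f = rouge_l("the cat sat on the mat", "the cat quietly sat on a mat")
print(f"P={p:.3f} R={r:.3f} F1={f:.3f}")
```

The LCS here is "the cat sat on mat" (5 tokens): the inserted word "quietly" does not break the match, which is the gap tolerance that distinguishes ROUGE-L from bigram overlap.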

-## 🔗 지식 연결 (Graph)
- [[Natural-Language-Processing|Natural-Language-[[Processing]]-NLP]], [[Performance-Metrics-in-AI|Performance-Metrics-in-AI]], [[RAG-and-Document-Retrieval|RAG-and-Document-Retrieval]], [[Prompt-Engineering-Foundations|Prompt-Engineering-Foundations]]
- **Raw Source:** 10_Wiki/Topics/AI/ROUGE-Metrics.md
+### Formula
+- Recall = matches / |reference n-grams|
+- Precision = matches / |candidate n-grams|
+- F1 = 2·P·R / (P+R)
+- The original paper focuses on recall; modern usage typically reports F1.
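
The formulas above can be sketched directly with `collections.Counter`. This is a simplified illustration that assumes whitespace tokenization and no stemming; `ngrams` and `rouge_n` are hypothetical helper names, and "matches" is taken as the clipped overlap count (each n-gram counted at most as often as it appears in both texts).

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, candidate, n=1):
    """ROUGE-N as (precision, recall, F1) on lowercased whitespace tokens."""
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    matches = sum((ref & cand).values())  # Counter intersection = clipped overlap
    if matches == 0:
        return 0.0, 0.0, 0.0
    precision = matches / sum(cand.values())
    recall = matches / sum(ref.values())
    return precision, recall, 2 * precision * recall / (precision + recall)


ref = "the cat sat on the mat in the morning"
hyp = "a cat was sitting on the mat"
for n in (1, 2):
    p, r, f = rouge_n(ref, hyp, n)
    print(f"ROUGE-{n}: P={p:.3f} R={r:.3f} F1={f:.3f}")
```

For this pair, four unigrams match (cat, on, the, mat), giving P = 4/7, R = 4/9, F1 = 0.5; two bigrams match (on the, the mat), giving P = 2/6, R = 2/8.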

-## 🤖 LLM 활용 힌트 (How to Use This Knowledge)
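+### vs BLEU
+- BLEU: precision-oriented, machine translation. ROUGE: recall-oriented, summarization.
+- BLEU applies a brevity penalty; ROUGE does not need one, because recall already penalizes candidates that are too short.
+- BLEU is a corpus-level geometric mean of n-gram precisions; ROUGE is typically computed per example and then averaged.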
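
A small sketch of why the two metrics treat short outputs differently. It assumes whitespace tokens and a single reference, and it only computes unigram precision plus the standard BLEU brevity penalty, not a full BLEU score. The function names are illustrative.

```python
import math
from collections import Counter


def brevity_penalty(candidate_len, reference_len):
    """BLEU brevity penalty: 1 if the candidate is long enough, else exp(1 - r/c)."""
    if candidate_len == 0:
        return 0.0
    if candidate_len >= reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)


def unigram_precision_recall(reference, candidate):
    """Clipped unigram precision and recall on whitespace tokens."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    matches = sum((ref & cand).values())
    return matches / sum(cand.values()), matches / sum(ref.values())


ref = "the cat sat on the mat in the morning"
hyp = "the cat"  # very short candidate
precision, recall = unigram_precision_recall(ref, hyp)
bp = brevity_penalty(len(hyp.split()), len(ref.split()))
print(f"precision={precision:.3f}  BP={bp:.4f}  recall={recall:.3f}")
```

The two-word candidate has perfect unigram precision (1.0), so BLEU needs the brevity penalty (exp(1 - 9/2), about 0.03) to punish it. ROUGE recall is already low (2/9) without any extra term.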

-**언제 이 지식을 쓰는가:**
- *(TODO)*
+### Limitations
+- Surface-level: synonyms and paraphrases are penalized.
+- Does not capture fluency or factuality.
+- Sensitive to tokenization (BPE vs word-level).
+- Reference-dependent: scoring against a single reference gives high variance.
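
The first two limitations can be illustrated with a toy example. The sentences below are invented for illustration, and `rouge1_f1` is a simplified unigram F1 on whitespace tokens rather than a library implementation.

```python
from collections import Counter


def rouge1_f1(reference, candidate):
    """Unigram-overlap F1 on lowercased whitespace tokens (simplified ROUGE-1)."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    matches = sum((ref & cand).values())
    if matches == 0:
        return 0.0
    precision = matches / sum(cand.values())
    recall = matches / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "the company reported a profit this quarter"
paraphrase = "the firm announced earnings for this period"  # similar meaning, different words
wrong_fact = "the company reported a loss this quarter"     # opposite meaning, same words

print(f"paraphrase: {rouge1_f1(reference, paraphrase):.3f}")
print(f"wrong fact: {rouge1_f1(reference, wrong_fact):.3f}")
```

The paraphrase shares only two tokens with the reference and scores about 0.29, while the factually wrong sentence shares six of seven and scores about 0.86. Lexical overlap alone ranks them in the opposite order from a human reader.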

-**언제 쓰면 안 되는가:**
- *(TODO)*
+### Applications
+1. Summarization eval (CNN/DM, XSum, Gigaword).
+2. Long-document summarization (arXiv, GovReport, BookSum).
+3. RAG answer eval (vs gold answer).
+4. LLM eval reporting (still common alongside BERTScore and LLM-judge).

-## 🧪 검증 상태 (Validation)
+## 💻 Patterns

- **정보 상태:** needs_review
- **출처 신뢰도:** A
- **검토 이유:** *(P-Reinforce Phase 1 자동 정규화. 본문 검증 필요.)*
+### rouge-score basic
+```python
+from rouge_score import rouge_scorer

-## 🧬 중복 검사 (Duplicate Check)
+scorer = rouge_scorer.RougeScorer(
+    ["rouge1", "rouge2", "rougeL", "rougeLsum"],
+    use_stemmer=True,
+)
+ref = "the cat sat on the mat in the morning"
+hyp = "a cat was sitting on the mat"
+scores = scorer.score(ref, hyp)  # argument order: score(target, prediction)
+# {'rouge1': Score(precision=0.57, recall=0.44, fmeasure=0.50), ...}  (rounded)
+print(scores["rougeL"].fmeasure)
+```

- **기존 유사 문서:** *(TODO: 인덱서 클러스터 리포트 참조)*
- **처리 방식:** UPDATE (자동 정규화)
- **처리 이유:** Phase 1 정규화 — 옛 템플릿/누락 필드 보강.
+### HuggingFace evaluate
+```python
+import evaluate
+rouge = evaluate.load("rouge")
+results = rouge.compute(
+    predictions=["the cat sat on the mat"],
+    references=["a cat was sitting on the mat"],
+    use_stemmer=True,
+    use_aggregator=True,
+)
+# approx. {'rouge1': 0.62, 'rouge2': 0.36, 'rougeL': 0.62, 'rougeLsum': 0.62}  (F1 values)
+```

-## 🕓 변경 이력 (Changelog)
+### Batch eval on dataset
+```python
+from datasets import load_dataset
+import evaluate

-| 날짜 | 변경 내용 | 처리 방식 | 신뢰도 |
-|------|-----------|-----------|--------|
-| 2026-05-08 | P-Reinforce Phase 1 정규화 (frontmatter + 헤더 표준화) | UPDATE | A |
+ds = load_dataset("cnn_dailymail", "3.0.0", split="test[:100]")
+rouge = evaluate.load("rouge")
+
+# `model` is a placeholder for your own summarizer exposing .summarize(text) -> str
+preds = [model.summarize(x["article"]) for x in ds]
+refs = [x["highlights"] for x in ds]
+
+results = rouge.compute(predictions=preds, references=refs, use_stemmer=True)
+print(f"R-1: {results['rouge1']:.3f}  R-2: {results['rouge2']:.3f}  R-L: {results['rougeL']:.3f}")
+```
+
+### Tokenization-aware (multilingual)
+```python
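+# The default rouge-score tokenizer keeps only ASCII letters and digits, and its stemmer is English-only.
+# For other languages, plug in a language-specific tokenizer (e.g. sentencepiece or a morphological analyzer).
+from rouge_score import rouge_scorer
+from rouge_score.tokenizers import Tokenizer
+import fugashi  # MeCab wrapper; needs a dictionary such as unidic-lite installed
+
+class JaTokenizer(Tokenizer):
+    def __init__(self):
+        self.tagger = fugashi.Tagger()  # build once and reuse across calls
+
+    def tokenize(self, text):
+        return [w.surface for w in self.tagger(text)]
+
+scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], tokenizer=JaTokenizer())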
+```
+
+### ROUGE alongside semantic metrics (2026 best practice)
+```python
+import evaluate
+# preds / refs: lists of strings, e.g. as built in the batch-eval pattern
+rouge = evaluate.load("rouge")
+bertscore = evaluate.load("bertscore")
+
+r = rouge.compute(predictions=preds, references=refs, use_stemmer=True)
+b = bertscore.compute(predictions=preds, references=refs, lang="en", model_type="microsoft/deberta-xlarge-mnli")
+
+print(f"R-L: {r['rougeL']:.3f}  BERTScore-F1: {sum(b['f1'])/len(b['f1']):.3f}")
+```
+
+### LLM-as-judge supplement
+```python
+from anthropic import Anthropic
+client = Anthropic()
+
+def judge(article, summary):
+    msg = client.messages.create(
+        model="claude-opus-4-7",
+        max_tokens=200,
+        messages=[{"role": "user", "content": f"""
+Rate this summary 1-5 on faithfulness and coverage.
+Article: {article}
+Summary: {summary}
+Output JSON: {{"faithfulness": int, "coverage": int}}
+"""}],
+    )
+    return msg.content[0].text
+```
+
+## Decision Criteria
+| Situation | Approach |
+|---|---|
+| Quick lexical baseline | ROUGE-1 + ROUGE-L F1 |
+| Summarization paper | report all of R-1, R-2, R-L |
+| Semantic eval needed | both BERTScore and ROUGE |
+| Factuality matters | LLM-judge or QAGS, not ROUGE |
+| Multilingual | plug in a language-specific tokenizer |
+| Production monitoring | ROUGE-L + BERTScore + sampled LLM-judge |
+
+**Default**: report ROUGE-L F1 and ROUGE-1 F1, supplement with BERTScore for semantic similarity, and add LLM-judge for nuance. Reporting ROUGE alone is 2004-era practice; in 2026, use multiple metrics.
+
+## 🔗 Graph
+- Parent: [[NLP-Evaluation]] · [[Summarization]]
+- Variants: [[ROUGE-N]] · [[ROUGE-L]] · [[ROUGE-S]]
+- Applications: [[Summarization-Models]] · [[RAG-Evaluation]]
+- Adjacent: [[BLEU]] · [[METEOR]] · [[BERTScore]] · [[LLM-as-Judge]]
+
+## 🤖 LLM Usage
+**When to use**: writing eval scripts, explaining ROUGE variants, generating reference summaries for testing.
+**When not to use**: as the metric itself. ROUGE is deterministic and needs no LLM; use LLM-judge as a separate, complementary metric.
+
+## ❌ Anti-patterns
+- **ROUGE-only eval**: penalizes synonyms and misses semantic equivalence.
+- **No stemming**: "running" vs "runs" counts as a mismatch (false negative).
+- **Single reference**: high variance; prefer multiple references where available. CNN/DM, XSum, and GovReport each provide only one reference per document, which is a known limitation.
+- **ROUGE for QA / dialog**: these tasks are not summarization-shaped; use a task-specific metric.
+- **ROUGE for factuality**: ROUGE is surface-only, so a hallucinated summary can still score high when its words overlap with the reference.
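
For the single-reference anti-pattern, one common convention when several references exist is to score the candidate against each and keep the best score. The sketch below assumes that max-over-references convention and a simplified whitespace-token ROUGE-1 F1. `rouge1_f1` and `best_over_references` are illustrative names, not library functions.

```python
from collections import Counter


def rouge1_f1(reference, candidate):
    """Unigram-overlap F1 on lowercased whitespace tokens (simplified ROUGE-1)."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    matches = sum((ref & cand).values())
    if matches == 0:
        return 0.0
    precision = matches / sum(cand.values())
    recall = matches / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def best_over_references(references, candidate):
    """Score against every reference and keep the highest F1 (assumed convention)."""
    return max(rouge1_f1(ref, candidate) for ref in references)


references = [
    "the cat sat on the mat",
    "a cat was sitting on the mat",
]
candidate = "a cat sat on the mat"
for ref in references:
    print(f"{ref!r}: {rouge1_f1(ref, candidate):.3f}")
print(f"best: {best_over_references(references, candidate):.3f}")
```

With two references the candidate is no longer penalized for matching the wording of only one of them, which reduces the variance that a single reference introduces.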
+
+## 🧪 Verification / Duplicates
+- Verified (Lin 2004 ACL "ROUGE: A Package for Automatic Evaluation of Summaries", rouge-score 0.1.x, HF evaluate 2026).
+- Trust level: A.
+
+## 🕓 Changelog
+| Date | Change |
+|---|---|
+| 2026-05-08 | Phase 1 |
+| 2026-05-10 | Manual cleanup — ROUGE variants, rouge-score patterns, 2026 multi-metric guidance |