---
id: wiki-2026-0508-term-frequency-inverse-document-
title: Term Frequency-Inverse Document Frequency
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [TF-IDF, tfidf, classic IR baseline]
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: [ir, nlp, retrieval, baseline, sklearn]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: scikit-learn
---
# Term Frequency-Inverse Document Frequency
## One-line summary
> **"Term frequency × inverse document frequency: a weight for a word's importance in the context of a corpus."** Karen Spärck Jones formalized IDF in 1972. As of 2026, dense retrieval (BGE, E5) is the default, but TF-IDF is still ubiquitous as a baseline and as the second stage of hybrid (BM25 + dense) retrieval.
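## Core concepts
### Formula
- **TF**: the count of a term in a document (raw, log-normalized, or relative frequency).
- **IDF**: `log(N / df_t)`, where N is the number of documents in the corpus and df_t is the number of documents that contain term t.
- **TF-IDF**: TF(t,d) × IDF(t).
- **L2 norm**: scales each document vector to unit length, preparing it for cosine similarity.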
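
As a minimal sketch of the formula above, the following pure-Python example computes raw-count TF, unsmoothed IDF `log(N / df_t)`, and L2-normalized TF-IDF vectors for a toy corpus. The helper names (`tfidf_vectors`, `l2_normalize`) and the whitespace tokenizer are assumptions made for illustration; they are not part of any library.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using raw TF and log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # df[t] = number of documents that contain term t at least once
    df = Counter(t for tokens in tokenized for t in set(tokens))
    idf = {t: math.log(n_docs / df_t) for t, df_t in df.items()}
    return [
        {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]

def l2_normalize(vec):
    """Scale a sparse vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)

docs = ["the cat sat on the mat", "the dog ate the bone", "cats and dogs are pets"]
vectors = [l2_normalize(v) for v in tfidf_vectors(docs)]
```

In this toy corpus "the" appears in two of three documents, so it ends up with a lower weight than "cat" even though it occurs twice in the first document. A term present in every document would get IDF `log(1) = 0` under this unsmoothed form.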
### Variants
- TF scaling: raw TF, log(1+TF), or sublinear `1 + log(TF)` (what sklearn's `sublinear_tf=True` applies).
- IDF smoothing: `log((1+N)/(1+df)) + 1`.
- BM25: adds TF saturation and document-length normalization.
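
To make the two BM25 additions concrete, here is a small sketch of the standard Okapi BM25 per-term score. The function name `bm25_term_score` and the fixed `idf=1.0` inputs are illustrative assumptions; the defaults `k1=1.5` and `b=0.75` mirror the values used in the BM25 pattern further down this page.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, idf, k1=1.5, b=0.75):
    """Okapi BM25 contribution of one term to one document's score."""
    # Length normalization: documents longer than average are penalized when b > 0.
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    # Saturation: the fraction approaches (k1 + 1) as tf grows,
    # so each repeated occurrence matters less than the one before.
    return idf * (tf * (k1 + 1)) / (tf + k1 * length_norm)

# Saturation: marginal gain of one more occurrence, for tf = 1..4.
gains = [
    bm25_term_score(tf + 1, 100, 100, 1.0) - bm25_term_score(tf, 100, 100, 1.0)
    for tf in range(1, 5)
]

# Length normalization: same tf, longer document, lower score.
short_doc = bm25_term_score(3, doc_len=50, avg_doc_len=100, idf=1.0)
long_doc = bm25_term_score(3, doc_len=400, avg_doc_len=100, idf=1.0)
```

The `gains` list shrinks monotonically, and `short_doc` exceeds `long_doc`. Plain TF-IDF with raw counts shows neither behavior.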
### Applications
1. Search baseline (scikit-learn `TfidfVectorizer` + cosine similarity).
2. Hybrid retrieval: fuse BM25 and dense-embedding rankings with reciprocal rank fusion (RRF).
3. Feature extraction for classical ML (logistic regression, SVM).
4. Keyword extraction (top-k TF-IDF terms).
## 💻 Patterns
### sklearn TF-IDF
```python
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    "the cat sat on the mat",
    "the dog ate the bone",
    "cats and dogs are pets",
]
vec = TfidfVectorizer(stop_words="english", sublinear_tf=True, ngram_range=(1, 2))
X = vec.fit_transform(corpus) # sparse (n_docs, n_features)
print(vec.get_feature_names_out())
```
### Cosine search
```python
from sklearn.metrics.pairwise import cosine_similarity
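# Project the query into the fitted vocabulary (transform, not fit_transform).
q = vec.transform(["pet animals"])
sims = cosine_similarity(q, X).flatten()
# Matching is on exact tokens: without stemming, "pet" does not match "pets",
# so this toy query scores 0 against every document.
ranking = sims.argsort()[::-1]  # doc indices, best first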
```
### Manual IDF (educational)
```python
import math
from collections import Counter
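def compute_idf(corpus_tokens):
    """Smoothed IDF per term; corpus_tokens is a list of token lists."""
    N = len(corpus_tokens)
    df = Counter()
    for tokens in corpus_tokens:
        # set() so each document counts a term at most once
        for t in set(tokens):
            df[t] += 1
    return {t: math.log((N + 1) / (df_t + 1)) + 1 for t, df_t in df.items()}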
```
### BM25 (preferred over plain TF-IDF for IR)
```python
from rank_bm25 import BM25Okapi
tokenized = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized, k1=1.5, b=0.75)
scores = bm25.get_scores("pet animals".split())
```
### Hybrid search (2026 standard)
```python
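import numpy as np

def rrf(rankings, k=60):
    """Reciprocal rank fusion over several ranked lists of doc ids."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Rank by doc index so both channels share the same ids
# (get_top_n would return document texts instead).
bm25_scores = bm25.get_scores("query".split())
bm25_top = [int(i) for i in np.argsort(-bm25_scores)[:100]]
# dense_index is a placeholder for any dense retriever; it is assumed
# to return doc indices, best first.
dense_top = dense_index.search("query", k=100)
final = rrf([bm25_top, dense_top])[:10]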
```
### Top keyword extraction
```python
import numpy as np

def top_keywords(doc_idx, vec, X, k=10):
    """Return the k highest-weighted (term, tfidf) pairs for one document."""
    row = X[doc_idx].toarray().flatten()
    feats = vec.get_feature_names_out()
    top = np.argsort(-row)[:k]
    return [(feats[i], row[i]) for i in top]
```
### Persistence
```python
import joblib
joblib.dump((vec, X), "tfidf_index.joblib")
vec, X = joblib.load("tfidf_index.joblib")
```
## Decision criteria
| Situation | Approach |
|---|---|
| Small corpus + interpretability | TF-IDF (sklearn) |
| Medium corpus + better recall | BM25 |
| Semantic / paraphrase matching | Dense (BGE-M3, E5) |
| Production search | Hybrid (BM25 + dense + RRF) |
| Keyword extraction / explanation | Plain TF-IDF top-k |
**Default**: BM25 baseline → hybrid + reranker (cross-encoder) for 2026 production.
## 🔗 Graph
- Parent: [[Information Retrieval]]
- Variant: [[BM25]]
- Applications: [[Search Engine]] · [[RAG]]
- Adjacent: [[Dense Retrieval]]
## 🤖 LLM usage
**When to use**: small-corpus lookup, the sparse channel of a RAG pipeline, and explainability ("matched on 'mat', 'cat'").
**When not to use**: paraphrased or multilingual queries, where TF-IDF is weak; prefer dense retrieval.
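
The explainability point can be sketched as follows: because a TF-IDF score is a sum of per-term products, the matched terms and their contributions can be listed directly. The function `explain_match` and the weights below are hypothetical, chosen only to illustrate the output shape.

```python
def explain_match(query_vec, doc_vec, top=5):
    """Return (term, contribution) pairs for terms shared by query and document, largest first."""
    shared = set(query_vec) & set(doc_vec)
    contributions = {t: query_vec[t] * doc_vec[t] for t in shared}
    return sorted(contributions.items(), key=lambda kv: kv[1], reverse=True)[:top]

# Hypothetical sparse TF-IDF weights ({term: weight}).
query_vec = {"cat": 0.8, "mat": 0.6}
doc_vec = {"cat": 0.5, "sat": 0.5, "mat": 0.7}

matched = explain_match(query_vec, doc_vec)
score = sum(c for _, c in matched)  # equals the dot product of the two vectors
```

Here `matched` lists "mat" and "cat" with their contributions, which can be surfaced to a user or passed to an LLM as the reason a document was retrieved. Dense embeddings offer no comparable per-term breakdown.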
## ❌ Anti-patterns
- **TF-IDF alone for production search**: misses paraphrases.
- **No stopword removal or lowercasing**: noisy features.
- **Fitted vectorizer not persisted**: refitting at serving time causes a train/serve mismatch.
- **No length normalization**: long documents get an unfair advantage (use BM25 or normalize).
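
A small sketch of the length-normalization anti-pattern, using hypothetical raw-count vectors: without normalization a long document that mentions the query term twice among much unrelated text outranks a short, focused one, and L2 normalization reverses that. The helper names `dot` and `l2_normalize` are illustrative assumptions.

```python
import math

def dot(a, b):
    """Dot product of two sparse {term: weight} vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def l2_normalize(vec):
    """Scale a sparse vector to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)

query = {"cat": 1.0}
# Short document focused on the query term.
short_doc = {"cat": 1.0, "mat": 1.0}
# Long document: two mentions of the term among 50 unrelated terms.
long_doc = {"cat": 2.0, **{f"filler{i}": 1.0 for i in range(50)}}

raw_scores = (dot(query, short_doc), dot(query, long_doc))
norm_scores = (
    dot(query, l2_normalize(short_doc)),
    dot(query, l2_normalize(long_doc)),
)
```

With raw weights the long document wins; after normalization the short one does.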
## 🧪 Verification / duplicates
- Verified (Spärck Jones 1972; Manning IR Book Ch.6; sklearn TfidfVectorizer 2026).
- Trust level: A.
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — TF-IDF formula + sklearn + BM25 + hybrid RRF |