---
id: wiki-2026-0508-term-frequency-inverse-document-
title: Term Frequency-Inverse Document Frequency
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [TF-IDF, tfidf, classic IR baseline]
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: [ir, nlp, retrieval, baseline, sklearn]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: scikit-learn
---
# Term Frequency-Inverse Document Frequency
## One-liner
> **"Term frequency × inverse document frequency: a word's importance, weighted by its corpus context."** Karen Spärck Jones formalized IDF in 1972. As of 2026, dense retrieval (BGE, E5) is the default, yet TF-IDF remains ubiquitous as a baseline and as the sparse channel of hybrid (BM25 + dense) retrieval.
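## Core
### Formula
- **TF**: how often the term occurs in the document (raw count, log-normalized, or relative frequency).
- **IDF**: `log(N / df_t)`, where N is the number of documents in the corpus and df_t is the number of documents that contain term t.
- **TF-IDF**: TF(t,d) × IDF(t).
- **L2 norm**: scales each document vector to unit length, so a dot product between vectors equals their cosine similarity.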
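The formula above can be sketched in plain Python. This is a minimal illustration, assuming raw-count TF and the unsmoothed `log(N / df_t)` IDF; the names `docs`, `tfidf_vector`, and `vec0` are made up for this example and are not part of any library.

```python
import math
from collections import Counter

# Toy corpus, already tokenized by whitespace (illustrative only).
docs = [
    "the cat sat on the mat".split(),
    "the dog ate the bone".split(),
    "cats and dogs are pets".split(),
]
N = len(docs)

# Document frequency: number of documents that contain each term.
df = Counter(t for doc in docs for t in set(doc))

def tfidf_vector(doc):
    """Raw-count TF times log(N / df_t), then L2-normalized."""
    tf = Counter(doc)
    weights = {t: tf[t] * math.log(N / df[t]) for t in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: (w / norm if norm else 0.0) for t, w in weights.items()}

vec0 = tfidf_vector(docs[0])
for term, weight in sorted(vec0.items(), key=lambda kv: -kv[1]):
    print(f"{term:>4}  {weight:.3f}")
```

Although "the" occurs twice in the first document, it also appears in a second document, so its IDF is lower and it ends up weighted below "cat", which occurs once but only in this document.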
### Variants
- TF: raw count, `log(1 + TF)`, or sublinear scaling (`1 + log(TF)`, as with sklearn's `sublinear_tf=True`).
- IDF smoothing: `log((1+N)/(1+df)) + 1`.
- BM25: adds TF saturation and document-length normalization.
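To make the BM25 bullet concrete, here is a hedged sketch of the per-term Okapi BM25 score. The function name `bm25_term_score` is hypothetical, and the IDF shown is the non-negative `log(1 + ...)` variant, which is an assumption; libraries differ in the exact IDF they use.

```python
import math

def bm25_term_score(tf, df, n_docs, doc_len, avg_doc_len, k1=1.5, b=0.75):
    """Contribution of one query term to one document under Okapi BM25."""
    # Assumed IDF variant: non-negative form; other implementations differ.
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    # Length normalization: documents longer than average are penalized via b.
    length_norm = 1 - b + b * doc_len / avg_doc_len
    # Saturation: the TF factor approaches (k1 + 1) as tf grows.
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)

# Repeating a term yields diminishing returns instead of linear growth.
for tf in (1, 2, 5, 20):
    score = bm25_term_score(tf, df=10, n_docs=1000, doc_len=100, avg_doc_len=100)
    print(tf, round(score, 3))
```

With `b=0` length normalization is switched off, and a very large `k1` makes the TF factor behave almost linearly, which is closer to plain TF-IDF.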
### Applications
1. Search baseline (scikit-learn `TfidfVectorizer` + cosine similarity).
2. Hybrid retrieval: BM25 and dense-embedding rankings combined with reciprocal rank fusion (RRF).
3. Feature extraction for classical ML (logistic regression, SVM).
4. Keyword extraction (top-k TF-IDF terms).
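For application 3, a minimal sketch of TF-IDF features feeding a linear classifier might look like the following. The toy texts and labels are invented for illustration, and the hyperparameters are scikit-learn defaults rather than tuned values.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labelled data, invented for illustration only.
texts = [
    "the cat sat on the mat",
    "cats purr and chase mice",
    "my cat sleeps all day",
    "the dog ate the bone",
    "dogs bark at the mailman",
    "my dog fetches the ball",
]
labels = ["cat", "cat", "cat", "dog", "dog", "dog"]

# Keeping the vectorizer inside the pipeline means the same fitted
# vocabulary and IDF weights are reused at prediction time.
clf = make_pipeline(TfidfVectorizer(sublinear_tf=True), LogisticRegression())
clf.fit(texts, labels)

pred = clf.predict(["a cat sleeps on the mat", "a dog fetches a bone"])
print(list(pred))
```

Wrapping both steps in one pipeline also makes persistence simpler, since a single object carries the vectorizer and the model together.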
## 💻 Patterns
### sklearn TF-IDF
```python
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    "the cat sat on the mat",
    "the dog ate the bone",
    "cats and dogs are pets",
]
vec = TfidfVectorizer(stop_words="english", sublinear_tf=True, ngram_range=(1, 2))
X = vec.fit_transform(corpus) # sparse (n_docs, n_features)
print(vec.get_feature_names_out())
```
### Cosine search
```python
from sklearn.metrics.pairwise import cosine_similarity
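# No stemming is applied, so query terms must match vocabulary tokens exactly
# ("pet" would not match "pets" and would score zero against every document).
q = vec.transform(["cats and pets"])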
sims = cosine_similarity(q, X).flatten()
ranking = sims.argsort()[::-1]
```
### Manual IDF (educational)
```python
import math
from collections import Counter
def compute_idf(corpus_tokens):
    """Smoothed IDF per term: log((N + 1) / (df_t + 1)) + 1."""
    N = len(corpus_tokens)
    df = Counter()
    for tokens in corpus_tokens:
        for t in set(tokens):  # count each term once per document
            df[t] += 1
    return {t: math.log((N + 1) / (df_t + 1)) + 1 for t, df_t in df.items()}
```
### BM25 (preferred over plain TF-IDF for IR)
```python
from rank_bm25 import BM25Okapi
tokenized = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized, k1=1.5, b=0.75)
# Query tokens must match corpus tokens exactly (no stemming here).
scores = bm25.get_scores("cats pets".split())
```
### Hybrid search (2026 standard)
```python
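import numpy as np

def rrf(rankings, k=60):
    """Reciprocal rank fusion over several ranked lists of doc ids."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

query = "query"
# Both rankings must hold doc ids (corpus indices), not document text.
bm25_top = np.argsort(-bm25.get_scores(query.split()))[:100].tolist()
# dense_index is a placeholder for a dense retriever that returns doc ids.
dense_top = dense_index.search(query, k=100)
final = rrf([bm25_top, dense_top])[:10]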
```
### Top keyword extraction
```python
import numpy as np

def top_keywords(doc_idx, vec, X, k=10):
    """Return the k highest-weighted (term, tfidf) pairs for one document."""
    row = X[doc_idx].toarray().flatten()
    feats = vec.get_feature_names_out()
    top = np.argsort(-row)[:k]
    return [(feats[i], row[i]) for i in top]
```
### Persistence
```python
import joblib
joblib.dump((vec, X), "tfidf_index.joblib")
vec, X = joblib.load("tfidf_index.joblib")
```
## Decision criteria
| Situation | Approach |
|---|---|
| Small corpus + interpretability | TF-IDF (sklearn) |
| Medium corpus + better recall | BM25 |
| Semantic / paraphrase matching | Dense (BGE-M3, E5) |
| Production search | Hybrid (BM25 + dense + RRF) |
| Keyword extraction / explanation | Plain TF-IDF top-k |
**Default**: BM25 baseline → hybrid + reranker (cross-encoder) for 2026 production.
## 🔗 Graph
- Parent: [[Information Retrieval]] · [[Bag of Words]]
- Variants: [[BM25]] · [[BM25F]]
- Applications: [[Search Engine]] · [[Hybrid Retrieval]] · [[RAG]]
- Adjacent: [[Cosine Similarity]] · [[Dense Retrieval]] · [[Cross-Encoder Rerank]]
## 🤖 LLM usage
**When**: lookup over a small corpus, the sparse channel of a RAG pipeline, and explainability ("matched on 'mat', 'cat'").
**When not**: paraphrased or multilingual queries, where exact-term matching is weak; prefer dense retrieval.
## ❌ Anti-patterns
- **Production search on TF-IDF alone**: misses paraphrases.
- **No stopword removal or lowercasing**: noisy features.
- **Fitted vectorizer not persisted**: train/serve mismatch, because a refit vectorizer yields a different vocabulary and different IDF weights.
- **No length normalization**: long documents get an unfair advantage (use BM25 or normalize).
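The length-normalization anti-pattern can be illustrated with a small experiment. This is a sketch under simple assumptions: two invented documents, one short and fully on-topic, one long and mostly off-topic, scored with and without the vectorizer's L2 normalization. The helper name `dot_scores` is hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A short on-topic document and a long, mostly off-topic one (invented data).
short_doc = "cat food"
long_doc = "cat food " * 3 + "weather sports politics travel music movies finance history " * 5
docs = [short_doc, long_doc]

def dot_scores(norm):
    """Dot-product scores of the query against each document."""
    vec = TfidfVectorizer(norm=norm)
    X = vec.fit_transform(docs)
    q = vec.transform(["cat food"])
    return (X @ q.T).toarray().ravel()

raw = dot_scores(None)          # no normalization: repeated terms win
normalized = dot_scores("l2")   # unit-length vectors: cosine similarity

print("raw:       ", raw)
print("normalized:", normalized)
```

Without normalization the long document outranks the short one simply because it repeats the query terms; with L2 normalization the short, fully on-topic document ranks first.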
## 🧪 Verification / Duplicates
- Verified (Spärck Jones 1972; Manning IR Book Ch.6; sklearn TfidfVectorizer 2026).
- Trust level: A.
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: TF-IDF formula + sklearn + BM25 + hybrid RRF |