---
id: wiki-2026-0508-term-frequency-inverse-document-
title: Term Frequency-Inverse Document Frequency
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [TF-IDF, tfidf, classic IR baseline]
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: [ir, nlp, retrieval, baseline, sklearn]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: scikit-learn
---

# Term Frequency-Inverse Document Frequency

## One-line summary

> **"Term frequency × inverse document frequency: a weight for how important a word is to a document, given the rest of the corpus."** Karen Spärck Jones formalized IDF in 1972. In 2026, even with dense retrieval (BGE, E5) as the default, TF-IDF remains ubiquitous as a baseline and as the sparse component of hybrid (BM25 + dense) retrieval.

## Core concepts

### Formula

- **TF**: how often term t occurs in document d (raw count, log-normalized, or relative frequency).
- **IDF**: `log(N / df_t)`, where N is the number of documents in the corpus and df_t is the number of documents that contain t.
- **TF-IDF**: TF(t,d) × IDF(t).
- **L2 norm**: scale each document vector to unit length so that a dot product equals cosine similarity.

### Variants

- TF: raw count, `log(1 + tf)`, or sublinear `1 + log(tf)` (scikit-learn's `sublinear_tf=True`).
- IDF smoothing: `log((1+N)/(1+df)) + 1`.
- BM25: adds TF saturation and document-length normalization.

### Applications

1. Search baseline (scikit-learn).
2. Hybrid retrieval: fuse BM25 and dense-embedding rankings with reciprocal rank fusion.
3. Feature extraction for classical ML (logistic regression, SVM).
4. Keyword extraction (top-k TF-IDF terms).
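The formula in the core section (raw TF × `log(N / df_t)`, then L2 normalization) can be traced end to end in plain Python. This is a minimal sketch, not how scikit-learn computes it: scikit-learn uses the smoothed IDF by default. The toy corpus and the helper names `tfidf_vectors` and `cosine` are assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocab, vectors) using raw TF, IDF = log(N / df), and L2 normalization."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vocab = sorted(df)
    idf = {term: math.log(n_docs / df[term]) for term in vocab}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = [tf[term] * idf[term] for term in vocab]
        norm = math.sqrt(sum(w * w for w in weights))
        # Guard against an all-zero vector (every term occurs in every document).
        vectors.append([w / norm if norm else 0.0 for w in weights])
    return vocab, vectors


def cosine(u, v):
    # For L2-normalized vectors the dot product is the cosine similarity.
    return sum(a * b for a, b in zip(u, v))


# Illustrative toy corpus (an assumption, chosen so that "the" occurs in every document).
docs = [
    "the cat sat on the mat",
    "the dog ate the bone",
    "the cat chased the dog",
]
vocab, vectors = tfidf_vectors(docs)
print(cosine(vectors[0], vectors[1]))  # documents share only "the", whose IDF is log(3/3) = 0
print(cosine(vectors[0], vectors[2]))  # documents share "cat", so the similarity is positive
```

With unsmoothed IDF, a term that occurs in every document gets weight zero. That is why the smoothed variant adds 1: ubiquitous terms are down-weighted rather than dropped entirely.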
## 💻 Patterns

### sklearn TF-IDF

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog ate the bone",
    "cats and dogs are pets",
]

# Drop English stop words, use sublinear TF (1 + log(tf)), index unigrams and bigrams
vec = TfidfVectorizer(stop_words="english", sublinear_tf=True, ngram_range=(1, 2))
X = vec.fit_transform(corpus)  # sparse matrix, shape (n_docs, n_features)
print(vec.get_feature_names_out())
```

### Cosine search

```python
from sklearn.metrics.pairwise import cosine_similarity

# No stemming is applied, so query terms must match the vocabulary exactly
# ("pet" would not match "pets" and every similarity would be 0).
q = vec.transform(["dogs and pets"])
sims = cosine_similarity(q, X).flatten()
ranking = sims.argsort()[::-1]  # doc indices, most similar first
```

### Manual IDF (educational)

```python
import math
from collections import Counter

def compute_idf(corpus_tokens):
    """Smoothed IDF, log((1 + N) / (1 + df)) + 1, for a list of token lists."""
    N = len(corpus_tokens)
    df = Counter()
    for tokens in corpus_tokens:
        for t in set(tokens):  # count each term once per document
            df[t] += 1
    return {t: math.log((N + 1) / (df_t + 1)) + 1 for t, df_t in df.items()}
```

### BM25 (preferred over plain TF-IDF for IR)

```python
from rank_bm25 import BM25Okapi

tokenized = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized, k1=1.5, b=0.75)
# Tokenize the query the same way as the corpus; there is no stemming here either
scores = bm25.get_scores("dogs pets".lower().split())
```

### Hybrid search (2026 standard)

```python
import numpy as np

def rrf(rankings, k=60):
    """Reciprocal rank fusion over ranked lists of doc ids."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

query = "query"
# Fuse doc indices, not document text, so both rankings share one id space
bm25_top = np.argsort(-bm25.get_scores(query.split()))[:100].tolist()
# `dense_index` is a placeholder for a dense retriever; it is assumed to return
# a ranked list of doc indices
dense_top = dense_index.search(query, k=100)
final = rrf([bm25_top, dense_top])[:10]
```

### Top keyword extraction

```python
import numpy as np

def top_keywords(doc_idx, vec, X, k=10):
    """Return the k highest-weighted (term, weight) pairs for one document."""
    row = X[doc_idx].toarray().flatten()
    feats = vec.get_feature_names_out()
    top = np.argsort(-row)[:k]
    return [(feats[i], row[i]) for i in top]
```

### Persistence

```python
import joblib

# Persist the fitted vectorizer together with the matrix so that serving
# reuses the same vocabulary and IDF weights
joblib.dump((vec, X), "tfidf_index.joblib")
vec, X = joblib.load("tfidf_index.joblib")
```

## Decision criteria

| Situation | Approach |
|---|---|
| Small corpus + interpretability | TF-IDF (sklearn) |
| Medium corpus + better recall | BM25 |
| Semantic / paraphrase matching | Dense (BGE-M3, E5) |
| Production search | Hybrid (BM25 + dense + RRF) |
| Keyword extraction / explanation | Plain TF-IDF top-k |

**Default**: start from a BM25 baseline, then move to hybrid + reranker (cross-encoder) for 2026 production.

## 🔗 Graph

- Parent: [[Information Retrieval]]
- Variant: [[BM25]]
- Applications: [[Search Engine]] · [[RAG]]
- Adjacent: [[Dense Retrieval]]

## 🤖 LLM usage

**When to use**: lookup over a small corpus, the sparse channel of a RAG pipeline, and explainability ("matched on 'mat', 'cat'").

**When not to use**: paraphrased or multilingual queries, where lexical matching is weak; prefer dense retrieval.

## ❌ Anti-patterns

- **TF-IDF alone for production search**: misses paraphrases.
- **No stop-word removal or lowercasing**: noisy features.
- **Not persisting the fitted vectorizer**: vocabulary and IDF weights drift between training and serving.
- **No length normalization**: long documents get an unfair advantage (use BM25 or normalize the vectors).

## 🧪 Verification / duplicates

- Verified (Spärck Jones 1972; Manning IR Book Ch.6; sklearn TfidfVectorizer 2026).
- Trust level: A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: TF-IDF formula + sklearn + BM25 + hybrid RRF |