"매 term frequency × inverse document frequency — 매 word's importance 의 corpus context 매 weight". Karen Spärck Jones (1972) 의 IDF formalization. 2026 매 dense retrieval (BGE, E5) 매 default 매도 매 baseline + hybrid (BM25 + dense) 의 second stage 매 still ubiquitous.
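Key points
Formula
TF: how often the term occurs in the document (raw count, log-normalized, or relative frequency).
IDF: log(N / df_t), where N is the number of documents in the corpus and df_t is the number of documents that contain term t.
TF-IDF: TF(t, d) × IDF(t).
L2 norm: scale each document vector to unit length so that a dot product equals cosine similarity.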
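The formula above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: it assumes whitespace tokenization, raw-count TF, and the unsmoothed IDF log(N / df_t); the helper names `tfidf_vectors` and `cosine` are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one L2-normalized {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # df[t] = number of documents that contain term t at least once
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts (assumption: no log scaling)
        weights = {term: count * idf[term] for term, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # Guard against an all-zero vector (every term occurs in every doc)
        vectors.append({t: w / norm for t, w in weights.items()} if norm else weights)
    return vectors

def cosine(u, v):
    """Dot product of two L2-normalized sparse vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

docs = [
    "the cat sat on the mat".split(),
    "the cat ate the fish".split(),
    "dogs chase the ball".split(),
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # positive: the two docs share "cat"
print(cosine(vecs[0], vecs[2]))  # zero: the only shared term, "the", has IDF 0
```

Because "the" occurs in all three documents, its IDF is log(3 / 3) = 0, so it contributes nothing to any similarity score.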
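Variants
TF scaling: raw TF, log(1 + TF), or sklearn's sublinear_tf, which uses 1 + log(TF).
IDF smoothing: log((1 + N) / (1 + df)) + 1, which avoids division by zero and keeps terms that occur in every document from getting zero weight.
BM25: adds TF saturation and document-length normalization on top of TF-IDF.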
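To make TF saturation and length normalization concrete, here is a small BM25 sketch in plain Python. The parameter values k1=1.5 and b=0.75 are conventional defaults assumed for illustration, the IDF is a commonly used non-negative variant, and the function name `bm25_scores` is hypothetical.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against a tokenized query with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        # Length normalization: docs longer than average are penalized
        length_factor = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Non-negative IDF variant (an assumption; other forms exist)
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # TF saturation: the ratio approaches (k1 + 1) as tf grows
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_factor)
        scores.append(score)
    return scores

docs = [
    "cat cat cat cat sat".split(),
    "cat sat".split(),
    "dog ate bone".split(),
]
scores = bm25_scores(["cat"], docs)
print(scores)
```

The first document mentions "cat" four times as often as the second, yet its score is far less than four times higher: repeated occurrences saturate, and the longer document is penalized for its length.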
Applications
Search baseline (scikit-learn).
Hybrid retrieval: fuse BM25 and dense-embedding rankings with reciprocal rank fusion.
Feature extraction for classical ML (logistic regression, SVM).
Keyword extraction (top-k terms by TF-IDF weight).
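The hybrid-retrieval item can be illustrated with a short reciprocal rank fusion sketch. The document ids and both rankings are made up for the example, k=60 is a commonly used constant assumed here, and `reciprocal_rank_fusion` is a hypothetical helper name.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids; each list is ordered best-first."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every doc it returns
            fused[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, highest first
    return sorted(fused, key=fused.get, reverse=True)

bm25_ranking = ["d1", "d2", "d3"]   # hypothetical sparse (BM25) results
dense_ranking = ["d3", "d1", "d4"]  # hypothetical dense-embedding results
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)  # ['d1', 'd3', 'd2', 'd4']
```

Fusion works on ranks only, so the BM25 and dense scores never need to be put on a common scale. Documents returned by both channels (d1, d3) rise above those returned by only one.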
💻 Patterns
sklearn TF-IDF
    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = [
        "the cat sat on the mat",
        "the dog ate the bone",
        "cats and dogs are pets",
    ]
    vec = TfidfVectorizer(stop_words="english", sublinear_tf=True, ngram_range=(1, 2))
    X = vec.fit_transform(corpus)  # sparse (n_docs, n_features)
    print(vec.get_feature_names_out())
When to use: lookup over a small corpus, the sparse channel of a RAG pipeline, or when explainability matters ("matched on 'mat', 'cat'").
When not to use: paraphrased or multilingual queries, where lexical matching is weak; prefer dense retrieval.
❌ Anti-patterns
TF-IDF alone for production search: misses paraphrases.
No stopword removal or lowercasing: noisy features.
Not pickling and reusing the same fitted vectorizer: train/serve mismatch in vocabulary and IDF weights.
No length normalization: long documents get an unfair advantage (use BM25 or normalize).
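The length-normalization point can be checked with a toy comparison. This sketch assumes raw term counts and a document that is simply the short one repeated ten times; the helper names `raw_dot` and `cosine_counts` are hypothetical.

```python
import math
from collections import Counter

def raw_dot(query, doc):
    """Unnormalized dot product of raw term-count vectors."""
    q, d = Counter(query), Counter(doc)
    return sum(q[t] * d[t] for t in q)

def cosine_counts(query, doc):
    """Cosine similarity of raw term-count vectors (L2-normalized dot product)."""
    q, d = Counter(query), Counter(doc)
    norm = math.sqrt(sum(c * c for c in q.values())) * math.sqrt(
        sum(c * c for c in d.values())
    )
    return raw_dot(query, doc) / norm if norm else 0.0

short_doc = "cat sat mat".split()
long_doc = short_doc * 10  # same content repeated ten times
query = ["cat"]

print(raw_dot(query, short_doc), raw_dot(query, long_doc))
print(cosine_counts(query, short_doc), cosine_counts(query, long_doc))
```

Without normalization the repeated document scores ten times higher even though it carries no extra information; after L2 normalization the two documents score the same.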
🧪 Verification / Duplicates
Verified (Spärck Jones 1972; Manning IR Book Ch.6; sklearn TfidfVectorizer 2026).