---
id: wiki-2026-0508-bag-of-words-bow
title: Bag of Words (BoW)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [BoW, 단어 가방, count vectorizer, TF-IDF, n-gram]
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: [nlp, text-representation, bow, tfidf, ngram, baseline, classical-ml, sparse-vector]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: scikit-learn / NLTK / Gensim
---

# Bag of Words (BoW)

## 📌 One-Line Insight

> **"Word frequencies only."** BoW ignores grammar and word order and keeps only per-word counts. It is the simplest text representation in NLP. Transformers now dominate, but BoW is still relevant as a baseline, for fast classifiers, and where interpretability matters.

## 📖 Core Concepts

### Steps

1. **Tokenize**: split the text into words.
2. **Vocabulary**: collect the set of unique words in the corpus.
3. **Count**: count word frequencies per document.
4. **Vectorize**: store each document as a sparse vector over the vocabulary.

### Characteristics

- **Order-invariant**: "I eat apple" and "apple eat I" map to the same vector.
- **Sparse**: with a 10K-word vocabulary and a 100-word document, at least 99% of the entries are 0.
- **High-dimensional**: the dimension equals the vocabulary size.
- **Fast**: vectorizing is linear in document length.
- **Interpretable**: each feature corresponds to a word.

### Variants

#### Pure BoW (count)

- Raw frequency counts.
- Common words ("the", "a") dominate.

#### TF-IDF

$$tfidf(t, d) = tf(t, d) \cdot \log\frac{N}{df(t)}$$

Here $N$ is the number of documents in the corpus and $df(t)$ is the number of documents that contain term $t$.

- Down-weights words that are common across the corpus.
- Boosts words that are rare in the corpus but frequent in the document.

#### N-gram

- Unigram (1 word).
- Bigram (2 words, e.g. "New York").
- Trigram (3 words).
- → Captures a limited amount of word order.

#### Hashing trick

- No vocabulary is built.
- Each word is hashed to a bucket index.
- Works on streams with bounded memory.
- The cost is hash collisions.

### vs Word Embedding

| Aspect | BoW | Embedding |
|---|---|---|
| Dim | High (vocab) | Low (~300) |
| Sparse | ✓ | ✗ |
| Semantic | ✗ | ✓ |
| Order | ✗ | ✗ (Word2Vec) / ✓ (Transformer) |
| Speed | Fast | Slow |
| Memory | High (vocab-sized; keep it sparse) | Low |
| Interpretable | High | Low |

### Where BoW Is Still Useful

1. **Spam classification**: fast and often accurate enough.
2. **Topic modeling** (LDA): built on BoW counts.
3. **Document retrieval** (BM25): the standard IR baseline.
4. **Quick prototyping**: a transformer is overkill.
5. **Interpretability**: feature importance maps directly to words.
6. **Resource-constrained**: edge / mobile.

## 💻 Patterns

### Scikit-learn CountVectorizer

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['I love NLP', 'NLP is fun', 'I love coding']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# The default tokenizer lowercases and drops single-character tokens such as "I".
print(vectorizer.get_feature_names_out())
# ['coding' 'fun' 'is' 'love' 'nlp']
print(X.toarray())
# [[0 0 0 1 1]
#  [0 1 1 0 1]
#  [1 0 0 1 0]]
```

### TF-IDF

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# `corpus` is a list of raw text strings, as in the CountVectorizer example.
vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),    # unigrams and bigrams
    max_features=10_000,   # keep only the 10K most frequent terms
    min_df=2,              # drop terms seen in fewer than 2 documents
    max_df=0.95,           # drop terms seen in more than 95% of documents
    stop_words='english',
)
X = vectorizer.fit_transform(corpus)
```

### Spam classifier (BoW + Naive Bayes)

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

pipe = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2))),
    ('clf', MultinomialNB()),
])

# X_train / X_test are lists of raw text strings; y_train / y_test are labels.
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))
```

→ A case where a transformer is overkill.
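
### From-scratch BoW and TF-IDF (standard library only)

To make the tokenize → vocabulary → count → vectorize steps and the TF-IDF formula from the core concepts concrete, here is a minimal sketch that uses only the Python standard library. The helper names (`tokenize`, `build_vocabulary`, `count_vector`, `tfidf_matrix`) are hypothetical and exist only for illustration, and the whitespace tokenizer is a deliberate simplification. It implements the unsmoothed textbook formula, so the numbers will not match scikit-learn's `TfidfVectorizer`, which by default applies a smoothed IDF and L2 normalization.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_vocabulary(docs):
    """Map each unique token in the corpus to a column index (sorted for determinism)."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}


def count_vector(doc, vocab):
    """Dense count vector for one document; out-of-vocabulary tokens are ignored."""
    vec = [0] * len(vocab)
    for tok, n in Counter(tokenize(doc)).items():
        if tok in vocab:
            vec[vocab[tok]] = n
    return vec


def tfidf_matrix(docs, vocab):
    """Apply tfidf(t, d) = tf(t, d) * log(N / df(t)) to every document."""
    n_docs = len(docs)
    counts = [count_vector(doc, vocab) for doc in docs]
    # df[j] = number of documents in which term j appears at least once
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    matrix = []
    for row in counts:
        # A term with tf == 0 contributes 0.0, which also avoids dividing by df == 0.
        matrix.append([
            tf * math.log(n_docs / df[j]) if tf else 0.0
            for j, tf in enumerate(row)
        ])
    return matrix


corpus = ["i love nlp", "nlp is fun", "i love coding"]
vocab = build_vocabulary(corpus)
bow = [count_vector(doc, vocab) for doc in corpus]
tfidf = tfidf_matrix(corpus, vocab)

print(vocab)
print(bow)
```

Unlike the `CountVectorizer` output above, the token `i` appears in this vocabulary because the naive tokenizer keeps single-character tokens. Terms that occur in only one document (such as `fun`) get the weight log(3), while terms shared by two documents (such as `nlp`) get the smaller weight log(3/2).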

### BM25 (IR baseline)

```python
import numpy as np
from rank_bm25 import BM25Okapi

# `documents` is assumed to be a list of raw text strings.
corpus = [doc.split() for doc in documents]
bm25 = BM25Okapi(corpus)

query = 'machine learning algorithm'.split()
scores = bm25.get_scores(query)
top_k = np.argsort(scores)[-5:][::-1]  # indices of the 5 best matches, best first
```

### Hashing vectorizer (streaming)

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
model = SGDClassifier()  # assumption: any estimator with partial_fit works here

# No fit step: the vectorizer is stateless, so it works on streams.
# `stream` is assumed to yield (texts, labels) batches; `all_classes` lists every label.
for texts, y in stream:
    X = vectorizer.transform(texts)
    model.partial_fit(X, y, classes=all_classes)
```

### Topic modeling (LDA)

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=5000, stop_words='english')
X = vectorizer.fit_transform(documents)

lda = LatentDirichletAllocation(n_components=10, random_state=42)
lda.fit(X)

# Top 10 words per topic, highest weight first
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    top = [feature_names[i] for i in topic.argsort()[-10:][::-1]]
    print(f'Topic {topic_idx}: {top}')
```

## 🤔 Decision Criteria

| Situation | Approach |
|---|---|
| Fast prototype | TF-IDF + LinearSVC |
| Spam / topic classification | TF-IDF + Naive Bayes |
| Document retrieval | BM25 |
| Topic modeling | BoW + LDA |
| Semantic search | Embedding (NOT BoW) |
| QA / generation | Transformer (NOT BoW) |
| Resource-constrained | Hashing vectorizer |

**Default**: use TF-IDF + LinearSVC as the baseline, then compare its results against a transformer.

## 🔗 Graph

- Parent: [[NLP]] · [[Text-Representation]] · [[Information-Retrieval]]
- Variants: [[TF-IDF]] · [[N-gram]] · [[Hashing-Trick]] · [[BM25]]
- Applications: [[Spam-Classification]] · [[Topic-Modeling]] · [[LDA]] · [[Document-Retrieval]]
- Alternatives: [[Word2Vec]] · [[Sentence-Transformers]] · [[BERT]] · [[Embedding]]
- Adjacent: [[Naive-Bayes]] · [[Linear-SVM]] · [[Stop-Words]] · [[Stemming]]

## 🤖 LLM Usage

**When to use**: baselines, fast classifiers, tasks that need interpretability, IR, topic modeling.

**When not to use**: semantic similarity, generation, long-context understanding, tasks where word order matters.
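
The word-order limitation can be checked directly. The sketch below, using only the standard library and two hypothetical helpers (`ngrams` and `bag`), shows that the two sentences from the order-invariance example produce identical unigram bags, while bigram bags tell them apart. It assumes naive lowercase whitespace tokenization.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag(text, n=1):
    """Bag (multiset) of n-grams for a lowercased, whitespace-tokenized text."""
    return Counter(ngrams(text.lower().split(), n))


a = "I eat apple"
b = "apple eat I"

same_unigrams = bag(a, 1) == bag(b, 1)  # unigram counts discard word order
same_bigrams = bag(a, 2) == bag(b, 2)   # bigrams retain local word order

print(same_unigrams, same_bigrams)  # True False
```

Bigrams recover only adjacent-word order, so tasks that depend on longer-range structure still call for a sequence model.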

## ❌ Anti-patterns

- **No stop-word removal** (small vocabulary): function words add noise.
- **No min_df / max_df**: typos and overly common words dominate.
- **Fitting the vocabulary on test data**: leakage.
- **Converting a high-dimensional sparse matrix to dense**: OOM.
- **Using BoW for tasks where word order matters**: wrong tool.
- **Using BERT for every task**: gives up BoW's speed advantage.

## 🧪 Verification / Duplicates

- Verified (Manning IR, scikit-learn docs).
- Trust level: A.
- Related: [[TF-IDF]] · [[BM25]] · [[Word2Vec]] · [[Naive-Bayes]].

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: TF-IDF + N-gram + BM25 + sklearn code |