---
id: wiki-2026-0508-bag-of-words-bow
title: Bag of Words (BoW)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [BoW, 단어 가방, count vectorizer, TF-IDF, n-gram]
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: [nlp, text-representation, bow, tfidf, ngram, baseline, classical-ml, sparse-vector]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: scikit-learn / NLTK / Gensim
---

# Bag of Words (BoW)

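## 📌 One-Line Insight
> **"Word frequencies only."** BoW ignores grammar and word order and keeps only per-word counts. It is the simplest text representation in NLP. Transformers now dominate, but BoW is still relevant as a baseline, as input to fast classifiers, and where interpretability matters.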

## 📖 Core

### Steps
1. **Tokenize**: split each text into words.
2. **Vocabulary**: collect the set of unique words across the corpus.
3. **Count**: count word frequencies per document.
4. **Vectorize**: emit one sparse vector per document.
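
The four steps above can be sketched in plain Python. This is a minimal illustration, not how scikit-learn implements it: the whitespace tokenizer and the helper names `tokenize`, `build_vocabulary`, and `vectorize` are assumptions made for this example.

```python
from collections import Counter

def tokenize(text):
    """Step 1: lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def build_vocabulary(corpus):
    """Step 2: map each unique token to a column index, in sorted order."""
    tokens = {tok for doc in corpus for tok in tokenize(doc)}
    return {tok: i for i, tok in enumerate(sorted(tokens))}

def vectorize(doc, vocab):
    """Steps 3-4: count tokens and return a sparse {column index: count} mapping."""
    counts = Counter(tokenize(doc))
    return {vocab[tok]: n for tok, n in counts.items() if tok in vocab}

corpus = ["I love NLP", "NLP is fun", "I love coding"]
vocab = build_vocabulary(corpus)
vectors = [vectorize(doc, vocab) for doc in corpus]

print(vocab)       # token -> column index
print(vectors[0])  # only the non-zero columns of the first document
```

Storing only the non-zero columns is what keeps the representation cheap even when the vocabulary is large.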

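### Properties
- **Order-invariant**: "I eat apple" and "apple eat I" map to the same vector.
- **Sparse**: with a 10K-word vocabulary and a 100-word document, at least 99% of entries are 0.
- **High-dimensional**: the vector dimension equals the vocabulary size.
- **Fast**: vectorizing is linear in document length.
- **Interpretable**: every feature is a word.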
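
Two of these properties are easy to check directly. The sketch below assumes a naive whitespace tokenizer (the helper `bow` is hypothetical) and reuses the 10K-vocabulary, 100-word figures from the list above.

```python
from collections import Counter

def bow(text):
    """Bag of words as a token -> count mapping (naive whitespace tokenizer)."""
    return Counter(text.lower().split())

# Order invariance: a reordered sentence yields the same bag.
same = bow("I eat apple") == bow("apple eat I")

# Sparsity: a 100-word document touches at most 100 of 10,000 columns.
vocab_size = 10_000
doc_length = 100
max_nonzero = min(doc_length, vocab_size)
min_sparsity = 1 - max_nonzero / vocab_size

print(same)
print(f"at least {min_sparsity:.0%} of entries are zero")
```

Repeated words only lower the number of non-zero columns, so 99% is a lower bound on sparsity for this case.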

### Variants

#### Pure BoW (count)
- Raw term frequency.
- Common words ("the", "a") dominate the vector.

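#### TF-IDF
$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log\frac{N}{\mathrm{df}(t)}$$
- $N$ is the number of documents; $\mathrm{df}(t)$ is the number of documents that contain term $t$.
- Down-weights terms that are common across the corpus.
- Boosts terms that are rare in the corpus but frequent in the document.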
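
A worked example of the formula on a three-document toy corpus, using the natural logarithm. The corpus and the helper `tfidf` are made up for illustration. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2 normalization by default, so its values will differ from this textbook version.

```python
import math
from collections import Counter

docs = [
    "the cat sat".split(),
    "the dog sat".split(),
    "the cat ran".split(),
]
N = len(docs)

# df(t): number of documents that contain term t at least once
df = Counter(term for doc in docs for term in set(doc))

def tfidf(term, doc):
    """Textbook TF-IDF: raw term count times log(N / df)."""
    tf = doc.count(term)
    return tf * math.log(N / df[term])

the = tfidf("the", docs[0])  # in every document: log(3/3) = 0
cat = tfidf("cat", docs[0])  # in 2 of 3 documents: log(3/2)
dog = tfidf("dog", docs[1])  # in 1 of 3 documents: log(3/1)

print(the, cat, dog)
```

The word that appears everywhere scores zero, while the rarest word scores highest, which is the down-weighting and boosting described in the bullets.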

#### N-gram
- Unigram: 1 word.
- Bigram: 2 consecutive words ("New York").
- Trigram: 3 consecutive words.
- → Captures a limited amount of word order.
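
N-gram extraction is a sliding window over the token list. A minimal sketch, where the helper name `ngrams` and the sample sentence are assumptions for this example:

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I moved to New York".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
# ['I moved', 'moved to', 'to New', 'New York']
```

Each n-gram becomes one feature, so "New York" is counted as a unit, at the cost of a much larger vocabulary.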

#### Hashing trick
- No vocabulary is built.
- Each word is hashed to a bucket index.
- Works for streaming data with bounded memory.
- The cost is hash collisions.
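
A minimal sketch of the idea, assuming `zlib.crc32` as the hash and a tiny bucket count so that collisions are easy to observe. The helper `hashed_bow` is hypothetical; scikit-learn's `HashingVectorizer` uses a different hash function (MurmurHash3) and far more buckets.

```python
import zlib

def hashed_bow(text, n_buckets=16):
    """Count tokens into a fixed-size vector without building a vocabulary."""
    vec = [0] * n_buckets
    for tok in text.lower().split():
        # crc32 is stable across runs; Python's built-in hash() is salted per process for str.
        bucket = zlib.crc32(tok.encode("utf-8")) % n_buckets
        vec[bucket] += 1
    return vec

vec = hashed_bow("the quick brown fox jumps over the lazy dog")
print(len(vec), sum(vec))  # vector length is fixed; counts sum to the token count
```

The vector length never grows with new words, which is what makes streaming possible. Two different words that land in the same bucket become indistinguishable, and the mapping cannot be inverted back to words.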

### vs Word Embedding
| Aspect | BoW | Embedding |
|---|---|---|
| Dim | High (vocab) | Low (~300) |
| Sparse | ✓ | ✗ |
| Semantic | ✗ | ✓ |
| Order | ✗ | ✗ (Word2Vec) / ✓ (Transformer) |
| Speed | Fast | Slow |
| Memory | High if densified (modest as a sparse matrix) | Low per vector |
| Interpretable | High | Low |
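
The "Semantic" row can be illustrated with cosine similarity on raw counts. In this sketch (the helpers `bow` and `cosine` are assumptions), BoW sees similarity only where the surface tokens overlap, so two synonyms score zero.

```python
import math
from collections import Counter

def bow(text):
    """Bag of words as a token -> count mapping (naive whitespace tokenizer)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count mappings."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

synonyms = cosine(bow("car"), bow("automobile"))  # no shared token
overlap = cosine(bow("fast car"), bow("red car"))  # one shared token out of two

print(synonyms, overlap)
```

A dense embedding model would be expected to place "car" and "automobile" close together, which is the gap the table summarizes.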

### Where it is still useful
1. **Spam classification**: fast and often accurate enough.
2. **Topic modeling** (LDA): built on BoW counts.
3. **Document retrieval** (BM25): the standard IR baseline.
4. **Quick prototyping**: when a transformer is overkill.
5. **Interpretability**: feature importances map directly to words.
6. **Resource-constrained settings**: edge and mobile devices.

## 💻 Patterns

### Scikit-learn CountVectorizer
```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['I love NLP', 'NLP is fun', 'I love coding']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# The default tokenizer lowercases text and drops one-character tokens such as "I".
print(vectorizer.get_feature_names_out())
# ['coding' 'fun' 'is' 'love' 'nlp']
print(X.toarray())
# [[0 0 0 1 1]
#  [0 1 1 0 1]
#  [1 0 0 1 0]]
```

### TF-IDF
```python
from sklearn.feature_extraction.text import TfidfVectorizer

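vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),    # unigrams and bigrams
    max_features=10_000,   # keep the 10K most frequent terms
    min_df=2,              # drop terms found in fewer than 2 documents
    max_df=0.95,           # drop terms found in more than 95% of documents
    stop_words='english',  # built-in English stop word list
)
X = vectorizer.fit_transform(corpus)  # corpus: a list of raw text strings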
```

### Spam classifier (BoW + Naive Bayes)
```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

pipe = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2))),
    ('clf', MultinomialNB()),
])
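# X_train / X_test are lists of raw message strings; y_train / y_test are their labels.
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))  # mean accuracy on the held-out set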
```

→ A case where a transformer is overkill.

### BM25 (IR baseline)
```python
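import numpy as np
from rank_bm25 import BM25Okapi

# documents: a list of raw text strings, assumed to be defined elsewhere.
# Lowercase the documents so they match the lowercase query tokens.
tokenized_corpus = [doc.lower().split() for doc in documents]
bm25 = BM25Okapi(tokenized_corpus)
query = 'machine learning algorithm'.split()
scores = bm25.get_scores(query)
top_k = np.argsort(scores)[-5:][::-1]  # indices of the 5 best-scoring documents, best first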
```

### Hashing vectorizer (streaming)
```python
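from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
model = SGDClassifier()
classes = [0, 1]  # partial_fit needs every label up front (binary labels assumed here)

# No fit step: the vectorizer is stateless, so batches can be streamed.
# stream is assumed to yield (texts, labels) batches.
for texts, y in stream:
    X = vectorizer.transform(texts)
    model.partial_fit(X, y, classes=classes)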
```

### Topic modeling (LDA)
```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# documents: a list of raw text strings, assumed to be defined elsewhere.
vectorizer = CountVectorizer(max_features=5000, stop_words='english')
X = vectorizer.fit_transform(documents)
lda = LatentDirichletAllocation(n_components=10, random_state=42)
lda.fit(X)

# Top 10 words per topic, highest weight first
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    top = [feature_names[i] for i in topic.argsort()[-10:][::-1]]
    print(f'Topic {topic_idx}: {top}')
```

## 🤔 Decision Criteria
| Situation | Approach |
|---|---|
| Fast prototype | TF-IDF + LinearSVC |
| Spam / topic classification | TF-IDF + Naive Bayes |
| Document retrieval | BM25 |
| Topic modeling | BoW + LDA |
| Semantic search | Embedding (NOT BoW) |
| QA / generation | Transformer (NOT BoW) |
| Resource-constrained | Hashing vectorizer |

**Default**: start from a TF-IDF + LinearSVC baseline, then compare its results against a transformer.
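
A minimal sketch of that default baseline on a made-up six-message dataset. The texts, labels, and variable names are illustrative assumptions; a real comparison needs a proper train/test split and far more data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy data, invented for this example: 1 = spam, 0 = ham
train_texts = [
    "win a free prize now",
    "free money click here",
    "claim your free reward",
    "meeting at noon tomorrow",
    "lunch with the team today",
    "project report due friday",
]
train_labels = [1, 1, 1, 0, 0, 0]

# The pipeline fits the vocabulary and the classifier together on training data only.
baseline = make_pipeline(TfidfVectorizer(), LinearSVC())
baseline.fit(train_texts, train_labels)

predictions = baseline.predict(["free prize money", "team meeting friday"])
print(predictions)
```

Whatever score this baseline reaches on held-out data is the number a transformer has to beat to justify its extra cost.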

## 🔗 Graph
- Parent: [[NLP]] · [[Text-Representation]] · [[Information-Retrieval]]
- Variants: [[TF-IDF]] · [[N-gram]] · [[Hashing-Trick]] · [[BM25]]
- Applications: [[Spam-Classification]] · [[Topic-Modeling]] · [[LDA]] · [[Document-Retrieval]]
- Alternatives: [[Word2Vec]] · [[Sentence-Transformers]] · [[BERT]] · [[Embedding]]
- Adjacent: [[Naive-Bayes]] · [[Linear-SVM]] · [[Stop-Words]] · [[Stemming]]

## 🤖 LLM Usage
**When**: baselines, fast classifiers, cases that need interpretability, IR, topic modeling.
**When not**: semantic similarity, generation, long-context understanding, tasks where word order matters.

## ❌ Anti-patterns
- **No stop word removal** (small vocabulary): stop words add noise.
- **No min_df / max_df**: typos and very common terms dominate.
- **Fitting the vocabulary on test data**: leakage.
- **Converting a high-dimensional matrix to dense**: OOM.
- **BoW for tasks where word order matters**: wrong tool.
- **BERT for every task**: gives up BoW's speed.

## 🧪 Verification / Duplicates
- Verified (Manning IR, scikit-learn docs).
- Trust level A.
- Related: [[TF-IDF]] · [[BM25]] · [[Word2Vec]] · [[Naive-Bayes]].

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: TF-IDF + N-gram + BM25 + sklearn code |