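"Word frequencies only." A bag-of-words (BoW) model ignores grammar and word order and keeps only frequency counts. It is the simplest text representation in NLP. Modern transformers dominate, but BoW remains relevant as a baseline, as input to fast classifiers, and where interpretability matters.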
📖 Key points
Steps
Tokenize: text → words.
Vocabulary: the set of unique words in the corpus.
Count: word frequencies per document.
Vectorize: one sparse vector per document.
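The four steps above can be sketched in plain Python. This is a minimal illustration, not a library implementation: the function names (`tokenize`, `build_vocabulary`, `vectorize`) and the naive regex tokenizer are assumptions made for the example, and the vector is dense for readability.

```python
import re
from collections import Counter

def tokenize(text):
    # Step 1: lowercase and keep runs of letters/digits (a deliberately naive tokenizer).
    return re.findall(r"[a-z0-9]+", text.lower())

def build_vocabulary(corpus):
    # Step 2: map each unique word in the corpus to a column index (sorted for stability).
    words = sorted({w for doc in corpus for w in tokenize(doc)})
    return {w: i for i, w in enumerate(words)}

def vectorize(doc, vocab):
    # Steps 3 and 4: count word frequencies, then place them in a vocabulary-sized vector.
    counts = Counter(tokenize(doc))
    vec = [0] * len(vocab)
    for word, n in counts.items():
        if word in vocab:  # words outside the vocabulary are dropped
            vec[vocab[word]] = n
    return vec

corpus = ["I love NLP", "NLP is fun", "I love coding"]
vocab = build_vocabulary(corpus)
vectors = [vectorize(doc, vocab) for doc in corpus]
print(list(vocab))  # ['coding', 'fun', 'i', 'is', 'love', 'nlp']
print(vectors[0])   # [0, 0, 1, 0, 1, 1]
```

Unlike scikit-learn's `CountVectorizer` with default settings, this sketch keeps single-character tokens such as "i", so its vocabulary has six entries instead of five.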
Properties
Order-invariant: "I eat apple" = "apple eat I".
Sparse: with a 10K vocabulary and a 100-word document, at least 99% of entries are 0.
High-dimensional: the number of dimensions equals the vocabulary size.
Fast: linear in document length.
Interpretable: each feature is a word.
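Two of these properties are easy to check directly. The sketch below shows order invariance with a `Counter` and reproduces the sparsity arithmetic from the example figures above; the helper `bag` and the indices in `sparse_doc` are hypothetical.

```python
from collections import Counter

def bag(text):
    # A bag is a multiset of lowercase tokens; token positions are discarded.
    return Counter(text.lower().split())

# Order invariance: both sentences produce the same bag.
same_bag = bag("I eat apple") == bag("apple eat I")

# Sparsity: a document with 100 distinct words in a 10,000-word vocabulary
# fills at most 100 of the 10,000 dimensions.
vocab_size, distinct_words = 10_000, 100
zero_fraction = 1 - distinct_words / vocab_size

# Storing only non-zero entries (index -> count) avoids materialising the zeros.
sparse_doc = {3: 2, 57: 1, 9120: 4}  # hypothetical indices and counts

print(same_bag)                 # True
print(round(zero_fraction, 2))  # 0.99
```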
Variants
Pure BoW (count)
Plain frequency counts.
Common words ("the", "a") dominate the vector.
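TF-IDF
$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log\frac{N}{\mathrm{df}(t)}$$
Here N is the number of documents in the corpus and df(t) is the number of documents that contain term t.
Down-weights words that are common across the corpus.
Boosts words that are rare in the corpus but frequent in the document.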
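A direct transcription of the formula above may help make the weighting concrete. This is a minimal sketch assuming raw counts for tf, the natural logarithm, and whitespace tokenization; the function name `tfidf` and the toy corpus are illustrative.

```python
import math
from collections import Counter

def tfidf(corpus):
    docs = [doc.lower().split() for doc in corpus]
    n_docs = len(docs)
    # df(t): number of documents that contain term t at least once.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term count within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

corpus = ["the cat sat", "the dog sat", "the cat ran"]
w = tfidf(corpus)
# "the" appears in every document, so log(3/3) = 0 removes it entirely.
# "dog" appears in one document only, so it gets the largest weight, log(3).
print(w[1])
```

scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from this unsmoothed version.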
N-gram
Unigram: 1 word.
Bigram: 2 words ("New York").
Trigram: 3 words.
→ Captures a limited amount of word order.
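N-gram extraction is a sliding window over the token list. The helper `ngrams` below is a hypothetical name used for illustration; each resulting n-gram then becomes one feature in the vocabulary, just as a single word would.

```python
def ngrams(tokens, n):
    # Slide a window of width n over the token list and join each window.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I love New York".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)   # ['I love', 'love New', 'New York']
print(trigrams)  # ['I love New', 'love New York']
```

In scikit-learn the same effect is available through the `ngram_range` parameter of `CountVectorizer`, for example `ngram_range=(1, 2)` for unigrams plus bigrams.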
Hashing trick
No vocabulary is built.
Word → hash → bucket.
Works for streaming data with bounded memory.
The cost is hash collisions.
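The word → hash → bucket mapping can be sketched without any vocabulary. This example assumes MD5 purely as a convenient stable hash and a tiny bucket count of 16 so collisions are plausible; `bucket` and `hash_vectorize` are illustrative names, not a library API.

```python
import hashlib

def bucket(word, n_buckets):
    # Use a stable hash: Python's built-in hash() for str is salted per process.
    digest = hashlib.md5(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets

def hash_vectorize(text, n_buckets=16):
    # The vector size is fixed up front, so no pass over the corpus is needed.
    vec = [0] * n_buckets
    for word in text.lower().split():
        vec[bucket(word, n_buckets)] += 1  # colliding words share a bucket
    return vec

v = hash_vectorize("spam spam eggs")
print(len(v), sum(v))  # 16 3
```

Because the mapping is one-way, a bucket index cannot be turned back into a word, which is part of the interpretability cost of hashing. scikit-learn's `HashingVectorizer` follows the same idea with a different hash function (MurmurHash3) and far more buckets.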
vs Word Embedding
| Aspect | BoW | Embedding |
|---|---|---|
| Dim | High (vocab) | Low (~300) |
| Sparse | ✓ | ✗ |
| Semantic | ✗ | ✓ |
| Order | ✗ | ✗ (Word2Vec) / ✓ (Transformer) |
| Speed | Fast | Slow |
| Memory | High | Low |
| Interpretable | High | Low |
Where BoW is still useful
Spam classification: fast and accurate.
Topic modeling (LDA): built on BoW counts.
Document retrieval (BM25): the standard IR baseline.
Quick prototyping: when a transformer is overkill.
Interpretability: feature importances map directly to words.
Resource-constrained settings: edge and mobile devices.
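BM25, mentioned above as the IR baseline, is itself a bag-of-words scoring function. The sketch below assumes the common Okapi BM25 form with the "+1 inside the log" IDF variant and the frequently used defaults `k1=1.5`, `b=0.75`; the function name, tokenization, and toy corpus are illustrative.

```python
import math
from collections import Counter

def bm25_scores(query, corpus, k1=1.5, b=0.75):
    docs = [doc.lower().split() for doc in corpus]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # df(t): number of documents containing term t.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term frequency saturates via k1; b controls length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

corpus = ["the cat sat on the mat", "dogs chase cats", "the stock market fell"]
scores = bm25_scores("cat mat", corpus)
best = max(range(len(corpus)), key=scores.__getitem__)
print(best)  # 0
```

The second document scores zero even though it mentions "cats": exact-token matching with no semantic generalization is the BoW limitation noted in the comparison with embeddings.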
💻 Patterns
Scikit-learn CountVectorizer
```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['I love NLP', 'NLP is fun', 'I love coding']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# The default tokenizer drops single-character tokens, so "I" is not a feature.
print(vectorizer.get_feature_names_out())
# ['coding' 'fun' 'is' 'love' 'nlp']
print(X.toarray())
# [[0 0 0 1 1]
#  [0 1 1 0 1]
#  [1 0 0 1 0]]
```
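Scikit-learn HashingVectorizer (streaming)
```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
model = SGDClassifier()  # any estimator that supports partial_fit works

# No fit step is needed, so batches can be streamed.
# Assumed inputs: `stream` yields (texts, labels) batches and
# `classes` lists every label that can occur.
for texts, y in stream:
    X = vectorizer.transform(texts)
    model.partial_fit(X, y, classes=classes)
```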
Topic modeling (LDA)
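```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# `documents` is assumed to be a list of raw text strings.
vectorizer = CountVectorizer(max_features=5000, stop_words='english')
X = vectorizer.fit_transform(documents)

lda = LatentDirichletAllocation(n_components=10, random_state=42)
lda.fit(X)

# Top 10 words per topic, highest weight first.
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    top = [feature_names[i] for i in topic.argsort()[-10:][::-1]]
    print(f'Topic {topic_idx}: {top}')
```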
🤔 Decision criteria
| Situation | Approach |
|---|---|
| Fast prototype | TF-IDF + LinearSVC |
| Spam / topic classification | TF-IDF + Naive Bayes |
| Document retrieval | BM25 |
| Topic modeling | BoW + LDA |
| Semantic search | Embedding (NOT BoW) |
| QA / generation | Transformer (NOT BoW) |
| Resource-constrained | Hashing vectorizer |
Default: start with TF-IDF + LinearSVC as the baseline, then compare its results against a transformer.
When to use: baselines, fast classifiers, cases that need interpretability, IR, topic modeling.
When not to use: semantic similarity, generation, long-context understanding, tasks where word order matters.
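The default baseline named above (TF-IDF + LinearSVC) fits in a few lines of scikit-learn. This is a minimal sketch on a made-up four-document training set; the texts and labels are assumptions for illustration, and a real baseline would need a proper train/test split.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy data, invented for this example.
train_texts = [
    "win free money now",
    "claim your free prize now",
    "meeting at noon tomorrow",
    "lunch meeting with the team",
]
train_labels = ["spam", "spam", "ham", "ham"]

# The pipeline keeps vectorizer and classifier together, so the same
# vocabulary and IDF weights are applied at training and prediction time.
baseline = make_pipeline(TfidfVectorizer(), LinearSVC(random_state=0))
baseline.fit(train_texts, train_labels)

predictions = baseline.predict(["free money prize", "team meeting tomorrow"])
print(list(predictions))
```

Because each learned coefficient corresponds to one word, the largest positive and negative weights in the fitted `LinearSVC` can be read off as the most indicative words for each class.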