[G1-Sync] Manual knowledge update

2026-05-10 22:08:15 +09:00
parent 21ac3ed255
commit 504fd5fb42
3011 changed files with 380280 additions and 206977 deletions
@@ -2,93 +2,206 @@
 id: wiki-2026-0508-bag-of-words-bow
 title: Bag of Words (BoW)
 category: 10_Wiki/Topics
-status: needs_review
+status: verified
 canonical_id: self
-aliases: [P-Reinforce-AUTO-BOW-001]
+aliases: [BoW, 단어 가방, count vectorizer, TF-IDF, n-gram]
 duplicate_of: none
 source_trust_level: A
 confidence_score: 0.95
-tags: [auto-reinforced, bag-of-words, nlp, Text-Mining, feature-extraction, classic-ai]
+verification_status: applied
+tags: [nlp, text-representation, bow, tfidf, ngram, baseline, classical-ml, sparse-vector]
 raw_sources: []
-last_reinforced: 2026-04-20
+last_reinforced: 2026-05-10
 github_commit: pending
-inferred_by: Claude Opus 4.7 (auto-normalize 2026-05-08)
 tech_stack:
-  language: unspecified
-  framework: unspecified
+  language: Python
+  framework: scikit-learn / NLTK / Gensim
 ---

-# [[Bag of Words (BoW)|Bag of Words (BoW)]]
+# Bag of Words (BoW)

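-## 📌 One-Line Insight (The Karpathy Summary)
-> "A bag of words: it ignores grammar and word order entirely, counts only how many times each word appears, and turns text into a bundle of numbers. The simplest yet powerful foundation of language processing."
+## 📌 One-Line Insight
+> **"Word frequencies only."** BoW ignores grammar and word order and keeps only frequency counts. It is the simplest text representation in NLP. Transformers now dominate, but BoW is still relevant as a baseline, for fast classifiers, and for interpretability.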

-## 📖 Structured Knowledge (Synthesized Content)
-Bag of Words (BoW) is a representation technique that converts text data into numeric vectors that machine learning algorithms can process.
+## 📖 Core Concepts

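-1.  **Implementation steps**:
-    *   **Vocabulary construction**: build the list of all unique words that appear in the whole dataset.
-    *   **Counting**: record how many times each word appears in a given document.
-2.  **Characteristics**:
-    *   **Loss of Order**: treats "I eat apple" and "Apple eat I" as identical, which is a limitation.
-    *   **Sparse Vector**: the vocabulary is large but a single sentence uses few words, so the result is a huge matrix whose values are mostly 0.
-3.  **Extensions**:
-    *   **TF-IDF**: rather than using raw frequency alone, it lowers the score of common words (The, A, etc.) so that key words stand out.
+### Steps
+1. **Tokenize**: split the text into words.
+2. **Vocabulary**: collect the set of unique words in the corpus.
+3. **Count**: count word frequencies in each document.
+4. **Vectorize**: represent each document as a sparse vector.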
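
The four steps above can be sketched in plain Python. This is an illustrative sketch, not the note's reference implementation: the helper names (`tokenize`, `build_vocabulary`, `vectorize`) are hypothetical, and lowercasing plus whitespace splitting is an assumed stand-in for a real tokenizer.

```python
from collections import Counter

def tokenize(text):
    # Step 1. Assumption: lowercase + whitespace split stands in for a real tokenizer.
    return text.lower().split()

def build_vocabulary(corpus):
    # Step 2. Sorted list of unique tokens; a token's position is its vector index.
    return sorted({tok for doc in corpus for tok in tokenize(doc)})

def vectorize(doc, vocabulary):
    # Steps 3 and 4. Count tokens and store only nonzero entries
    # as {vocabulary index: count}.
    index = {tok: i for i, tok in enumerate(vocabulary)}
    counts = Counter(tokenize(doc))
    return {index[tok]: n for tok, n in counts.items() if tok in index}

corpus = ["I love NLP", "NLP is fun", "I love coding"]
vocabulary = build_vocabulary(corpus)
vectors = [vectorize(doc, vocabulary) for doc in corpus]

print(vocabulary)  # ['coding', 'fun', 'i', 'is', 'love', 'nlp']
print(vectors[0])  # {2: 1, 4: 1, 5: 1}
```

Unlike scikit-learn's default token pattern, this sketch keeps one-letter tokens such as "i".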

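-## ⚠️ Contradictions & Updates
- **Conflict with earlier data**: BoW was once the mainstream approach in natural language processing, but it has since been replaced by word embedding and attention methods that preserve word order and relationships (context) (RL Update).
- **Policy change (RL Update)**: for very lightweight spam classification systems or early-stage data exploration, BoW is still preferred in practice as the economical choice because its compute cost is extremely low.
+### Characteristics
+- **Order-invariant**: "I eat apple" = "apple eat I".
+- **Sparse**: with a 10K-word vocabulary and a 100-word document, at least 99% of the entries are 0.
+- **High-dimensional**: the vector dimension equals the vocabulary size.
+- **Fast**: vectorization is linear in document length.
+- **Interpretable**: each feature corresponds to a word.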
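
Two of these characteristics are easy to check directly. The sketch below assumes whitespace tokenization and uses a hypothetical helper `bow`; the sparsity figure just replays the 10K-vocabulary, 100-word arithmetic from the list.

```python
from collections import Counter

def bow(text):
    # Hypothetical helper: word counts after lowercasing and whitespace splitting.
    return Counter(text.lower().split())

# Order invariance: both sentences produce identical counts.
same = bow("I eat apple") == bow("apple eat I")
print(same)  # True

# Sparsity: a 100-word document can touch at most 100 of 10,000 dimensions.
vocab_size = 10_000
doc_length = 100
max_nonzero = min(doc_length, vocab_size)
min_zero_fraction = 1 - max_nonzero / vocab_size
print(f"{min_zero_fraction:.2%}")  # 99.00%
```

The zero fraction is a lower bound: repeated words in the document leave even more entries at 0.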

-## 🔗 Knowledge Links (Graph)
- Natural Language [[Processing|Processing]] (NLP), [[Word-Representation|Word-Representation]], [[Attention Mechanisms|Attention Mechanisms]], Pattern Recognition, [[Technical-Architecture|Technical-Architecture]]
- **Modern Tech/Tools**: Scikit-learn CountVectorizer, NLTK, Gensim.
---
+### Variants

-## 🤖 LLM Usage Hints (How to Use This Knowledge)
+#### Pure BoW (count)
+- Raw term frequency.
+- Common words ("the", "a") dominate the counts.

-**When to use this knowledge:**
- *(TODO)*
+#### TF-IDF
+$$tfidf(t, d) = tf(t, d) \cdot \log\frac{N}{df(t)}$$
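+- Down-weights terms that are common across the corpus ($N$ is the number of documents, $df(t)$ the number of documents that contain $t$).
+- Boosts terms that are rare in the corpus but frequent in the document.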
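
The formula can be evaluated by hand on a toy corpus. This is a minimal sketch of the unsmoothed formula exactly as written above; the function name `tfidf` and the three-sentence corpus are made up for illustration. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2 normalization by default, so its numbers will differ.

```python
import math
from collections import Counter

def tfidf(term, doc_tokens, corpus_tokens):
    # tf(t, d): raw count of the term in this document.
    tf = Counter(doc_tokens)[term]
    # df(t): number of documents that contain the term.
    df = sum(1 for doc in corpus_tokens if term in doc)
    if df == 0:
        return 0.0  # term never seen in the corpus
    # N: number of documents in the corpus.
    return tf * math.log(len(corpus_tokens) / df)

# Toy corpus (illustrative only).
corpus_tokens = [s.split() for s in ["the cat sat", "the dog sat", "the cat ran"]]

print(tfidf("the", corpus_tokens[0], corpus_tokens))  # 0.0, appears in every document
print(tfidf("cat", corpus_tokens[0], corpus_tokens))  # log(3/2), about 0.405
print(tfidf("dog", corpus_tokens[1], corpus_tokens))  # log(3), about 1.099
```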

-**When not to use it:**
- *(TODO)*
+#### N-gram
+- Unigram (1 word).
+- Bigram (2 words: "New York").
+- Trigram (3 words).
+- → Captures a limited amount of word order.
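
A short sketch of n-gram extraction, assuming whitespace tokenization; `ngrams` is a hypothetical helper name. The last lines show how bigram counts separate two sentences that plain unigram counts treat as identical.

```python
from collections import Counter

def ngrams(tokens, n):
    # Slide a window of n tokens over the sequence and join each window.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "new york is big".split()
print(ngrams(tokens, 1))  # ['new', 'york', 'is', 'big']
print(ngrams(tokens, 2))  # ['new york', 'york is', 'is big']
print(ngrams(tokens, 3))  # ['new york is', 'york is big']

# Bigrams recover some word order that unigrams lose.
a = "i eat apple".split()
b = "apple eat i".split()
print(Counter(ngrams(a, 1)) == Counter(ngrams(b, 1)))  # True
print(Counter(ngrams(a, 2)) == Counter(ngrams(b, 2)))  # False
```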

-## 🧪 Validation Status (Validation)
+#### Hashing trick
+- No vocabulary needs to be built.
+- Each word is hashed to a bucket index.
+- Works for streaming data with bounded memory.
+- The cost is hash collisions.
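
The idea can be sketched with the standard library. `zlib.crc32` is an assumed stand-in hash chosen because it is deterministic across runs (scikit-learn's `HashingVectorizer` uses MurmurHash3 instead), and `hashed_bow` is a hypothetical helper name.

```python
import zlib

def hashed_bow(text, n_buckets):
    # Fixed-size count vector; no vocabulary is stored anywhere.
    vec = [0] * n_buckets
    for tok in text.lower().split():
        # Assumption: crc32 stands in for the hash function.
        bucket = zlib.crc32(tok.encode("utf-8")) % n_buckets
        vec[bucket] += 1
    return vec

vec = hashed_bow("spam spam offer now", n_buckets=16)
print(len(vec), sum(vec))  # 16 4

# With too few buckets, different words collide and their counts merge.
print(hashed_bow("spam offer now", n_buckets=1))  # [3]
```

The vector length is fixed up front, which is what keeps memory bounded on a stream. Fewer buckets save memory but merge more words into the same feature.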

- **Information status:** needs_review
- **Source trust level:** A
- **Review reason:** *(P-Reinforce Phase 1 automatic normalization. Body needs verification.)*
+### vs Word Embedding
+| Aspect | BoW | Embedding |
+|---|---|---|
+| Dim | High (vocab) | Low (~300) |
+| Sparse | ✓ | ✗ |
+| Semantic | ✗ | ✓ |
+| Order | ✗ | ✗ (Word2Vec) / ✓ (Transformer) |
+| Speed | Fast | Slow |
+| Memory | High | Low |
+| Interpretable | High | Low |

-## 🧬 Duplicate Check
+### Where BoW is still useful
+1. **Spam classification**: fast and accurate.
+2. **Topic modeling** (LDA): built on BoW counts.
+3. **Document retrieval** (BM25): the standard IR baseline.
+4. **Quick prototyping**: when a transformer is overkill.
+5. **Interpretability**: per-word feature importance.
+6. **Resource-constrained**: edge and mobile devices.

- **Existing similar documents:** *(TODO: see the indexer cluster report)*
- **Handling:** UPDATE (automatic normalization)
- **Reason:** Phase 1 normalization: fill in old template / missing fields.
+## 💻 Patterns

-## 🕓 Changelog
+### Scikit-learn CountVectorizer
+```python
+from sklearn.feature_extraction.text import CountVectorizer

-| Date | Change | Handling | Trust level |
-|------|-----------|-----------|--------|
-| 2026-05-08 | P-Reinforce Phase 1 normalization (frontmatter + header standardization) | UPDATE | A |
+corpus = ['I love NLP', 'NLP is fun', 'I love coding']
+vectorizer = CountVectorizer()
+X = vectorizer.fit_transform(corpus)

-## 💻 Code Patterns
-
-**Pattern 1:** *(TODO: structural skeleton reflecting this project's conventions)*
-
-```text
-# TODO
+print(vectorizer.get_feature_names_out())
+# ['coding' 'fun' 'is' 'love' 'nlp']  (the default token pattern drops the one-letter "I")
+print(X.toarray())
+# [[0 0 0 1 1]
+#  [0 1 1 0 1]
+#  [1 0 0 1 0]]
 ```

-## 🤔 Decision Criteria
+### TF-IDF
+```python
+from sklearn.feature_extraction.text import TfidfVectorizer

-**When to choose option A:**
- *(TODO)*
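+vectorizer = TfidfVectorizer(
+    ngram_range=(1, 2),    # unigrams and bigrams
+    max_features=10_000,   # keep the 10,000 most frequent terms
+    min_df=2,              # drop terms found in fewer than 2 documents
+    max_df=0.95,           # drop terms found in more than 95% of documents
+    stop_words='english',  # remove built-in English stop words
+)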
+X = vectorizer.fit_transform(corpus)
+```

-**When to choose option B:**
- *(TODO)*
+### Spam classifier (BoW + Naive Bayes)
+```python
+from sklearn.pipeline import Pipeline
+from sklearn.feature_extraction.text import TfidfVectorizer
+from sklearn.naive_bayes import MultinomialNB

-**Default:**
-> *(TODO)*
+pipe = Pipeline([
+    ('tfidf', TfidfVectorizer(ngram_range=(1, 2))),
+    ('clf', MultinomialNB()),
+])
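+# X_train / X_test are lists of raw message strings; y_train / y_test are their labels
+pipe.fit(X_train, y_train)
+print(pipe.score(X_test, y_test))  # mean accuracy on the test set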
+```

-## ❌ Anti-Patterns
+→ A case where a transformer would be overkill.

- **[Anti-pattern]:** *(TODO: what not to do + why + what to do instead)*
+### BM25 (modern IR)
+```python
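+import numpy as np
+from rank_bm25 import BM25Okapi

+# `documents` is a list of raw text strings; BM25Okapi expects tokenized documents
+corpus = [doc.split() for doc in documents]
+bm25 = BM25Okapi(corpus)
+query = 'machine learning algorithm'.split()
+scores = bm25.get_scores(query)
+top_k = np.argsort(scores)[-5:][::-1]  # indices of the 5 best-scoring documents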
+```
+
+### Hashing vectorizer (streaming)
+```python
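+from sklearn.feature_extraction.text import HashingVectorizer
+from sklearn.linear_model import SGDClassifier

+vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
+model = SGDClassifier()  # any estimator that supports partial_fit
+# No fit step: the vectorizer is stateless, so it works on streams.
+# `stream` is assumed to yield (texts, labels) batches with labels in {0, 1}.
+for texts, y in stream:
+    X = vectorizer.transform(texts)
+    model.partial_fit(X, y, classes=[0, 1])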
+```
+
+### Topic modeling (LDA)
+```python
+from sklearn.decomposition import LatentDirichletAllocation
+from sklearn.feature_extraction.text import CountVectorizer

+# `documents` is a list of raw text strings
+vectorizer = CountVectorizer(max_features=5000, stop_words='english')
+X = vectorizer.fit_transform(documents)
+lda = LatentDirichletAllocation(n_components=10, random_state=42)
+lda.fit(X)

+# Top 10 words per topic, highest weight first
+feature_names = vectorizer.get_feature_names_out()
+for topic_idx, topic in enumerate(lda.components_):
+    top = [feature_names[i] for i in topic.argsort()[-10:][::-1]]
+    print(f'Topic {topic_idx}: {top}')
+```
+
+## 🤔 Decision Criteria
+| Situation | Approach |
+|---|---|
+| Fast prototype | TF-IDF + LinearSVC |
+| Spam / topic classification | TF-IDF + Naive Bayes |
+| Document retrieval | BM25 |
+| Topic modeling | BoW + LDA |
+| Semantic search | Embedding (NOT BoW) |
+| QA / generation | Transformer (NOT BoW) |
+| Resource-constrained | Hashing vectorizer |
+
+**Default**: start with TF-IDF + LinearSVC as the baseline, then compare its results against a transformer.
+
+## 🔗 Graph
+- Parents: [[NLP]] · [[Text-Representation]] · [[Information-Retrieval]]
+- Variants: [[TF-IDF]] · [[N-gram]] · [[Hashing-Trick]] · [[BM25]]
+- Applications: [[Spam-Classification]] · [[Topic-Modeling]] · [[LDA]] · [[Document-Retrieval]]
+- Alternatives: [[Word2Vec]] · [[Sentence-Transformers]] · [[BERT]] · [[Embedding]]
+- Adjacent: [[Naive-Bayes]] · [[Linear-SVM]] · [[Stop-Words]] · [[Stemming]]
+
+## 🤖 LLM Usage
+**When to use**: baselines, fast classifiers, tasks that need interpretability, IR, topic modeling.
+**When not to use**: semantic similarity, generation, long-context understanding, tasks where word order matters.
+
+## ❌ Anti-Patterns
+- **No stop word removal** (small vocab): stop words add noise.
+- **No min_df / max_df**: typos and very common words dominate.
+- **Fitting the vocabulary on test data**: leakage.
+- **Dense conversion of high-dimensional vectors**: OOM.
+- **BoW for tasks where word order matters**: wrong tool.
+- **BERT for every task**: gives up BoW's speed.
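
The dense-conversion anti-pattern comes down to simple arithmetic. The sketch below uses assumed, illustrative sizes (100,000 documents, a 10,000-word vocabulary, about 100 distinct words per document) and assumes 8-byte float values with 4-byte indices for a CSR-style sparse layout.

```python
# Illustrative sizes; all three numbers are assumptions.
n_docs = 100_000
vocab_size = 10_000
nonzero_per_doc = 100

# Dense: one 8-byte float per cell, zeros included.
dense_bytes = n_docs * vocab_size * 8

# CSR-style sparse: an 8-byte value and a 4-byte column index per nonzero,
# plus one 4-byte row pointer per row boundary.
sparse_bytes = n_docs * nonzero_per_doc * (8 + 4) + (n_docs + 1) * 4

print(f"dense:  {dense_bytes / 1e9:.1f} GB")   # dense:  8.0 GB
print(f"sparse: {sparse_bytes / 1e6:.1f} MB")  # sparse: 120.4 MB
```

Under these assumptions the dense matrix is roughly 66 times larger, which is why calling `.toarray()` on a large BoW matrix tends to run out of memory.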
+
+## 🧪 Validation / Duplicates
+- Verified (Manning IR, scikit-learn docs).
+- Trust level A.
+- Related: [[TF-IDF]] · [[BM25]] · [[Word2Vec]] · [[Naive-Bayes]].
+
+## 🕓 Changelog
+| Date | Change |
+|---|---|
+| 2026-05-08 | Phase 1 |
+| 2026-05-10 | Manual cleanup: TF-IDF + N-gram + BM25 + sklearn code |