---
id: wiki-2026-0508-word-representation
title: Word Representation
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Word Embeddings, Distributional Semantics, Word Vectors]
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: [nlp, embeddings, word2vec, glove, fasttext, contextual]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: gensim/sentence-transformers/PyTorch
---

# Word Representation

## In One Line

> **"Turn each word into a vector: the distributional hypothesis made mathematical (Firth 1957: 'You shall know a word by the company it keeps')."** The trajectory runs from LSA's SVD in the 1990s through word2vec (2013), GloVe (2014), fastText (2016), and the contextual embeddings of ELMo/BERT (2018), to Matryoshka and adaptive-dimension embeddings in 2024-2026. As of 2026, embeddings are the usual starting point for production NLP, with text-embedding-3, voyage-3, and BGE-M3 as common defaults.

## Core Concepts

### Categories

- **One-hot / count-based**: a simple vocabulary indicator. Sparse. A useful baseline.
- **TF-IDF / BM25**: frequency weighting; sparse and interpretable.
- **LSA / LDA**: SVD / topic model; dense, low-dimensional (~300).
- **Static embeddings**: word2vec (Skip-gram, CBOW), GloVe, fastText. One vector per word, so polysemy cannot be represented.
- **Contextual embeddings**: ELMo, BERT, RoBERTa. The same word in a different context gets a different vector.
- **Sentence/passage embeddings**: SBERT, E5, BGE, voyage. The default for retrieval/RAG.
- **Matryoshka embeddings (proposed 2022, mainstream by 2024)**: a single model with multiple resolutions (64/128/256/512/1024 dim), giving a flexible cost/quality trade-off.

### word2vec Essentials

- **Skip-gram**: predicts context words from the center word (works well for rare words).
- **CBOW**: predicts the center word from its context words (faster to train, slightly better on frequent words).
- **Negative sampling**: replaces the full softmax. Each step updates only the positive pair and k sampled noise words, which scales to huge vocabularies.
- **Vector arithmetic**: king − man + woman ≈ queen (analogy).

### How GloVe Differs

- **Global co-occurrence matrix factorization**: complements word2vec's local sliding window.
- **Loss**: weighted least squares on log(co-occurrence count).

### The Rise of Contextual Models

- A single vector cannot separate "bank" (river) from "bank" (financial); BERT's token-level contextual representations address this limit.
- Transfer learning took off: a task-specific head on top of a frozen embedding.

### Applications

1. Semantic search / RAG (cosine similarity over embeddings).
2. Clustering / topic modeling (k-means on document embeddings).
3. Classification features (linear probe).
4. Recommendation (item embeddings).
5. Anomaly detection (outliers in embedding space).

## 💻 Patterns

### 1. Training word2vec (gensim)

```python
from gensim.models import Word2Vec

# One list of tokens per sentence. Toy placeholder: a real corpus is needed
# for min_count=5 and for the vocabulary queries below to succeed.
sentences = [
    ["cat", "sat", "on", "mat"],
    ["dog", "ran", "fast"],
    # ... more tokenized sentences
]

model = Word2Vec(
    sentences,
    vector_size=300,
    window=5,
    min_count=5,
    sg=1,         # skip-gram
    negative=10,  # negative sampling
    workers=8,
    epochs=10,
)

print(model.wv.most_similar("cat", topn=5))
print(model.wv.similarity("cat", "dog"))

# Analogy: king - man + woman
print(model.wv.most_similar(
    positive=["king", "woman"], negative=["man"], topn=3
))
```

### 2. Loading pre-trained GloVe

```python
import numpy as np

def load_glove(path: str, dim: int = 300) -> dict[str, np.ndarray]:
    """Load GloVe text vectors into a word -> vector dict."""
    embeddings = {}
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            # Split from the right: a few glove.840B tokens contain spaces.
            parts = line.rstrip().rsplit(" ", dim)
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings

glove = load_glove("glove.840B.300d.txt")
```

### 3. fastText subwords (OOV handling)

```python
import fasttext

model = fasttext.train_unsupervised(
    "corpus.txt", model="skipgram", dim=300, minn=3, maxn=6
)

# Out-of-vocabulary words still get a vector, built from character n-grams
print(model.get_word_vector("unseenword").shape)  # (300,)
```

### 4. Contextual embedding (sentence-transformers)

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("BAAI/bge-m3")  # multilingual model released in 2024

docs = [
    "The cat sat on the mat.",
    "A feline rested on the rug.",
    "Stock market closed higher today.",
]
emb = model.encode(docs, normalize_embeddings=True)

# Unit-norm rows, so the matrix product is the cosine similarity matrix
sim = emb @ emb.T
print(sim)
# Illustrative values only; exact numbers depend on the model version:
# [[1.0,  0.81, 0.12],
#  [0.81, 1.0,  0.11],
#  [0.12, 0.11, 1.0 ]]
```

### 5. Matryoshka embedding (truncate dim, 2024)

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Embed once, query at multiple resolutions.
# Per the model card, this model needs trust_remote_code=True and a task
# prefix such as "search_document: " on its inputs.
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
full = model.encode("search_document: hello world", normalize_embeddings=True)  # (768,)

# Truncate + renormalize for each storage tier
def truncate(v: np.ndarray, dim: int) -> np.ndarray:
    t = v[:dim]
    return t / np.linalg.norm(t)

low_storage = truncate(full, 64)   # Hot index
medium = truncate(full, 256)       # Warm
full_quality = full                # Cold rerank
```

### 6. RAG retrieval (vector DB)

```python
from chromadb import Client
from sentence_transformers import SentenceTransformer

docs = [
    "The cat sat on the mat.",
    "A feline rested on the rug.",
    "Stock market closed higher today.",
]

embedder = SentenceTransformer("intfloat/e5-large-v2")
client = Client()
col = client.create_collection("docs")

# E5 models expect "passage: " / "query: " prefixes on their inputs
col.add(
    ids=[f"d{i}" for i in range(len(docs))],
    embeddings=embedder.encode([f"passage: {d}" for d in docs]).tolist(),
    documents=docs,
)
result = col.query(
    query_embeddings=embedder.encode(["query: search query"]).tolist(),
    n_results=min(5, len(docs)),
)
```

### 7. OpenAI text-embedding-3 (production)

```python
from openai import OpenAI

client = OpenAI()

# 3-large can output truncated dims (Matryoshka)
resp = client.embeddings.create(
    model="text-embedding-3-large",
    input=["doc 1", "doc 2"],
    dimensions=512,  # truncate from 3072 default
)
vecs = [d.embedding for d in resp.data]
```

## Decision Criteria

| Situation | Approach |
|---|---|
| Quick prototype, classical NLP | word2vec / GloVe (gensim) |
| OOV / morphologically rich language (Korean, Finnish) | fastText subword |
| Modern semantic search / RAG | sentence-transformers (BGE-M3, E5, gte) or OpenAI/Voyage API |
| Multilingual retrieval | BGE-M3, multilingual-e5-large |
| Storage cost critical | Matryoshka: truncate to 64/128 dim |
| Domain-specific (legal, medical) | Fine-tune contrastive (e.g., BAAI bge-finetune) |

**Default**: BGE-M3 (open) or text-embedding-3-large (managed) as the baseline for a modern RAG pipeline.

## 🔗 Graph

- Parent: [[NLP]] · [[Distributional Semantics]]
- Variants: [[ColBERT]]
- Applications: [[RAG]] · [[Semantic Search]]
- Adjacent: [[Tokenization]]

## 🤖 LLM Usage

**When**: whenever retrieval, clustering, or classification features are needed, which covers nearly every part of a modern NLP pipeline.

**When not**: for generative tasks themselves, LLM completion is superior. For exact keyword matching, BM25 is fast and strong.

## ❌ Anti-patterns

- **Using pre-trained embeddings without normalizing**: the dot product no longer equals cosine similarity, so vector norm can dominate the ranking.
- **Handling polysemy tasks with static word2vec**: a contextual model is needed.
- **Building sentence vectors by mean-pooling raw BERT**: averaging the token outputs of a BERT that was not fine-tuned for similarity is much weaker than a fine-tuned sentence-transformers model.
- **Reducing to an arbitrary dimension with PCA**: Matryoshka models are trained so that the shorter prefixes stay useful for the task, which generally works better.

## 🧪 Verification / Duplicates

- Verified (Mikolov et al. 2013, Pennington et al. 2014, Reimers & Gurevych 2019, BGE-M3 paper 2024).
- Trust level A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: word2vec→Matryoshka full evolution + RAG patterns |
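
To make the normalization anti-pattern listed above concrete, here is a minimal pure-Python sketch. The 2-d vectors and the names `candidates`, `best_by_dot`, and `best_by_cosine` are hypothetical, chosen only to show that a raw dot product and cosine similarity can rank the same candidates differently when embeddings are not unit-normalized.

```python
import math

def dot(a, b):
    """Raw dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity = dot product of the unit-normalized vectors."""
    return dot(normalize(a), normalize(b))

# Hypothetical toy vectors, chosen only to make the effect visible.
query = [1.0, 0.0]
candidates = {
    "aligned": [0.9, 0.1],  # almost the query's direction, small norm
    "long": [3.0, 3.0],     # 45 degrees off the query, large norm
}

best_by_dot = max(candidates, key=lambda k: dot(query, candidates[k]))
best_by_cosine = max(candidates, key=lambda k: cosine(query, candidates[k]))

print(best_by_dot)     # long
print(best_by_cosine)  # aligned
```

Normalizing once at index time, as `normalize_embeddings=True` does in the sentence-transformers patterns, makes the two scores coincide, so the cheaper dot product can be used for search.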