"Every word as a vector: the distributional hypothesis made mathematical (Firth 1957: 'You shall know a word by the company it keeps')". The evolutionary trajectory runs from the SVD-based LSA of the 1990s through word2vec (2013), GloVe (2014), fastText (2016), and the contextual embeddings of ELMo/BERT (2018) to Matryoshka and adaptive-dimension embeddings (2024-2026). As of 2026, embeddings are the starting point for production NLP, with models such as text-embedding-3, voyage-3, and BGE-M3 as typical defaults.
Key points
Categories
One-hot / count-based: a simple vocabulary indicator. Sparse. A useful baseline.
TF-IDF / BM25: frequency weighting; sparse and interpretable.
LSA / LDA: SVD-based factorization / topic model; dense and low-dimensional (~300).
Static embeddings: word2vec (Skip-gram, CBOW), GloVe, fastText. One vector per word, so polysemy cannot be represented.
Contextual embeddings: ELMo, BERT, RoBERTa. The same word gets a different vector in a different context.
Sentence/passage embeddings: SBERT, E5, BGE, voyage. The default for retrieval/RAG.
Matryoshka embeddings (2024): a single model that supports multiple resolutions (64/128/256/512/1024 dimensions), giving a flexible cost/quality trade-off.
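To make the sparse, frequency-weighted categories concrete, here is a minimal TF-IDF sketch in plain Python. It assumes one common variant (term frequency times log(N / document frequency)); real libraries apply different smoothing and normalization, and the `tfidf` helper and toy documents are illustrative only.

```python
import math
from collections import Counter

def tfidf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Return one sparse {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            # Assumed variant: relative term frequency * log(N / df).
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "stocks closed higher".split(),
]
vecs = tfidf(docs)
```

Each vector only stores the terms that occur in its document, which is what makes the representation sparse and easy to inspect: a term shared by several documents ("the") ends up with a lower weight than a term unique to one ("cat").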
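word2vec essentials
Skip-gram: predicts the context words from the center word (works well for rare words).
CBOW: predicts the center word from its context words (faster to train; suits frequent words).
Negative sampling: replaces the full softmax. Each step updates only the positive pair and k sampled noise words, which scales to very large vocabularies.
Vector arithmetic: king − man + woman ≈ queen (analogy).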
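The Skip-gram and negative-sampling items above can be sketched as follows. This is a simplified illustration, not the reference word2vec implementation: the helper names (`skipgram_pairs`, `sgns_loss`) are hypothetical, the negative vectors are passed in by the caller rather than sampled from a noise distribution, and no parameter update is shown.

```python
import math

def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs within a symmetric window."""
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sgns_loss(center_vec, context_vec, negative_vecs):
    """Negative-sampling loss for one (center, context) pair.

    The true pair is pushed toward a high score, and each of the
    k negative vectors toward a low score; no softmax over the
    whole vocabulary is needed.
    """
    loss = -math.log(sigmoid(dot(center_vec, context_vec)))
    for neg in negative_vecs:
        loss -= math.log(sigmoid(-dot(center_vec, neg)))
    return loss

tokens = "the quick brown fox jumps".split()
pairs = list(skipgram_pairs(tokens, window=1))
```

The cost of one training step depends on k rather than on the vocabulary size, which is why the technique scales to large vocabularies.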
How GloVe differs
Global co-occurrence matrix factorization, which complements word2vec's local sliding window.
Loss: weighted least squares on log(co-occurrence count).
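For reference, the weighted least-squares objective can be written as below, following Pennington et al. (2014). Here $X_{ij}$ is the co-occurrence count of words $i$ and $j$, $w_i$ and $\tilde{w}_j$ are the word and context vectors, $b_i$ and $\tilde{b}_j$ are bias terms, and $V$ is the vocabulary size. The paper reports $x_{\max} = 100$ and $\alpha = 3/4$ as the values used; treat them as defaults rather than requirements.

```latex
J = \sum_{i,j=1}^{V} f(X_{ij})
    \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^{2},
\qquad
f(x) =
\begin{cases}
  (x / x_{\max})^{\alpha} & \text{if } x < x_{\max}, \\
  1 & \text{otherwise.}
\end{cases}
```

The weighting function $f$ keeps very frequent pairs from dominating the loss and gives zero weight to pairs that never co-occur, so $\log X_{ij}$ is only evaluated for non-zero counts.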
The rise of contextual embeddings
"bank" (river vs. financial) exposes the limit of a single vector per word, which motivated BERT's token-level contextual representations.
Transfer learning took off: a task-specific head is trained on top of frozen embeddings.
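A minimal sketch of the "frozen embedding plus task-specific head" pattern, in plain Python. The two-dimensional vectors in `FROZEN` are made-up stand-ins for the output of a pre-trained encoder, and the logistic-regression head and its hyperparameters are assumptions chosen for illustration; only the head's weights are updated.

```python
import math

# Stand-in for frozen embeddings: in practice these would come from a
# pre-trained encoder and stay fixed during training (toy values here).
FROZEN = {
    "great movie": [0.9, 0.1],
    "loved it": [0.8, 0.2],
    "terrible plot": [0.1, 0.9],
    "waste of time": [0.2, 0.8],
}
LABELS = {"great movie": 1, "loved it": 1, "terrible plot": 0, "waste of time": 0}

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def train_head(features, labels, epochs=500, lr=0.5):
    """Fit a logistic-regression head with SGD; only w and b change."""
    dim = len(features[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            grad = p - y  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b

def predict(w, b, x) -> int:
    return int(sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5)

texts = list(FROZEN)
w, b = train_head([FROZEN[t] for t in texts], [LABELS[t] for t in texts])
preds = [predict(w, b, FROZEN[t]) for t in texts]
```

Because the encoder is never touched, the embeddings can be computed once and cached, and the head trains in seconds even on a CPU.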
Applications
Semantic search / RAG (cosine similarity over embeddings).
Clustering / topic modeling (k-means on document embeddings).
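The semantic-search case reduces to ranking stored vectors by cosine similarity to a query vector. The sketch below uses brute-force search over hand-written three-dimensional toy vectors; the `top_k` helper and the document ids are hypothetical, and a production system would typically use an approximate nearest-neighbor index instead.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query_vec, index, k=2):
    """Rank (doc_id, vector) entries by cosine similarity to the query."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy 3-dimensional vectors standing in for real document embeddings.
index = {
    "cat_doc": [0.9, 0.1, 0.0],
    "feline_doc": [0.8, 0.3, 0.1],
    "stocks_doc": [0.0, 0.2, 0.9],
}
results = top_k([1.0, 0.2, 0.0], index, k=2)
```

In a RAG pipeline, the texts behind the returned ids would then be placed into the prompt as context.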
```python
import fasttext

model = fasttext.train_unsupervised(
    "corpus.txt", model="skipgram", dim=300, minn=3, maxn=6
)

# OOV words still get a vector, built from their subword n-grams
print(model.get_word_vector("unseenword").shape)  # (300,)
```
4. Contextual embedding (sentence-transformers)
```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("BAAI/bge-m3")  # 2024 SOTA multilingual

docs = [
    "The cat sat on the mat.",
    "A feline rested on the rug.",
    "Stock market closed higher today.",
]

# With normalized embeddings, the dot product equals cosine similarity
emb = model.encode(docs, normalize_embeddings=True)
sim = emb @ emb.T
print(sim)
# Example output (approximate values):
# [[1.0, 0.81, 0.12],
#  [0.81, 1.0, 0.11],
#  [0.12, 0.11, 1.0 ]]
```
5. Matryoshka embedding (truncate dim, 2024)
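```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Embed once, query at multiple resolutions
# (this model ships custom code, hence trust_remote_code=True)
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
full = model.encode("hello world", normalize_embeddings=True)  # (768,)

# Truncate + renormalize for storage tier
def truncate(v: np.ndarray, dim: int) -> np.ndarray:
    t = v[:dim]
    return t / np.linalg.norm(t)

low_storage = truncate(full, 64)   # Hot index
medium = truncate(full, 256)       # Warm
full_quality = full                # Cold rerank
```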
When to use: whenever you need features for retrieval, clustering, or classification, which covers almost every part of a modern NLP pipeline.
When not to use: for generative tasks themselves, LLM completion is superior. For exact keyword matching, BM25 is fast and strong.
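❌ Anti-patterns
Using pre-trained embeddings without normalizing them: the dot product then no longer equals cosine similarity, so vector norms distort the ranking.
Using static word2vec for polysemy-sensitive tasks: a contextual model is needed.
Building sentence vectors by mean-pooling raw BERT outputs: this is much weaker than a fine-tuned sentence-transformers model.
Arbitrary dimensionality reduction with PCA: Matryoshka-trained models typically give better task-aware short dimensions.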
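The normalization anti-pattern is easy to reproduce with two toy vectors. In this sketch (hand-picked values, hypothetical document names) the raw dot product prefers a long vector pointing 45 degrees away from the query, while cosine similarity prefers the short vector that is almost perfectly aligned.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

query = [1.0, 0.0]
docs = {
    "aligned_short": [1.0, 0.1],  # nearly the same direction, small norm
    "offaxis_long": [3.0, 3.0],   # 45 degrees off, large norm
}

# Raw dot product: the vector norm leaks into the score.
raw_best = max(docs, key=lambda d: dot(query, docs[d]))

# Dot product of unit vectors: this is cosine similarity.
cos_best = max(docs, key=lambda d: dot(normalize(query), normalize(docs[d])))
```

Here `raw_best` is "offaxis_long" and `cos_best` is "aligned_short", so the two scoring rules disagree on the top result until the vectors are normalized.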
🧪 Verification / duplicates
Verified (Mikolov et al. 2013, Pennington et al. 2014, Reimers & Gurevych 2019, BGE-M3 paper 2024).
Confidence: A.
🕓 Changelog
| Date | Change |
| --- | --- |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: word2vec→Matryoshka full evolution + RAG patterns |