"매 두 vector 의 close 의 measure". Classical IR (Salton, 1970s) → modern embedding-based retrieval (RAG with Claude Opus 4.7, 2026) — 매 cosine 의 default, 매 task-specific tuning 의 critical.
importfaissimportnumpyasnpembeddings=np.random.randn(1_000_000,768).astype('float32')faiss.normalize_L2(embeddings)# for cosine via inner productindex=faiss.IndexFlatIP(768)# inner product = cosine on normalizedindex.add(embeddings)query=np.random.randn(1,768).astype('float32')faiss.normalize_L2(query)D,I=index.search(query,k=10)# top-10
RAG with embeddings (2026)
fromanthropicimportAnthropicimportvoyageaivo=voyageai.Client()client=Anthropic()defembed(texts):r=vo.embed(texts,model="voyage-3",input_type="document")returnnp.array(r.embeddings)docs_emb=embed(corpus)q_emb=vo.embed([query],model="voyage-3",input_type="query").embeddings[0]scores=docs_emb@np.array(q_emb)# cosine on normalizedtop_k=np.argsort(scores)[-5:][::-1]
Jaccard for sets
defjaccard(a:set,b:set)->float:ifnotaandnotb:return1.0returnlen(a&b)/len(a|b)# MinHash for approximate Jaccard at scalefromdatasketchimportMinHashm1,m2=MinHash(),MinHash()forwindoc1.split():m1.update(w.encode())forwindoc2.split():m2.update(w.encode())print(m1.jaccard(m2))