---
id: [[P-Reinforce|P-Reinforce]]-AUTO-VSR-001
category: AI_and_ML
confidence_score: 1.00
tags: [auto-reinforced, vector-search, ann, semantic-similarity, information-retrieval]
last_reinforced: 2026-05-04
---

# [[Vector Search|Vector Search]]

## 📌 One-Line Insight (The Karpathy Summary)

> "From keyword matching to semantic matching: a mathematical retrieval technique that goes beyond checking whether words literally match and, by computing distances in a high-dimensional vector space, finds the information whose context is closest to the user's 'intent'."

## 📖 Structured Knowledge (Synthesized Content)

Vector search represents data as vectors in a multidimensional space and returns the most relevant items by computing the distance (or similarity) between each item and a query vector.

1. **Traditional search vs. vector search**:
    * **Traditional search ([[Keyword Search|Keyword Search]])**: Relies on lexical term matching, scored with schemes such as TF-IDF or BM25. A search for '사과' (Korean for 'apple') can miss documents that contain only the word 'Apple'.
    * **Vector search ([[Semantic Search|Semantic Search]])**: Captures semantic similarity. A search for 'iPhone manufacturer' can still retrieve Apple-related documents even though the query never mentions 'Apple'.
2. **Core algorithm: [[ANN (Approximate Nearest Neighbor)|ANN]]**: Exhaustively comparing the query against every vector (brute force) costs time proportional to the dataset size, which is impractical for large datasets, so algorithms that quickly find approximate neighbors are used instead.
    * **[[HNSW|HNSW]]**: Builds a hierarchical proximity graph over the nodes to support fast traversal.
    * **[[Product Quantization (PQ)|PQ]]**: Compresses vectors (lossily) to reduce memory usage.
    * **[[IVF|IVF]]**: Partitions the space into clusters so that a search only visits the clusters nearest the query.
3. **Similarity measures (Distance Metrics)**:
    * **Cosine Similarity**: Measures the cosine of the angle between two vectors, ignoring their magnitudes (the most widely used).
    * **Euclidean Distance (L2)**: Measures the straight-line distance between two points.
    * **Dot Product**: Takes both the magnitude and the direction of the vectors into account.

## ⚖️ Trade-offs & Caveats

* **Balancing accuracy and speed**: [[ANN|ANN]] algorithms make search dramatically faster, but the trade-off is that they do not guarantee returning the exact nearest neighbors.
* **Compute overhead**: Converting data to vectors (embedding) and performing high-dimensional arithmetic consume far more resources than traditional search.
* **Inefficiency on simple queries**: For exact-match tasks such as looking up a model name or a specific ID, vector search can be slower or less accurate than traditional [[BM25|BM25]]. (For this reason, [[Hybrid Search|Hybrid Search]], which combines the two approaches, is now commonly recommended.)

## 💻 Practical Implementation Code (Boilerplate)

An example of the core logic for measuring the similarity between two vectors using `scikit-learn`.

```python
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# 1. Example vectors (assumed to be outputs of an embedding model)
query_vector = np.array([[0.1, 0.2, 0.8]])     # "AI knowledge"
doc_vector_1 = np.array([[0.12, 0.18, 0.75]])  # "AI knowledge reinforcement"
doc_vector_2 = np.array([[0.9, 0.1, 0.05]])    # "Today's weather"

# 2. Compute similarity (each call returns a 1x1 matrix)
sim_1 = cosine_similarity(query_vector, doc_vector_1)
sim_2 = cosine_similarity(query_vector, doc_vector_2)

print(f"Similarity with Doc 1 (Related): {sim_1[0][0]:.4f}")
print(f"Similarity with Doc 2 (Unrelated): {sim_2[0][0]:.4f}")

# 3. Interpret the result (compare the scalar entry, not the 1x1 array)
if sim_1[0][0] > 0.85:
    print("This document is contextually very similar.")
```

## 🔗 Knowledge Connections (Graph)

* **Foundational technologies**: [[Vector Embedding|Vector Embedding]], [[Vector Database|Vector Database]]
* **Applications**: [[Retrieval-Augmented Generation (RAG)|RAG]], [[Semantic Search|Semantic Search]], [[Hybrid Search|Hybrid Search]]
* **Advanced algorithms**: [[ANN|ANN]], [[HNSW|HNSW]], [[Product Quantization|PQ]]

---
*Last updated: 2026-05-04*
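
As a supplement to the notes on ANN and distance metrics, the brute-force baseline that ANN indexes approximate can be sketched with plain NumPy. This is a minimal illustration, not a production implementation: the function name `top_k_exact`, its `metric` parameter, and the toy vectors are all assumptions made for this example. The third document is deliberately the query scaled by 2, so the metrics disagree about which document ranks first.

```python
import numpy as np

def top_k_exact(query, docs, k=2, metric="cosine"):
    """Score every row of `docs` against `query` (brute force) and return
    the indices and scores of the k best matches, best first.
    Hypothetical helper written for illustration only."""
    query = np.asarray(query, dtype=float)
    docs = np.asarray(docs, dtype=float)
    if metric == "cosine":
        # Cosine of the angle: dot product divided by both norms.
        scores = (docs @ query) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
    elif metric == "dot":
        # Raw dot product: sensitive to both direction and magnitude.
        scores = docs @ query
    elif metric == "l2":
        # Negated Euclidean distance, so that larger always means "closer".
        scores = -np.linalg.norm(docs - query, axis=1)
    else:
        raise ValueError(f"unknown metric: {metric}")
    order = np.argsort(-scores)[:k]  # descending by score
    return order, scores[order]

# Toy data (illustrative values, not real embeddings).
docs = np.array([
    [0.12, 0.18, 0.75],  # close to the query in both angle and position
    [0.90, 0.10, 0.05],  # unrelated direction
    [0.20, 0.40, 1.60],  # same direction as the query, twice the length
])
query = np.array([0.1, 0.2, 0.8])

for metric in ("cosine", "dot", "l2"):
    idx, _ = top_k_exact(query, docs, k=3, metric=metric)
    print(metric, idx.tolist())
```

Cosine similarity and the dot product rank the scaled copy (index 2) first, while Euclidean distance prefers index 0 because it penalizes the difference in length. When all vectors are normalized to unit length, the three metrics produce the same ranking. Every query here costs one pass over all documents, which is the linear cost that ANN structures such as HNSW and IVF are designed to avoid.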