---
id: NLP-TF-IDF-001
category: Dev
confidence_score: 1.0
tags: [ai, nlp, tf-idf, information-retrieval, text-mining, keyword-extraction, search-engine]
last_reinforced: 2026-04-26
---

# Term Frequency-Inverse Document Frequency (TF-IDF)

## 📌 One-Line Insight (The Karpathy Summary)
> "Give weight to words that are frequent within a document but rare across the whole corpus, strip away the common noise (stopwords), and extract the core keywords that define the document's unique 'identity'."
>
> A numerical measure of how statistically important a given word is to a document within a text collection.

## 📖 Structured Knowledge (Synthesized Content)
- **Extracted pattern:** "Local Relevance and Global Rarity Balance". Raise the score of words that appear often in a particular document (the $\text{TF}$ factor) and lower the score of words that appear commonly across all documents (the $\text{IDF}$ factor), so that the features that best represent the document stand out.
- **Core formula:** $\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)$
  - **TF (Term Frequency):** how often term $t$ appears in document $d$.
  - **IDF (Inverse Document Frequency):** the logarithm of the inverse of the fraction of documents that contain term $t$, that is, $\text{IDF}(t, D) = \log \frac{|D|}{|\{d \in D : t \in d\}|}$.
- **Significance:** one of the strongest and most intuitive foundational tools of early natural language processing and information retrieval, used for search-engine document ranking, text summarization, and similarity measurement.

## ⚠️ Contradictions and Updates (Contradictions & RL Update)
- **Conflict with earlier data:** because the bag-of-words approach ignores word order and context, TF-IDF has ceded ground to deep-learning embeddings ([[BERT|BERT]] and similar models). It still serves as a baseline for keyword-based search and data preprocessing, where it is very cheap to compute and easy to interpret.
- **Policy change:** the Antigravity project uses TF-IDF as its first-pass filtering engine for the initial automatic classification and core tag extraction of 1,174 knowledge documents, because it keeps compute cost minimal while remaining accurate.

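
To make the $\text{TF} \times \text{IDF}$ weighting and its use for tag extraction concrete, here is a minimal sketch in plain Python. It is not the Antigravity pipeline. The function names (`tokenize`, `tf_idf`, `top_keywords`), the toy corpus, the naive regex tokenizer, the length-normalized TF, and the unsmoothed natural-log IDF are all illustrative assumptions. Production libraries often use smoothed or otherwise adjusted variants.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphanumeric runs (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tf_idf(docs):
    """Return one {term: score} dict per document.

    Assumed variant for this sketch:
      TF(t, d)  = count of t in d / number of tokens in d
      IDF(t, D) = ln(|D| / number of documents containing t)
    """
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def top_keywords(doc_scores, k=3):
    """Pick the k highest-scoring terms of one document as candidate tags."""
    # Sort by descending score, then alphabetically so ties are deterministic.
    ranked = sorted(doc_scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical toy corpus, only for illustration.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the search engine ranks the documents",
]
all_scores = tf_idf(docs)
print(top_keywords(all_scores[1]))  # ['chased', 'dog', 'cat']
```

In this toy corpus "the" occurs in every document, so its IDF is $\ln(3/3) = 0$ and its score vanishes no matter how often it appears, while "dog" and "chased" occur in only one document and rank highest there. That is the stopword-suppression effect described above.
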
## 🔗 Knowledge Connections (Graph)
- [[Natural-Language-Processing|Natural-Language-Processing-NLP]], [[Semantic-Search-with-AI|Semantic-Search-with-AI]], [[Sparse-Data-Handling|Sparse-Data-Handling]], [[Similarity-Metrics-in-AI|Similarity-Metrics-in-AI]]
- **Raw Source:** 10_Wiki/Topics/AI/Term-Frequency-Inverse-Document-Frequency.md