---
id: wiki-2026-0508-term-frequency-inverse-document-
title: Term Frequency Inverse Document Frequency
category: 10_Wiki/Topics
status: needs_review
canonical_id: self
aliases: [NLP-TF-IDF-001]
duplicate_of: none
source_trust_level: A
confidence_score: 1.0
tags: [ai, nlp, tf-idf, information-retrieval, Text-Mining, keyword-extraction, Search-engine]
raw_sources: []
last_reinforced: 2026-04-26
github_commit: pending
inferred_by: Claude Opus 4.7 (auto-normalize 2026-05-08)
---

# Term Frequency-Inverse Document Frequency (TF-IDF)

## 📌 One-Line Insight (The Karpathy Summary)

> "Weight the words that are frequent within a document yet rare across the whole corpus, so that common noise (stopwords) is stripped away and the core keywords that define the document's unique 'identity' are extracted."
>
> TF-IDF is a numerical measure of the statistical importance a given word carries within a document in a text collection.

## 📖 Structured Knowledge (Synthesized Content)

- **Extracted pattern:** "Local Relevance and Global Rarity Balance": raise the score of words that appear often in a specific document ($TF$) and lower the score of words that appear commonly across all documents ($IDF$), so that the features that best represent the document are extracted.
- **Core formula:** $\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)$
  - **TF (Term Frequency):** how often term $t$ appears in document $d$.
  - **IDF (Inverse Document Frequency):** the logarithm of the inverse of the fraction of documents in $D$ that contain term $t$, that is, $\text{IDF}(t, D) = \log \frac{|D|}{|\{d \in D : t \in d\}|}$.
- **Significance:** one of the most effective and intuitive foundational tools of early natural language processing and information retrieval, used for document ranking in search engines, text summarization, similarity measurement, and similar tasks.

## ⚠️ Contradictions & Updates

- **Conflict with past data:** Because the 'Bag-of-Words' approach ignores word order and context, TF-IDF has given way to deep-learning embeddings ([[BERT|BERT]] and others) for many tasks. It still serves as a baseline for keyword-based search and data preprocessing, where it is computationally cheap and easy to interpret.
- **Policy change:** The Antigravity project uses the TF-IDF algorithm as its first-pass filtering engine for the initial automatic classification and key-tag extraction of 1,174 knowledge documents, because it keeps compute cost minimal while remaining accurate.

## 🔗 Knowledge Links (Graph)

- [[Natural-Language-Processing|Natural-Language-Processing-NLP]], [[Semantic-Search-with-AI|Semantic-Search-with-AI]], [[Sparse-Data-Handling|Sparse-Data-Handling]], [[Similarity-Metrics-in-AI|Similarity-Metrics-in-AI]]
- **Raw Source:** 10_Wiki/Topics/AI/Term-Frequency-Inverse-Document-Frequency.md

## 🤖 LLM Usage Hints (How to Use This Knowledge)

**When to use this knowledge:**
- *(TODO)*

**When not to use it:**
- *(TODO)*

## 🧪 Verification Status (Validation)

- **Information status:** needs_review
- **Source trust level:** A
- **Review reason:** *(P-Reinforce Phase 1 automatic normalization. Body content still needs verification.)*

## 🧬 Duplicate Check

- **Existing similar documents:** *(TODO: see the indexer cluster report)*
- **Handling:** UPDATE (automatic normalization)
- **Reason:** Phase 1 normalization: updated the old template and filled in missing fields.

## 🕓 Changelog

| Date | Change | Handling | Trust level |
|------|--------|----------|-------------|
| 2026-05-08 | P-Reinforce Phase 1 normalization (frontmatter + header standardization) | UPDATE | A |
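
## 🧮 Worked Sketch (TF-IDF in Python)

The core formula from the Synthesized Content section, TF-IDF(t, d, D) = TF(t, d) × IDF(t, D), can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the Antigravity project's implementation: the whitespace tokenizer, the length-normalized TF, the unsmoothed natural-log IDF, the helper names (`tokenize`, `tf`, `idf`, `tf_idf`, `top_keywords`), and the toy corpus are all hypothetical choices made for this example.

```python
import math


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tf(term, doc_tokens):
    """Relative frequency of `term` in one document (assumed normalization)."""
    if not doc_tokens:
        return 0.0
    return doc_tokens.count(term) / len(doc_tokens)


def idf(term, corpus_tokens):
    """log(N / df): log of the inverse share of documents containing `term`."""
    df = sum(1 for doc in corpus_tokens if term in doc)
    if df == 0:
        return 0.0  # assumption: a term absent from the corpus gets no weight
    return math.log(len(corpus_tokens) / df)


def tf_idf(term, doc_tokens, corpus_tokens):
    """Product of local frequency and global rarity."""
    return tf(term, doc_tokens) * idf(term, corpus_tokens)


def top_keywords(doc_tokens, corpus_tokens, k=3):
    """Rank the distinct terms of one document by TF-IDF, highest first.

    Ties are broken alphabetically so the result is deterministic.
    """
    scores = {t: tf_idf(t, doc_tokens, corpus_tokens) for t in set(doc_tokens)}
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy corpus, invented for illustration only.
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang a song",
]
corpus_tokens = [tokenize(doc) for doc in corpus]

# "the" occurs in every document, so its IDF is log(3/3) = 0.
print(tf_idf("the", corpus_tokens[0], corpus_tokens))
# Terms unique to the first document outrank "cat", which is shared.
print(top_keywords(corpus_tokens[0], corpus_tokens))
```

In this sketch a word that appears in every document scores zero regardless of how often it occurs locally, which is the stopword-suppressing behavior described above. Library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization by default, so their scores will not match these numbers exactly.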