From 772d3e11e096fe6285758a5392d44264ea224daf Mon Sep 17 00:00:00 2001
From: Antigravity Agent
Date: Mon, 4 May 2026 13:23:57 +0900
Subject: [PATCH] docs(wiki): P-Reinforce v3.0 wikification of Attention, KV Cache, and RAG clusters

---
 10_Wiki/Topics/AI_and_ML/Agentic RAG.md       | 38 +++++++++++++++++
 .../Topics/AI_and_ML/Attention Mechanisms.md  | 12 ++++--
 10_Wiki/Topics/AI_and_ML/Flash Attention.md   | 36 ++++++++++++++++
 10_Wiki/Topics/AI_and_ML/GraphRAG.md          | 37 +++++++++++++++++
 .../Grouped-Query Attention (GQA).md          | 37 +++++++++++++++++
 .../Topics/AI_and_ML/KV Cache Compression.md  | 36 ++++++++++++++++
 .../Topics/AI_and_ML/Key-Value (KV) Cache.md  | 37 +++++++++++++++++
 10_Wiki/Topics/AI_and_ML/PagedAttention.md    | 36 ++++++++++++++++
 .../Retrieval-Augmented Generation (RAG).md   | 41 +++++++++++++++++++
 10_Wiki/Topics/AI_and_ML/Ring Attention.md    | 37 +++++++++++++++++
 10_Wiki/Topics/AI_and_ML/Sparse Attention.md  | 38 +++++++++++++++++
 10_Wiki/Topics/AI_and_ML/vLLM.md              | 36 ++++++++++++++++
 12 files changed, 418 insertions(+), 3 deletions(-)
 create mode 100644 10_Wiki/Topics/AI_and_ML/Agentic RAG.md
 create mode 100644 10_Wiki/Topics/AI_and_ML/Flash Attention.md
 create mode 100644 10_Wiki/Topics/AI_and_ML/GraphRAG.md
 create mode 100644 10_Wiki/Topics/AI_and_ML/Grouped-Query Attention (GQA).md
 create mode 100644 10_Wiki/Topics/AI_and_ML/KV Cache Compression.md
 create mode 100644 10_Wiki/Topics/AI_and_ML/Key-Value (KV) Cache.md
 create mode 100644 10_Wiki/Topics/AI_and_ML/PagedAttention.md
 create mode 100644 10_Wiki/Topics/AI_and_ML/Retrieval-Augmented Generation (RAG).md
 create mode 100644 10_Wiki/Topics/AI_and_ML/Ring Attention.md
 create mode 100644 10_Wiki/Topics/AI_and_ML/Sparse Attention.md
 create mode 100644 10_Wiki/Topics/AI_and_ML/vLLM.md

diff --git a/10_Wiki/Topics/AI_and_ML/Agentic RAG.md b/10_Wiki/Topics/AI_and_ML/Agentic RAG.md
new file mode 100644
index 00000000..d0ce8ac2
--- /dev/null
+++ b/10_Wiki/Topics/AI_and_ML/Agentic RAG.md
@@ -0,0 +1,38 @@
+---
+id: [[P-Reinforce|P-Reinforce]]-AUTO-ARAG-001
+category: Unified
+confidence_score: 1.00
+tags: [auto-reinforced, agentic-rag, autonomous-agent, multi-step-retrieval, reasoning-loop]
+last_reinforced: 2026-05-04
+---
+
+# [[Agentic RAG|Agentic RAG]]
+
+## 📌 One-Line Insight (The Karpathy Summary)
+> "Retrieval that thinks: beyond simply answering a question, an autonomous knowledge-exploration loop in which the agent decomposes the query itself, judges whether the retrieved results are sufficient, and searches again or uses tools when needed."
+
+## 📖 Structured Knowledge (Synthesized Content)
+Agentic RAG is an advanced architecture that combines an agent's reasoning ability with the traditional one-shot RAG pipeline.
+
+1. **Key differences**:
+    * **Naive RAG**: question $\rightarrow$ retrieval $\rightarrow$ answer (linear).
+    * **Agentic RAG**: question $\rightarrow$ strategy planning $\rightarrow$ retrieval $\rightarrow$ evaluation $\rightarrow$ (if insufficient) re-planning/re-retrieval $\rightarrow$ final answer (cyclic).
+2. **Main mechanisms**:
+    * **Query Decomposition**: Splits a complex question into several sub-questions and retrieves for each one.
+    * **Self-Correction**: When retrieved results are irrelevant to the question or contradict each other, the agent detects this, revises the search query, and tries again.
+    * **Tool Use**: Selects and uses the tool that fits the situation: not only a vector DB but also web search, SQL execution, a calculator, and so on.
+3. **Problems it addresses**:
+    * **[[Lost in the middle|Lost in the middle]]**: Instead of injecting a huge context all at once, it selects only the strongest evidence and places it strategically, reducing the model's cognitive load.
+    * **Loud Failure**: When retrieval fails, it can clearly recognize what it does not know and either ask the user again or propose an alternative.
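
The cyclic flow described above (plan, retrieve, evaluate, re-plan) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a reference implementation: `agentic_rag` and its callable parameters (`decompose`, `retrieve`, `is_sufficient`, `rewrite`) are hypothetical names, and a real system would back them with an LLM and a retriever. The `max_rounds` limit is one simple form of termination guardrail.

```python
from typing import Callable, Dict, List


def agentic_rag(
    question: str,
    decompose: Callable[[str], List[str]],
    retrieve: Callable[[str], List[str]],
    is_sufficient: Callable[[str, List[str]], bool],
    rewrite: Callable[[str], str],
    max_rounds: int = 3,
) -> Dict[str, object]:
    """Plan -> retrieve -> evaluate -> (re-plan) loop with a hard round limit."""
    evidence: List[str] = []
    for sub_question in decompose(question):      # query decomposition
        query = sub_question
        for _ in range(max_rounds):               # guardrail against endless loops
            hits = retrieve(query)
            if is_sufficient(sub_question, hits):  # evaluate the retrieved results
                evidence.extend(hits)
                break
            query = rewrite(query)                # self-correction: reformulate
        else:
            # "Loud failure": report what could not be answered instead of guessing.
            return {"status": "insufficient", "failed_on": sub_question,
                    "evidence": evidence}
    return {"status": "ok", "evidence": evidence}


# Toy stand-ins (hypothetical) so the sketch runs without any external service.
DOCS = {
    "kv cache": ["KV cache stores past keys/values."],
    "gqa": ["GQA shares KV heads across query groups."],
}


def toy_decompose(q: str) -> List[str]:
    return [part.strip() for part in q.split(" and ")]


def toy_retrieve(q: str) -> List[str]:
    return DOCS.get(q.lower(), [])


def toy_sufficient(q: str, hits: List[str]) -> bool:
    return len(hits) > 0


def toy_rewrite(q: str) -> str:
    return q.lower().replace("what is ", "")


result = agentic_rag("What is KV cache and what is GQA",
                     toy_decompose, toy_retrieve, toy_sufficient, toy_rewrite)
```

In this toy run each sub-question misses on the first try, is rewritten once, and then succeeds; a question with no matching document exhausts `max_rounds` and returns the `insufficient` status.
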
+
+## ⚖️ Trade-offs & Caveats
+* **High cost and latency**: Because it goes through multiple LLM calls and repeated retrieval loops, responses are slower and more expensive than with single-shot RAG.
+* **Loop-exit problem**: The agent may fail to find an answer and fall into an infinite loop or dig in an irrelevant direction, so clear termination conditions and guardrails are essential.
+
+## 🔗 Knowledge Links (Graph)
+* **Foundational technologies**: [[Retrieval-Augmented Generation (RAG)|Retrieval-Augmented Generation (RAG)]], [[Autonomous Agents|Autonomous Agents]]
+* **Related technologies**: [[Re-ranking|Re-ranking]], [[Chain-of-Thought (CoT)|Chain-of-Thought (CoT)]], [[Model Context Protocol (MCP)|MCP]]
+* **Phenomenon addressed**: [[Lost in the middle|Lost in the middle]]
+
+---
+*Last updated: 2026-05-04*
diff --git a/10_Wiki/Topics/AI_and_ML/Attention Mechanisms.md b/10_Wiki/Topics/AI_and_ML/Attention Mechanisms.md
index 02e22afd..41b538b0 100644
--- a/10_Wiki/Topics/AI_and_ML/Attention Mechanisms.md
+++ b/10_Wiki/Topics/AI_and_ML/Attention Mechanisms.md
@@ -24,11 +24,17 @@ last_reinforced: 2026-04-20
 3. **Significance**:
     * Overcame the limits of earlier techniques that processed data sequentially (RNNs) and largely resolved long-range dependency, opening the era of giant models such as ChatGPT.
+2. **Key variants and optimizations**:
+    * **[[Flash Attention|Flash Attention]]**: A hardware-aware optimization that addresses the memory-bandwidth bottleneck and raises speed by 2-4x.
+    * **[[Grouped-Query Attention (GQA)|Grouped-Query Attention (GQA)]]**: The standard in modern LLMs, a compromise between the quality of MHA and the efficiency of MQA.
+    * **[[Sparse Attention|Sparse Attention]]**: Attends selectively to specific tokens only, reducing complexity from $O(n^2)$ toward $O(n)$ when each token attends to a bounded number of others.
+    * **[[Ring Attention|Ring Attention]]**: Realizes ultra-long contexts of a million tokens or more through distributed processing across multiple devices.
+
 ## ⚠️ Contradictions and Updates (Contradictions & RL Update)
 - **Conflict with past data**: It was once believed that looking at all data evenly, or strictly in order, was the accurate approach. In modern deep learning, the "attention efficiency policy" of looking only at what is needed has won out as the factor that determines model performance (RL Update).
-- **Policy change (RL Update)**: To optimize compute cost, approaches that reduce computation, such as 'Flash Attention' or 'Linear Attention', were adopted in place of heavy full attention as core technologies for small models and edge-device AI.
+- **Policy change (RL Update)**: Beyond merely reducing the amount of computation, hybrid attention policies that understand the memory hierarchy (Flash) and exploit the sparsity of token relationships (Sparse/GQA) have become the standard from 2026 onward.
 
 ## 🔗 Knowledge Links (Graph)
-- [[Transformers|Transformers]], Deep Learning, Natural Language [[Processing|Processing]] (NLP), Information-Overload, Economics of Attention
-- **Modern Tech/Tools**: Multi-head Attention, FlashAttention, GPT, [[BERT|BERT]].
+- [[Transformers|Transformers]], [[Deep Learning|Deep Learning]], [[Natural Language Processing (NLP)|Natural Language Processing (NLP)]], [[LLM Inference Optimization|LLM Inference Optimization]]
+- **Specific Technologies**: [[Multi-Head Attention (MHA)|MHA]], [[Grouped-Query Attention (GQA)|GQA]], [[Flash Attention|Flash Attention]], [[Ring Attention|Ring Attention]], [[Sparse Attention|Sparse Attention]].
 
 ---
diff --git a/10_Wiki/Topics/AI_and_ML/Flash Attention.md b/10_Wiki/Topics/AI_and_ML/Flash Attention.md
new file mode 100644
index 00000000..57a5bad3
--- /dev/null
+++ b/10_Wiki/Topics/AI_and_ML/Flash Attention.md
@@ -0,0 +1,36 @@
+---
+id: [[P-Reinforce|P-Reinforce]]-AUTO-FLAT-001
+category: Unified
+confidence_score: 1.00
+tags: [auto-reinforced, flash-attention, attention-optimization, transformer, gpu-optimization, llm-inference]
+last_reinforced: 2026-05-04
+---
+
+# [[Flash Attention|Flash Attention]]
+
+## 📌 One-Line Insight (The Karpathy Summary)
+> "Liberator from the memory bottleneck: the pinnacle of hardware-aware optimization, which keeps the mathematics of attention intact while using tiling to optimize data movement between GPU SRAM and HBM, achieving a 2-4x speedup and dramatic memory savings at the same time."
+
+## 📖 Structured Knowledge (Synthesized Content)
+FlashAttention is a hardware-aware optimization technique for the compute and memory bottlenecks of the attention mechanism that arise when the context window of a large language model (LLM) is extended.
+
+1. **Core operating principles**:
+    * **Tiling**: Splits the huge attention matrix into small blocks (tiles) and performs the computation in fast on-chip GPU SRAM, minimizing the number of accesses to the slower HBM (high-bandwidth memory).
+    * **Recomputation**: Instead of storing the huge intermediate matrix in memory, it recomputes the values needed during backpropagation on demand, lowering memory complexity from $O(n^2)$ to $O(n)$.
+2. **Key results**:
+    * **Accuracy preserved**: It computes exact attention and leaves the fundamental compute complexity ($O(n^2d)$) unchanged, yet improves actual speed by 2-4x.
+    * **Context extension**: By maximizing memory efficiency, it makes long-context processing of hundreds of thousands of tokens or more feasible where it previously was not.
+3. **Version evolution**:
+    * **FlashAttention-2**: Raises parallelism further through optimized operation ordering and work partitioning, reaching roughly 70% of theoretical peak throughput in FP16.
+
+## ⚖️ Trade-offs & Caveats
+* **Limits of the computation itself**: It addresses the memory-bandwidth problem but does not turn the growth of computation with sequence length ($O(n^2)$) into linear growth. Ultra-long sequences of a million tokens or more therefore still incur substantial compute cost.
+* **Tension with distributed processing**: When combined with context-parallel techniques such as Ring Attention, fine-grained FlashAttention processing can cause efficiency penalties due to communication overhead. Hybrid approaches such as USP (Unified Sequence Parallelism) are needed to address this.
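
The tiling idea above rests on computing softmax block by block while keeping only running statistics. The following is a minimal single-query sketch, assuming plain Python lists in place of SRAM tiles. The function names and `block` size are illustrative, and the usual $1/\sqrt{d}$ scaling is omitted for brevity. It is not the FlashAttention kernel itself, only the online-softmax recurrence it builds on.

```python
import math
from typing import List


def attention_row_naive(q: List[float], keys: List[List[float]],
                        values: List[List[float]]) -> List[float]:
    """Reference: materialize every score, then softmax, then weight the values."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / z for j in range(dim)]


def attention_row_tiled(q: List[float], keys: List[List[float]],
                        values: List[List[float]], block: int = 2) -> List[float]:
    """Process K/V in blocks, keeping only a running max, sum, and output."""
    m = float("-inf")                  # running maximum of the scores seen so far
    z = 0.0                            # running softmax denominator
    acc = [0.0] * len(values[0])       # running unnormalized output
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
        m_new = max(m, max(scores))
        scale = math.exp(m - m_new)    # rescale old statistics (0.0 on first block)
        z *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            p = math.exp(s - m_new)
            z += p
            acc = [a + p * x for a, x in zip(acc, v)]
        m = m_new
    return [a / z for a in acc]
```

Both functions return the same result up to floating-point error, which illustrates why tiling does not change the output. The tiled version never holds more than one block of scores at a time.
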
+
+## 🔗 Knowledge Links (Graph)
+* **Parent concepts**: [[Attention Mechanisms|Attention Mechanisms]], [[LLM Inference Optimization|LLM Inference Optimization]]
+* **Child/related technologies**: [[KV Cache|KV Cache]], [[Ring Attention|Ring Attention]], [[Sparse Attention|Sparse Attention]], [[PagedAttention|PagedAttention]]
+* **Project applications**: RAG engines that support very large contexts, autonomous agent analysis loops
+
+---
+*Last updated: 2026-05-04*
diff --git a/10_Wiki/Topics/AI_and_ML/GraphRAG.md b/10_Wiki/Topics/AI_and_ML/GraphRAG.md
new file mode 100644
index 00000000..4d15529c
--- /dev/null
+++ b/10_Wiki/Topics/AI_and_ML/GraphRAG.md
@@ -0,0 +1,37 @@
+---
+id: [[P-Reinforce|P-Reinforce]]-AUTO-GRAG-001
+category: Unified
+confidence_score: 0.95
+tags: [auto-reinforced, graphrag, knowledge-graph, relational-reasoning, structured-knowledge]
+last_reinforced: 2026-05-04
+---
+
+# [[GraphRAG|GraphRAG]]
+
+## 📌 One-Line Insight (The Karpathy Summary)
+> "A web of relationships: going beyond fragmented document chunks, a higher-order retrieval-augmentation technique that structures the logical links between pieces of information as nodes and edges, cutting through complex causal relationships and the overall context."
+
+## 📖 Structured Knowledge (Synthesized Content)
+GraphRAG is an evolved form of RAG that improves retrieval quality by organizing information into a knowledge graph of nodes and edges.
+
+1. **Key differences**:
+    * **Traditional RAG**: Splits text into plain chunks and places them in a vector space $\rightarrow$ high risk that the contextual links between pieces of information are severed.
+    * **GraphRAG**: Explicitly defines the relationships between entities $\rightarrow$ preserves structural knowledge such as "A is the cause of B".
+2. **Main benefits**:
+    * **Relational Reasoning**: Effectively explores latent associations between data that are hard to find with simple keyword matching.
+    * **Holistic summarization**: By traversing the whole graph rather than a particular chunk, it can provide high-level insight into the entire document collection.
+    * **IBM's assessment**: Regarded as one of the most effective alternatives for overcoming the relational-reasoning limits of conventional RAG.
+3. **How it works**:
+    * Uses an LLM to extract entities and relationships from unstructured text and builds a graph DB (for example, Neo4j).
+    * When a question arrives, it collects the relevant nodes and edges through graph traversal and uses them to generate the answer.
+
+## ⚖️ Trade-offs & Caveats
+* **High preprocessing cost**: Extracting and building the graph from text takes far more LLM token cost and time than Naive RAG.
+* **Graph maintenance**: Updating the graph when new data is added while keeping it consistent with the existing graph is complex.
+
+## 🔗 Knowledge Links (Graph)
+* **Parent concepts**: [[Retrieval-Augmented Generation (RAG)|Retrieval-Augmented Generation (RAG)]], [[Knowledge Graph|Knowledge Graph]]
+* **Related technologies**: [[Entity Extraction|Entity Extraction]], [[Vector Database|Vector Database]], [[Reasoning Chains|Reasoning Chains]]
+
+---
+*Last updated: 2026-05-04*
diff --git a/10_Wiki/Topics/AI_and_ML/Grouped-Query Attention (GQA).md b/10_Wiki/Topics/AI_and_ML/Grouped-Query Attention (GQA).md
new file mode 100644
index 00000000..837847b5
--- /dev/null
+++ b/10_Wiki/Topics/AI_and_ML/Grouped-Query Attention (GQA).md
@@ -0,0 +1,37 @@
+---
+id: [[P-Reinforce|P-Reinforce]]-AUTO-GQAM-001
+category: Unified
+confidence_score: 1.00
+tags: [auto-reinforced, grouped-query-attention, gqa, transformer, mha, mqa, llm-efficiency]
+last_reinforced: 2026-05-04
+---
+
+# [[Grouped-Query Attention (GQA)|Grouped-Query Attention (GQA)]]
+
+## 📌 One-Line Insight (The Karpathy Summary)
+> "The golden ratio of efficiency and quality: between the heavy cost of MHA, where every head has its own Key-Value, and the quality loss of MQA, where a single KV is shared, the clever compromise of 'grouped KV sharing' captures both inference speed and quality and has become the standard in modern LLMs."
+
+## 📖 Structured Knowledge (Synthesized Content)
+Grouped-Query Attention (GQA) is an attention variant for the Transformer architecture, designed to maximize inference efficiency by reducing the memory used by the KV cache (Key-Value Cache) while preserving the model's expressive power.
+
+1. **Background**:
+    * **MHA (Multi-Head Attention)**: Every Query head has its own Key/Value head $\rightarrow$ excellent quality, but the KV cache becomes too large.
+    * **MQA (Multi-Query Attention)**: All Query heads share a single Key/Value head $\rightarrow$ very fast, but quality degrades.
+2. **Core mechanism**:
+    * **Grouping**: Bundles several Query heads into one group and assigns one Key/Value head to each group.
+    * **Trade-off**: Chooses a "midpoint" that uses less memory than MHA and preserves information better than MQA.
+3. **Significance**:
+    * A standard technique adopted by recent open-source SOTA models such as Llama 2 (its larger variants), Llama 3, and Mistral.
+    * In long-context processing in particular, it sharply lowers the share of VRAM taken by the KV cache, allowing larger batch sizes or longer inputs on the same hardware.
+
+## ⚖️ Trade-offs & Caveats
+* **Quality/efficiency scaling**: Increasing the number of groups ($G$) moves closer to MHA, improving quality but enlarging the KV cache; decreasing it moves closer to MQA, improving efficiency but lowering quality.
+* **Fixed model architecture**: The group structure must be decided at training time, so an existing MHA model cannot be switched to GQA at inference time only; additional upcycling (uptraining) is required.
+
+## 🔗 Knowledge Links (Graph)
+* **Parent concepts**: [[Attention Mechanisms|Attention Mechanisms]], [[LLM Inference Optimization|LLM Inference Optimization]]
+* **Contrasting technologies**: [[Multi-Head Attention (MHA)|Multi-Head Attention (MHA)]], [[Multi-Query Attention (MQA)|Multi-Query Attention (MQA)]]
+* **Related technologies**: [[KV Cache|KV Cache]], [[PagedAttention|PagedAttention]], [[Flash Attention|Flash Attention]]
+
+---
+*Last updated: 2026-05-04*
diff --git a/10_Wiki/Topics/AI_and_ML/KV Cache Compression.md b/10_Wiki/Topics/AI_and_ML/KV Cache Compression.md
new file mode 100644
index 00000000..7326eb26
--- /dev/null
+++ b/10_Wiki/Topics/AI_and_ML/KV Cache Compression.md
@@ -0,0 +1,36 @@
+---
+id: [[P-Reinforce|P-Reinforce]]-AUTO-KVCP-001
+category: Unified
+confidence_score: 1.00
+tags: [auto-reinforced, kv-cache-compression, attention-optimization, thin-kv, eviction-policy]
+last_reinforced: 2026-05-04
+---
+
+# [[KV Cache Compression|KV Cache Compression]]
+
+## 📌 One-Line Insight (The Karpathy Summary)
+> "A diet for memory: instead of blindly holding on to all information, a sophisticated optimization strategy that selectively deletes or compresses the tokens least important to the context, aiming to fit a near-unbounded context into limited VRAM."
+
+## 📖 Structured Knowledge (Synthesized Content)
+KV cache compression is a technique that removes or summarizes low-importance KV data in order to reduce memory usage, so that longer sequences can be processed or throughput increased.
+
+1. **Main strategies**:
+    * **Eviction**: Deletes from the cache the K and V values of tokens with low attention scores or little informational value. (Examples: StreamingLLM, H2O)
+    * **Merging/Pooling**: Combines the KV values of several semantically similar tokens and stores them as one.
+    * **Dynamic selection**: Lets the model decide for itself at inference time which information to remember and which to forget.
+2. **ThinKV (recent example)**:
+    * A hybrid compression framework that proactively clears less important KV cache tokens according to the importance of each logical "thought", and reuses memory slots in place (in-place reuse) without separate compaction overhead.
+3. **Advantages**:
+    * Can cut the memory footprint dramatically, by 50% to 90% or more.
+    * Obtains a longer context window through software alone, without adding hardware.
+
+## ⚖️ Trade-offs & Caveats
+* **Accuracy loss**: If important tokens are evicted, the model's reasoning can break down or hallucinations can occur.
+* **Compute overhead**: The process of deciding which tokens to discard can itself introduce additional latency.
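
The eviction strategy described above can be illustrated with a deliberately simplified policy: always keep the most recent tokens, then fill the remaining budget with the older tokens that have accumulated the most attention. This is a sketch in the spirit of heavy-hitter style eviction, not the actual StreamingLLM, H2O, or ThinKV algorithm. `select_kept_tokens`, `budget`, and `recent` are hypothetical names, and the accumulated scores are assumed to be tracked elsewhere.

```python
from typing import List


def select_kept_tokens(acc_scores: List[float], budget: int, recent: int) -> List[int]:
    """Return the sorted cache positions to keep under a fixed token budget.

    acc_scores[i] is the attention mass token i has received so far (assumed
    to be maintained by the caller). The last `recent` tokens are always kept.
    """
    if recent > budget:
        raise ValueError("recent window cannot exceed the budget")
    n = len(acc_scores)
    if n <= budget:
        return list(range(n))                      # nothing to evict yet
    recent_idx = list(range(n - recent, n))        # always keep the local window
    older = sorted(range(n - recent), key=lambda i: acc_scores[i], reverse=True)
    heavy = older[: budget - recent]               # highest-scoring older tokens
    return sorted(heavy + recent_idx)


# Six cached tokens, room for four: keep the last two plus the two "heaviest".
kept = select_kept_tokens([0.9, 0.1, 0.5, 0.05, 0.2, 0.3], budget=4, recent=2)
```

Here positions 1 and 3 are evicted. The accuracy risk noted above shows up directly: a token with a low accumulated score is dropped even if a later query would have needed it.
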
+
+## 🔗 Knowledge Links (Graph)
+* **Parent concepts**: [[Key-Value (KV) Cache|Key-Value (KV) Cache]]
+* **Related technologies**: [[Sparse Attention|Sparse Attention]], [[KV Cache Quantization|KV Cache Quantization]], [[ThinKV|ThinKV]], [[StreamingLLM|StreamingLLM]]
+
+---
+*Last updated: 2026-05-04*
diff --git a/10_Wiki/Topics/AI_and_ML/Key-Value (KV) Cache.md b/10_Wiki/Topics/AI_and_ML/Key-Value (KV) Cache.md
new file mode 100644
index 00000000..12bd4b79
--- /dev/null
+++ b/10_Wiki/Topics/AI_and_ML/Key-Value (KV) Cache.md
@@ -0,0 +1,37 @@
+---
+id: [[P-Reinforce|P-Reinforce]]-AUTO-KVCH-001
+category: Unified
+confidence_score: 1.00
+tags: [auto-reinforced, kv-cache, transformer-inference, memory-bottleneck, llm-performance]
+last_reinforced: 2026-05-04
+---
+
+# [[Key-Value (KV) Cache|Key-Value (KV) Cache]]
+
+## 📌 One-Line Insight (The Karpathy Summary)
+> "The model's short-term memory: by storing the computed results (Key, Value) of previous tokens in memory during Transformer inference and reusing them, it eliminates the waste of recomputing from scratch every time and dramatically raises generation speed. The heart of inference optimization."
+
+## 📖 Structured Knowledge (Synthesized Content)
+The KV cache (Key-Value Cache) is a technique in which a large language model (LLM), while generating text, keeps the Key and Value matrices of already-processed tokens in memory. This removes the redundant computation that arises in autoregressive generation.
+
+1. **Why it is needed**:
+    * A Transformer must reference the information of all previous tokens when predicting the next token.
+    * Without a KV cache, generating the $n$-th token requires recomputing tokens $1$ through $n-1$ every time, so the redundant computation grows super-linearly as the sequence gets longer.
+2. **How it works**:
+    * **Prefill phase**: Processes the input prompt in one pass, computing the K and V values of every token and storing them in the cache.
+    * **Decoding phase**: Each time a new token is generated, only that token's K and V values are computed and appended to the cache; the earlier values are loaded from memory and reused.
+3. **Bottlenecks**:
+    * **Memory pressure**: The VRAM taken by the KV cache grows linearly with context length. (For example, when thousands of users hold long conversations simultaneously, this is the number-one cause of GPU out-of-memory (OOM) errors.)
+    * **I/O bottleneck**: The speed of reading cached data from memory (memory bandwidth), rather than the computation itself, ends up determining inference speed.
+
+## ⚖️ Trade-offs & Caveats
+* **Capacity vs. speed**: Caching more makes generation faster but exhausts memory; shrinking the cache (Compression/Quantization) allows longer inputs but may slightly reduce accuracy.
+* **Fragmentation problem**: Pre-allocating fixed-size memory leaves unused empty space, the "memory fragmentation" problem. [[PagedAttention|PagedAttention]] emerged to solve this.
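
The linear growth of the cache with context length can be made concrete with the standard size estimate, $2 \times \text{layers} \times \text{KV heads} \times \text{head dim} \times \text{sequence length} \times \text{batch} \times \text{bytes per element}$, where the leading 2 counts keys and values. The configuration below (32 layers, head dimension 128, FP16) is a hypothetical example chosen for round numbers, not a claim about any specific model.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int = 1, bytes_per_elem: int = 2) -> int:
    """Estimate KV cache size; the factor 2 accounts for both K and V."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


GIB = 1024 ** 3

# Hypothetical model: 32 layers, head_dim 128, FP16, 4096-token context.
mha_bytes = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=4096)
gqa_bytes = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=4096)

print(f"32 KV heads (MHA-style): {mha_bytes / GIB:.2f} GiB per sequence")
print(f" 8 KV heads (GQA-style): {gqa_bytes / GIB:.2f} GiB per sequence")
```

Under these assumptions a single 4096-token sequence needs 2 GiB with one KV head per query head and 0.5 GiB with 8 shared KV heads. Doubling `seq_len` or `batch` doubles the figure, which is why concurrent long conversations exhaust GPU memory so quickly.
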
+
+## 🔗 Knowledge Links (Graph)
+* **Parent concepts**: [[Attention Mechanisms|Attention Mechanisms]], [[LLM Inference Optimization|LLM Inference Optimization]]
+* **Optimization techniques**: [[PagedAttention|PagedAttention]], [[KV Cache Compression|KV Cache Compression]], [[KV Cache Quantization|KV Cache Quantization]], [[Grouped-Query Attention (GQA)|GQA]]
+* **Frameworks**: [[vLLM|vLLM]], [[TensorRT-LLM|TensorRT-LLM]]
+
+---
+*Last updated: 2026-05-04*
diff --git a/10_Wiki/Topics/AI_and_ML/PagedAttention.md b/10_Wiki/Topics/AI_and_ML/PagedAttention.md
new file mode 100644
index 00000000..dd0c7fa6
--- /dev/null
+++ b/10_Wiki/Topics/AI_and_ML/PagedAttention.md
@@ -0,0 +1,36 @@
+---
+id: [[P-Reinforce|P-Reinforce]]-AUTO-PATT-001
+category: Unified
+confidence_score: 1.00
+tags: [auto-reinforced, paged-attention, vllm, kv-cache, memory-management, fragmentation]
+last_reinforced: 2026-05-04
+---
+
+# [[PagedAttention|PagedAttention]]
+
+## 📌 One-Line Insight (The Karpathy Summary)
+> "Bringing the wisdom of the OS to AI: a revolution in inference engines that applies the operating system's virtual-memory paging technique to KV cache management, eliminating memory fragmentation and lifting utilization to 96% or more."
+
+## 📖 Structured Knowledge (Synthesized Content)
+PagedAttention is a technique proposed for managing KV cache memory efficiently during LLM inference. Instead of contiguous memory allocation, it allocates memory in non-contiguous blocks.
+
+1. **Core mechanism**:
+    * **Virtual-memory paging**: Splits the KV cache into fixed-size "logical blocks" and maps them dynamically onto actual "physical blocks".
+    * **Block Table**: Stores the mapping between logical and physical blocks, so that data is handled as if it were logically contiguous even when it is physically scattered.
+2. **Main advantages**:
+    * **Fragmentation removed**: There is no need to reserve a huge region in advance, so internal fragmentation hardly occurs and memory utilization is maximized.
+    * **Memory sharing**: When several requests share the same prompt (for example, Parallel Sampling), the common KV blocks can be stored physically only once and shared (Copy-on-Write).
+3. **Performance gains**:
+    * Higher memory efficiency means that far more concurrent requests (throughput) can be served on the same GPU resources.
+
+## ⚖️ Trade-offs & Caveats
+* **Complex kernel implementation**: It needs dedicated CUDA kernels that read and write non-contiguous memory blocks quickly, which makes implementation difficult.
+* **Block-size sensitivity**: The block size setting (for example, 8 or 16 tokens) creates a trade-off between GPU parallel-processing efficiency and metadata overhead.
+
+## 🔗 Knowledge Links (Graph)
+* **Parent concepts**: [[Key-Value (KV) Cache|Key-Value (KV) Cache]], [[Virtual Memory Paging|Virtual Memory Paging]]
+* **Representative frameworks**: [[vLLM|vLLM]], [[TensorRT-LLM|TensorRT-LLM]]
+* **Related technologies**: [[KV Cache Compression|KV Cache Compression]], [[ThinKV|ThinKV]]
+
+---
+*Last updated: 2026-05-04*
diff --git a/10_Wiki/Topics/AI_and_ML/Retrieval-Augmented Generation (RAG).md b/10_Wiki/Topics/AI_and_ML/Retrieval-Augmented Generation (RAG).md
new file mode 100644
index 00000000..e21ee77b
--- /dev/null
+++ b/10_Wiki/Topics/AI_and_ML/Retrieval-Augmented Generation (RAG).md
@@ -0,0 +1,41 @@
+---
+id: [[P-Reinforce|P-Reinforce]]-AUTO-RAGM-001
+category: Unified
+confidence_score: 1.00
+tags: [auto-reinforced, rag, retrieval-augmented-generation, knowledge-base, llm-context]
+last_reinforced: 2026-05-04
+---
+
+# [[Retrieval-Augmented Generation (RAG)|Retrieval-Augmented Generation (RAG)]]
+
+## 📌 One-Line Insight (The Karpathy Summary)
+> "The textbook open-book exam: instead of cramming all knowledge into the model's parameters, a pragmatic AI architecture that looks up relevant information in an external knowledge store whenever needed and hands it to the model, raising accuracy and reducing hallucination."
+
+## 📖 Structured Knowledge (Synthesized Content)
+RAG (Retrieval-Augmented Generation) is a technique that retrieves relevant documents from an external database and includes them in the prompt, so that a large language model can use recent information or domain-specific knowledge that is absent from its training data.
+
+1. **Process**:
+    * **Indexing**: Splits a large body of documents into small chunks, converts them into vectors, and stores them.
+    * **Retrieval**: Finds in the database the document chunks whose meaning is similar to the user's question.
+    * **Generation**: Passes the retrieved chunks to the model together with the question, so that it generates an answer grounded in that evidence.
+2. **Core benefits**:
+    * **Reduced hallucination**: Because the model answers while looking at supporting documents, it is less likely to make up facts.
+    * **Freshness**: The latest knowledge can be reflected by updating only the external database, without retraining the model.
+    * **Explainability**: The source of an answer (Source/Citation) can be stated clearly, which increases trust.
+3. **Stages of development**:
+    * **Naive RAG**: Based on simple vector search.
+    * **Advanced RAG**: Adds hybrid search, re-ranking, query transformation, and similar steps.
+    * **[[Agentic RAG|Agentic RAG]]**: The agent plans its own retrieval strategy, evaluates whether the results are adequate, and runs a loop.
+
+## ⚖️ Trade-offs & Caveats
+* **Retrieval dependence**: If the retrieved results are poor, answer quality drops sharply as well. (Garbage In, Garbage Out)
+* **Latency**: The added external retrieval step can make responses slower than pure generation.
+* **Lost in the middle**: When too much information is retrieved and passed along, the model may miss important information located in the middle of the context.
+
+## 🔗 Knowledge Links (Graph)
+* **Parent concepts**: [[LLM Application Architecture|LLM Application Architecture]]
+* **Specific techniques**: [[Agentic RAG|Agentic RAG]], [[GraphRAG|GraphRAG]], [[Hybrid Search|Hybrid Search]], [[Re-ranking|Re-ranking]]
+* **Tooling**: [[LlamaIndex|LlamaIndex]], [[LangChain|LangChain]], [[ChromaDB|ChromaDB]], [[Pinecone|Pinecone]]
+
+---
+*Last updated: 2026-05-04*
diff --git a/10_Wiki/Topics/AI_and_ML/Ring Attention.md b/10_Wiki/Topics/AI_and_ML/Ring Attention.md
new file mode 100644
index 00000000..f81ce1a1
--- /dev/null
+++ b/10_Wiki/Topics/AI_and_ML/Ring Attention.md
@@ -0,0 +1,37 @@
+---
+id: [[P-Reinforce|P-Reinforce]]-AUTO-RATT-001
+category: Unified
+confidence_score: 1.00
+tags: [auto-reinforced, ring-attention, context-parallelism, distributed-training, ultra-long-context]
+last_reinforced: 2026-05-04
+---
+
+# [[Ring Attention|Ring Attention]]
+
+## 📌 One-Line Insight (The Karpathy Summary)
+> "A link toward infinity: going beyond the memory limit of a single GPU, an innovation in distributed processing that connects multiple devices in a ring and circulates data among them while computing attention, realizing 'ultra-large context' scaling that is in theory nearly unbounded."
+
+## 📖 Structured Knowledge (Synthesized Content)
+Ring Attention is a technique that distributes sequence data across multiple GPUs or accelerator devices, making it possible to train on and run inference over ultra-long contexts that exceed the memory capacity of a single device.
+
+1. **Core mechanism (Context Parallelism)**:
+    * **Sequence splitting**: Divides the input into $N$ pieces and distributes them across $N$ GPUs.
+    * **Ring Communication**: Each GPU keeps its own Query fixed, receives the Key/Value blocks held by the other GPUs around the ring, and computes attention over them in turn.
+    * **Asynchronous processing**: Overlaps the communication that prefetches the next KV block with the computation on the current block, minimizing time spent waiting on communication.
+2. **Main characteristics**:
+    * **Scalability**: The context length that can be handled grows linearly with the number of devices ($N$) (for example, 1M, 10M tokens or more).
+    * **Accuracy**: Computes exact full attention in a distributed setting, not an approximation.
+3. **Significance**:
+    * Regarded as one of the core infrastructure technologies behind the recent "million-token context" race (Gemini, Claude, and others).
+
+## ⚖️ Trade-offs & Caveats
+* **Communication overhead**: The speed of device-to-device data transfer (P2P communication) can become the bottleneck for overall performance, so a high-speed interconnect such as NVLink is essential.
+* **Tension with FlashAttention**: To avoid the efficiency loss that arises when FlashAttention is run on the split blocks, the communication pattern must be designed with great precision (for example, the USP strategy).
+
+## 🔗 Knowledge Links (Graph)
+* **Parent concepts**: [[Attention Mechanisms|Attention Mechanisms]], [[Distributed Training|Distributed Training]]
+* **Comparable/complementary techniques**: [[Flash Attention|Flash Attention]], [[Sparse Attention|Sparse Attention]]
+* **Applications**: long-range context modeling beyond 1M tokens, whole-codebase analysis of complex codebases
+
+---
+*Last updated: 2026-05-04*
diff --git a/10_Wiki/Topics/AI_and_ML/Sparse Attention.md b/10_Wiki/Topics/AI_and_ML/Sparse Attention.md
new file mode 100644
index 00000000..a592f9dc
--- /dev/null
+++ b/10_Wiki/Topics/AI_and_ML/Sparse Attention.md
@@ -0,0 +1,38 @@
+---
+id: [[P-Reinforce|P-Reinforce]]-AUTO-SATT-001
+category: Unified
+confidence_score: 1.00
+tags: [auto-reinforced, sparse-attention, dsa, attention-complexity, efficiency, deepseek]
+last_reinforced: 2026-05-04
+---
+
+# [[Sparse Attention|Sparse Attention]]
+
+## 📌 One-Line Insight (The Karpathy Summary)
+> "Selection and focus in intelligence: a model of efficient intelligence that drops the waste of comparing every token with every other token and, through 'sparse connections' that pick out only the tokens that matter most in context, lowers computational complexity from $O(n^2)$ to roughly $O(n)$."
+
+## 📖 Structured Knowledge (Synthesized Content)
+Sparse Attention is a technique that, instead of computing the relationship between every pair of tokens, attends selectively to a subset of tokens chosen by a fixed pattern or by importance, which sharply reduces compute and memory cost.
+
+1. **Basic patterns**:
+    * **Sliding Window**: Attends only to neighboring tokens (local context).
+    * **Global Tokens**: Tokens at important positions (such as the start of a sentence) are visible to every position and provide a global view.
+    * **Random/Fixed Patterns**: Predefined rules or random connections supplement long-range dependencies.
+2. **DSA (DeepSeek Sparse Attention)**:
+    * **Indexer-Selector mechanism**: Rather than looking at fixed positions, an "indexer" first finds the relevant tokens and a "selector" then performs attention over that subset only.
+    * **Significance**: Makes it possible to scale to ultra-long contexts of more than 1M tokens while keeping accuracy loss to a minimum.
+3. **Advantages**:
+    * Holds the growth in computation with sequence length to linear ($O(n)$) when each query attends to a bounded number of tokens, which makes large-scale inputs tractable.
+    * Eases memory pressure on the KV cache and so improves inference efficiency.
+
+## ⚖️ Trade-offs & Caveats
+* **Risk of information loss**: If important tokens are missed, the model's reasoning ability can degrade (e.g., the Lost in the middle phenomenon). Preventing this calls for carefully designed hybrid architectures (e.g., the interleaved Local-Global scheme in Gemma 4).
+* **Implementation complexity**: Compared with standard Dense Attention, the architecture is more complex (indexing, selection logic, and so on), so system integration and optimization demand considerable engineering skill.
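To make the patterns above concrete, here is a minimal sketch that builds a causal sliding-window pattern with a few global tokens and counts how many query-key pairs it touches. It also includes a toy top-k selection step in the spirit of the indexer-selector idea. All names (`sparse_pattern`, `attended_pairs`, `select_top_k`) and the plain top-k rule are assumptions made for illustration; this is not DeepSeek's implementation.

```python
def sparse_pattern(n, window, global_positions=(0,)):
    """For each query position, list the key positions it may attend to.

    Combines a causal sliding window with a few globally visible tokens.
    """
    g = set(global_positions)
    allowed = []
    for i in range(n):
        local = set(range(max(0, i - window + 1), i + 1))  # last `window` tokens
        visible_globals = {p for p in g if p <= i}          # respect causality
        allowed.append(sorted(local | visible_globals))
    return allowed

def attended_pairs(allowed):
    """Total number of (query, key) pairs the pattern evaluates."""
    return sum(len(keys) for keys in allowed)

def select_top_k(index_scores, k):
    """Toy selector: keep the k key positions with the highest indexer score.

    `index_scores` stands in for whatever cheap relevance score an indexer
    would produce; how those scores are computed is outside this sketch.
    """
    ranked = sorted(range(len(index_scores)),
                    key=lambda j: index_scores[j], reverse=True)
    return sorted(ranked[:k])
```

With a fixed `window` and a fixed number of global tokens, each row of the pattern has bounded length, so `attended_pairs` grows roughly linearly in `n`, whereas dense causal attention evaluates `n * (n + 1) / 2` pairs.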
+
+## 🔗 Knowledge Links (Graph)
+* **Parent concepts**: [[Attention Mechanisms|Attention Mechanisms]], [[LLM Inference Optimization|LLM Inference Optimization]]
+* **Comparable technique**: [[Flash Attention|Flash Attention]] (I/O optimization vs. reducing the number of operations)
+* **Related techniques**: [[Sliding Window Attention|Sliding Window Attention]], [[Mixture of Experts (MoE)|Mixture of Experts (MoE)]], [[KV Cache|KV Cache]]
+
+---
+*Last updated: 2026-05-04*
diff --git a/10_Wiki/Topics/AI_and_ML/vLLM.md b/10_Wiki/Topics/AI_and_ML/vLLM.md
new file mode 100644
index 00000000..5c8dec9f
--- /dev/null
+++ b/10_Wiki/Topics/AI_and_ML/vLLM.md
@@ -0,0 +1,36 @@
+---
+id: [[P-Reinforce|P-Reinforce]]-AUTO-VLLM-001
+category: Unified
+confidence_score: 1.00
+tags: [auto-reinforced, vllm, llm-serving, throughput-optimization, paged-attention]
+last_reinforced: 2026-05-04
+---
+
+# [[vLLM|vLLM]]
+
+## 📌 One-Line Insight (The Karpathy Summary)
+> "A game changer for serving performance: the standard inference engine that was first to introduce PagedAttention, achieved 10 to 20 times or more the throughput of earlier systems, and brought the era of practical LLM services forward."
+
+## 📖 Structured Knowledge (Synthesized Content)
+vLLM (Virtual Large Language Model) is an open-source library designed for high-performance LLM inference and serving. It focuses on memory efficiency and on maximizing throughput.
+
+1. **Core technologies**:
+    * **[[PagedAttention|PagedAttention]]**: Solves the memory fragmentation problem and thereby raises KV cache utilization dramatically.
+    * **Continuous Batching**: Rather than waiting for every request in a batch to finish, it slots new requests into the batch at each token-generation step as earlier requests complete, which maximizes GPU utilization.
+2. **Key characteristics**:
+    * **High throughput**: Delivers substantially higher throughput than Hugging Face Transformers or Text Generation Inference (TGI).
+    * **Versatility**: Supports most recent open-source models such as Llama, Mistral, and Gemma, and provides an OpenAI-compatible API, which makes integration easy.
+3. **Significance**:
+    * It is one of the standard frameworks considered first when building a production-grade LLM service.
+
+## ⚖️ Trade-offs & Caveats
+* **VRAM occupancy**: For performance it pre-allocates most of the available VRAM for the KV cache, which makes it hard to share a GPU with other processes.
+* **TTFT vs Throughput**: Overall throughput is excellent, but under extreme batching the time to first token (Time-to-First-Token) can increase slightly.
+
+## 🔗 Knowledge Links (Graph)
+* **Core foundations**: [[PagedAttention|PagedAttention]], [[Key-Value (KV) Cache|Key-Value (KV) Cache]]
+* **Competing/alternative technologies**: [[TensorRT-LLM|TensorRT-LLM]], [[TGI|TGI]], [[Ollama|Ollama]]
+* **Optimization techniques**: [[Quantization|Quantization]], [[Speculative Decoding|Speculative Decoding]]
+
+---
+*Last updated: 2026-05-04*
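The continuous batching idea listed under vLLM's core technologies can be sketched with a toy scheduler that counts decoding steps. This is an illustrative model only: the function names are hypothetical, each request is reduced to the number of tokens it still has to generate (assumed to be at least 1), and prefill cost and KV cache memory limits are ignored. It does not reflect vLLM's actual scheduler.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Static batching: every batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))

def continuous_batching_steps(lengths, batch_size):
    """Continuous batching: refill free slots after every decoding step."""
    waiting = deque(lengths)
    running = []                      # tokens left for each in-flight request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any slots freed by finished requests.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        running = [r - 1 for r in running]       # one token for every slot
        running = [r for r in running if r > 0]  # drop finished requests
        steps += 1
    return steps
```

For requests that need `[2, 8, 3, 1]` tokens and a batch size of 2, the static schedule takes 8 + 3 = 11 steps, while the continuous schedule finishes in 8 because short requests no longer hold a slot idle while the long one completes.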