---
id: wiki-2026-0508-llm-optimization-and-deployment-
title: LLM Optimization and Deployment Strategies
category: 10_Wiki/Topics
status: needs_review
canonical_id: self
aliases: [P-Reinforce-CANONICAL-LLM-OPTIMIZATION]
duplicate_of: none
source_trust_level: A
confidence_score: 0.92
tags: [canonical, llm-ops, quantization, distillation, peft, vllm, inference]
raw_sources: []
last_reinforced: 2026-05-08
github_commit: pending
inferred_by: Claude Opus 4.7 (auto-normalize 2026-05-08)
tech_stack:
  language: unspecified
  framework: unspecified
---

# [[LLM_Optimization_and_Deployment_Strategies|LLM Optimization & Deployment Strategies]]

## 📌 One-Line Insight (The Karpathy Summary)
> "Raise the density of intelligence; lower the cost of execution." LLM optimization is the process of refining a model until it is usable in a production service: compress the parameters of a very large model (quantization, distillation), maximize training efficiency (PEFT), and push throughput as high as possible with an inference engine (vLLM, PagedAttention).

---

## 📖 Structured Knowledge (Synthesized Content)

### 1. Model Compression
* **Quantization:** Converts 32-bit floating-point (FP32) weights to 8-bit (INT8) or 4-bit (INT4/NF4) values, cutting weight memory by more than 70% relative to FP32 while keeping quality loss small. Common formats and methods include GGUF, EXL2, and AWQ.
* **Knowledge Distillation:** Transfers the knowledge of a large teacher model to a small, fast student model so that the small model can still reach high quality.
* **Pruning:** Removes low-importance neurons or connections from the model to reduce the amount of computation.

### 2. Efficient Fine-Tuning (PEFT)
* **LoRA (Low-Rank Adaptation):** Freezes all of the original weights and trains only very small low-rank matrices, producing a model specialized for a particular domain with few resources.
* **QLoRA:** Applies LoRA on top of a quantized model, which makes it possible to fine-tune multi-billion-parameter models on a single consumer GPU.

### 3. LLM Ops & Deployment
* **vLLM & PagedAttention:** Inspired by how an operating system manages virtual memory, PagedAttention stores the KV cache in fixed-size blocks instead of one contiguous region per request. This reduces wasted memory and raises inference throughput several-fold.
* **Speculative Decoding:** A small draft model proposes tokens first and the large model verifies them, which speeds up inference.
* **Continuous Monitoring & Evaluation:** Monitor response latency, token usage, and hallucination metrics in real time, and evaluate quality regularly with frameworks such as Ragas or G-Eval.
* **Data Drift Detection:** Detect shifts in the distribution of user inputs to decide when to retrain the model or adjust prompts.
* **Local Deployment:** Run LLMs in a local environment (Apple M-series Macs, mini PCs, and similar) with tools such as Ollama or LM Studio, keeping data private.

---

## ⚖️ Trade-offs and Caveats (Trade-offs)
* **Precision vs. speed:** Lower quantization bit widths run faster, but quality can degrade (perplexity rises) on complex reasoning or math problems.
* **Latency vs. throughput:** Optimizing for a fast response to a single user and optimizing to serve many concurrent users can call for different strategies.
* **Cost vs. performance:** Choose between deployment on a high-performance GPU cluster and local/edge deployment by weighing capability against cost for the project's goals.
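
The memory figure quoted for quantization and the "very small matrices" claim for LoRA can both be sanity-checked with simple arithmetic. The sketch below is illustrative only: the 7B model size, the 4096 x 4096 matrix shape, and the rank of 8 are hypothetical values chosen for the example, and the helper names are not part of any library. It counts weight storage only and ignores the per-group scales and zero points that real quantized formats also store, so actual files are somewhat larger than the nominal bit width suggests.

```python
"""Back-of-the-envelope sizing for quantization and LoRA (illustrative only)."""

FP32_BITS = 32


def weight_memory_gib(n_params: int, bits_per_weight: float) -> float:
    """Memory needed for the weights alone, in GiB (ignores activations and KV cache)."""
    return n_params * bits_per_weight / 8 / 2**30


def savings_vs_fp32(bits_per_weight: float) -> float:
    """Fraction of weight memory saved relative to FP32 storage."""
    return 1.0 - bits_per_weight / FP32_BITS


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two LoRA factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


if __name__ == "__main__":
    n_params = 7_000_000_000  # hypothetical model size, not taken from the note
    for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("4-bit", 4)]:
        gib = weight_memory_gib(n_params, bits)
        print(f"{name:>5}: {gib:6.2f} GiB weights, {savings_vs_fp32(bits):.1%} saved vs FP32")

    d_out = d_in = 4096  # hypothetical projection-matrix shape
    rank = 8             # hypothetical LoRA rank
    full = d_out * d_in
    lora = lora_trainable_params(d_out, d_in, rank)
    print(f"LoRA trains {lora:,} of {full:,} parameters ({lora / full:.2%}) for this matrix")
```

Relative to FP32, INT8 saves 75% of weight memory and 4-bit saves 87.5%, consistent with the "more than 70%" figure. Measured against an FP16 baseline, which is how many models are distributed, the same formats save 50% and 75%.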

---

## 🔗 Knowledge Links (Graph)
- **Parent:** [[10_Wiki/Topics]]
- **Related:** [[Transformer_Architecture_and_LLM_Foundations]], [[데이터_사이언스_및_ML_엔지니어링|Neural_Networks_and_Deep_Learning_Foundations]], [[LLM_Ops_and_Tuning]]
- **Redirects:** [[LLM_Optimization_and_Deployment_Strategies|Quantization]], [[PEFT]], [[LLM_Optimization_and_Deployment_Strategies|vLLM]], [[Model_Compression]], [[Ollama]]

---
*Last updated: 2026-05-08*

## 🤖 LLM Usage Hints (How to Use This Knowledge)
**When to use this knowledge:**
- *(TODO)*

**When not to use it:**
- *(TODO)*

## 🧪 Validation Status (Validation)
- **Information status:** needs_review
- **Source trust level:** A
- **Review reason:** *(P-Reinforce Phase 1 automatic normalization. Body text needs verification.)*

## 🧬 Duplicate Check (Duplicate Check)
- **Existing similar documents:** *(TODO: see the indexer cluster report)*
- **Action:** UPDATE (automatic normalization)
- **Reason:** Phase 1 normalization: old template updated and missing fields filled in.

## ⚠️ Contradictions & Updates (Contradictions & Updates)
- **Conflicts with past data:** None
- **Policy changes:** None

## 🕓 Changelog (Changelog)
| Date | Change | Action | Trust level |
|------|--------|--------|-------------|
| 2026-05-08 | P-Reinforce Phase 1 normalization (frontmatter + header standardization) | UPDATE | A |

## 💻 Code Patterns (Code Patterns)
**Pattern 1:** *(TODO: structural skeleton reflecting this project's conventions)*
```text
# TODO
```

## 🤔 Decision Criteria (Decision Criteria)
**When to choose option A:**
- *(TODO)*

**When to choose option B:**
- *(TODO)*

**Default:**
> *(TODO)*

## ❌ Anti-Patterns (Anti-Patterns)
- **[Anti-pattern]:** *(TODO: what not to do + why + what to do instead)*