---
id: AI-INF-OPT-001
category: "10_Wiki/💡 Topics/AI"
confidence_score: 1.0
tags: [ai, deep-learning, inference, optimization, quantization, model-serving]
last_reinforced: 2026-04-26
---

# Inference Optimization

## 📌 One-Line Insight (The Karpathy Summary)
> "Preserve the model's intelligence, but cut execution cost and latency as far as they will go, so the model can actually be deployed." Inference optimization is the set of techniques that tune a trained model's structure and computation so that it runs faster and lighter in a real service environment.

## 📖 Structured Knowledge (Synthesized Content)
- **Extracted pattern:** "Lightweight Intelligence". An efficiency-maximizing pattern: remove the low-importance parts of a model's parameters, or lower their precision, so that the model uses fewer hardware resources while delivering similar performance.
- **Key optimization techniques:**
  - **Quantization:** Convert FP32 weights to INT8 or a similar low-precision format to reduce memory usage and speed up computation.
  - **Pruning:** Remove neurons or connections (weights) that have little effect on performance to make the model lighter.
  - **Knowledge Distillation:** Transfer the knowledge of a large model (Teacher) to a small model (Student).
  - **Operator Fusion:** Merge several operations into one to reduce the number of memory accesses.
  - **Caching:** Reuse the results of repeated computation, such as the KV cache in transformers.
- **Significance:** The key driver that lets AI models move beyond the lab into mobile devices and large-scale services that need real-time responses.

## ⚠️ Contradictions and Updates (Contradictions & RL Update)
- **Conflict with past data:** The belief that a bigger model is always better has given way; "cost-efficient intelligence", meaning the best performance achievable within a given resource budget, has become the industry standard.
- **Policy change:** When running the local brain, the Antigravity project dynamically quantizes the model to 4-bit or 8-bit depending on available VRAM, ensuring ultra-low-latency responses even on low-spec devices.

## 🔗 Knowledge Links (Graph)
- [[Hardware-Acceleration-for-AI]], GPU-Architecture-for-AI, System-Design-for-AI-Scale, [[LLM]]
- **Raw Source:** 10_Wiki/Topics/AI/Inference-Optimization.md
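
The quantization technique and the VRAM-based 4-bit/8-bit policy described earlier in this note can be made concrete with a small sketch. It shows plain symmetric per-tensor quantization and a simple rule for picking a bit width from free VRAM. The function names (`quantize_symmetric`, `dequantize`, `choose_bits`), the 20% headroom factor, and the weights-only memory estimate are all assumptions made for illustration; this is not the Antigravity project's implementation, and production 4-bit schemes usually rely on group-wise scales rather than one scale per tensor.

```python
# Illustrative sketch only: names, the headroom factor, and the
# weights-only memory estimate are hypothetical, not a project API.

def quantize_symmetric(weights, bits=8):
    """Map floats to signed ints in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1            # 127 for 8-bit, 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer step, then clamp to the representable range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integers and the scale."""
    return [v * scale for v in q]


def choose_bits(n_params, free_vram_bytes, headroom=1.2):
    """Prefer 8-bit if the weights plus headroom fit in free VRAM, else 4-bit.

    Returns None when even the 4-bit weights do not fit.
    """
    for bits in (8, 4):
        needed = n_params * bits / 8 * headroom   # bytes for weights only
        if needed <= free_vram_bytes:
            return bits
    return None


weights = [0.5, -1.0, 0.25, 0.0]
q8, s8 = quantize_symmetric(weights, bits=8)
restored = dequantize(q8, s8)   # each value is within half a step of the original

GIB = 1024 ** 3
print(choose_bits(7_000_000_000, 16 * GIB))  # 8
print(choose_bits(7_000_000_000, 6 * GIB))   # 4
```

The per-value rounding error is bounded by half of `scale`, so dropping from 8 to 4 bits makes each step roughly 18 times coarser (127 levels versus 7 on each side of zero); that is the accuracy cost paid for fitting the model into less VRAM.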