---
id: AI-OPT-QUAN-001
category: "10_Wiki/💡 Topics/AI"
confidence_score: 1.0
tags: [ai, [[Deep-Learning|Deep-Learning]], [[Quantization|Quantization]], [[Model-Compression|Model-Compression]], int8, fp16, [[Optimization|Optimization]], inference-speedup]
last_reinforced: 2026-04-26
---

# Quantization Foundations

## 📌 One-Line Insight (The Karpathy Summary)

> "Give up the luxury of precise floating point (FP32) and choose the efficiency of coarse integers (INT8): compress intelligence down to the bit and push execution speed to the limit." Quantization is an optimization technique that shrinks a model and speeds up inference by representing a neural network's weights and activation values at lower bit precision.

## 📖 Structured Knowledge (Synthesized Content)

- **Extracted pattern:** "Precision-Throughput Tradeoff and Range Mapping". When mapping 32-bit floating-point data to 8-bit integers (or similar), compute a scale and a zero-point that minimize information loss, and make full use of the hardware's integer accelerators (Tensor Cores and the like).
- **Key techniques:**
  - **PTQ (Post-Training Quantization):** Quantizes an already-trained model right away using a simple calibration pass. Convenient.
  - **QAT (Quantization Aware Training):** Accounts for quantization error during training itself. Preserves high accuracy.
  - **Weight-only vs Full Quantization:** Whether to reduce the precision of the weights only, or of the entire computation (weights and activations).
- **Significance:** A "magic diet" that lets LLMs weighing hundreds of gigabytes fit into the memory of an ordinary PC or mobile device, and an essential requirement for edge computing.

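
The scale and zero-point range mapping described under the extracted pattern can be sketched in a few lines. This is a minimal illustration, not any particular library's implementation. It assumes asymmetric (affine) quantization to signed INT8, with the range taken from the observed min/max of the data as a very simple form of calibration. The names `QParams`, `compute_qparams`, `quantize`, and `dequantize` are hypothetical. The mapping is `q = clamp(round(x / scale) + zero_point, qmin, qmax)`, and the approximate inverse is `x ≈ (q - zero_point) * scale`.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class QParams:
    scale: float     # real-valued step between adjacent integer levels
    zero_point: int  # integer level that represents the real value 0.0


def compute_qparams(values: Sequence[float],
                    qmin: int = -128, qmax: int = 127) -> QParams:
    """Derive affine quantization parameters from the observed min/max.

    Assumption: min/max calibration over signed INT8. Other calibration
    strategies (percentiles, per-channel ranges) are not shown here.
    """
    # Widen the range to include 0.0 so that zero is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    if hi == lo:
        # All-zero input: any positive scale works.
        return QParams(scale=1.0, zero_point=0)
    scale = (hi - lo) / (qmax - qmin)
    zero_point = round(qmin - lo / scale)
    zero_point = max(qmin, min(qmax, zero_point))
    return QParams(scale=scale, zero_point=zero_point)


def quantize(values: Sequence[float], p: QParams,
             qmin: int = -128, qmax: int = 127) -> List[int]:
    """Map real values to integer levels, clamping to [qmin, qmax]."""
    return [max(qmin, min(qmax, round(v / p.scale) + p.zero_point))
            for v in values]


def dequantize(levels: Sequence[int], p: QParams) -> List[float]:
    """Map integer levels back to approximate real values."""
    return [(q - p.zero_point) * p.scale for q in levels]


if __name__ == "__main__":
    weights = [-1.2, -0.3, 0.0, 0.45, 2.1]  # made-up example values
    params = compute_qparams(weights)
    q = quantize(weights, params)
    restored = dequantize(q, params)
    print(params)
    print(q)
    print(restored)
```

For values inside the calibrated range, the round-trip error stays within about one quantization step (`scale`), which is the "information loss" the pattern tries to minimize. Values outside the range are clamped, which is why the choice of calibration data matters for PTQ.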
## ⚠️ Contradictions & RL Update

- **Conflict with earlier data:** Contrary to the early concern that cutting bits would sharply degrade intelligence, modern 4-bit (NF4) and 8-bit quantization techniques hold the performance drop to roughly 1-2% relative to the 32-bit original, which has established their practicality.
- **Policy change:** To enable on-device deployment of agents and to cut inference costs, the Antigravity project applies INT8 or FP16 quantization by default to all of its main models.

## 🔗 Knowledge Links (Graph)

- [[Pruning-Techniques|Pruning-Techniques]], Model-Compression-and-Deployment, [[NVIDIA-CUDA-and-AI|NVIDIA-CUDA-and-AI]], [[Optimization-in-AI|Optimization-in-AI]]
- **Raw Source:** 10_Wiki/Topics/AI/Quantization-Foundations.md