---
id: [[P-Reinforce|P-Reinforce]]-AUTO-MCOQ-001
category: Unified
confidence_score: 1.00
tags: [auto-reinforced, quantization, compression, fp8, int4, awq, gptq, gguf]
last_reinforced: 2026-05-04
---

# [[Model Compression & Quantization|Model Compression & Quantization]]

## 📌 One-Line Insight (The Karpathy Summary)

> "Intelligence, highly concentrated: a high-end optimization technique that lowers the numerical precision of a model's weights (FP16 -> INT4), sharply reducing memory usage and improving compute speed while keeping the loss in model quality to a minimum."

## 📖 Structured Knowledge (Synthesized Content)

A core technique for shrinking a model so that large models can run on commodity hardware or serve inference more efficiently.

1. **Quantization**:
    * **Definition**: The process of reducing the number of bits used to represent each weight. (Example: 16-bit floating point $\rightarrow$ 4-bit integer)
    * **Effect**: Weight memory falls by roughly 4x when going from 16 to 4 bits per weight (slightly less once per-group scales are stored), which lets a larger model fit on a smaller GPU.

2. **Main Precision Formats**:
    * **FP8**: Supported in hardware on recent GPUs such as the H100 and B200, and offers a good balance between speed and accuracy.
    * **INT4/INT8**: The traditional integer quantization formats, also widely used on mobile and edge devices.
    * **NF4 (NormalFloat 4)**: A special 4-bit format used in QLoRA whose quantization levels are tuned to the roughly normal distribution of trained weights.

3. **Representative Algorithms & Formats**:
    * **AWQ / GPTQ**: Data-aware quantization methods that use a small calibration set to decide how to quantize the weights, preserving accuracy while keeping inference fast.
    * **GGUF / EXL2**: File formats widely used to run LLMs locally. GGUF is the llama.cpp format and targets CPU and mixed CPU/GPU inference; EXL2 is the ExLlamaV2 format and targets local GPU inference.

## ⚖️ Trade-offs & Caveats

* **Accuracy Degradation (Precision Loss)**: Cutting the bit width too aggressively can weaken the model's reasoning ability or increase hallucinations. (This is especially noticeable at 3 bits and below.)
* **Hardware Compatibility**: Newer formats such as FP8 may bring little or no speedup, or may not run at all, on older GPUs (RTX 30 series and earlier).

## 🔗 Knowledge Links (Graph)

* **Parent Concept**: [[LLM Inference Optimization|LLM Inference Optimization]]
* **Related Techniques**: [[PEFT & LoRA|PEFT & LoRA]] (QLoRA), [[Deployment Frameworks|Deployment Frameworks]]
* **Key Tools**: bitsandbytes, AutoAWQ, llama.cpp

---
*Last updated: 2026-05-04*
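
**Worked sketch: 4-bit quantization and the memory estimate**

To make the definition of quantization and the "roughly 4x" memory claim concrete, here is a minimal sketch in plain Python. It uses simple symmetric absmax round-to-nearest quantization on one group of weights. It is not AWQ, GPTQ, or NF4, and it does not reflect how bitsandbytes, AutoAWQ, or llama.cpp implement quantization internally. The function names, the group size of 128, the 16-bit scale per group, and the 7-billion-parameter count are all illustrative assumptions, not values taken from this note.

```python
import random


def quantize_group(weights, bits=4):
    """Symmetric absmax quantization of one group of floats (hypothetical helper).

    Maps each weight to a signed integer in [-qmax, qmax] and returns the
    integers together with the single floating-point scale for the group.
    """
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit symmetric quantization
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize_group(q, scale):
    """Recover approximate float weights from integers and the group scale."""
    return [v * scale for v in q]


def weight_memory_gib(n_params, bits, group_size=None, scale_bits=16):
    """Estimate weight storage in GiB (hypothetical helper).

    If group_size is given, one scale of scale_bits is stored per group,
    which is the overhead that keeps the saving slightly below 4x.
    """
    total_bits = n_params * bits
    if group_size:
        total_bits += (n_params / group_size) * scale_bits
    return total_bits / 8 / 2 ** 30


# Assumption: trained weights look roughly normal with a small spread.
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(128)]

q, scale = quantize_group(weights, bits=4)
restored = dequantize_group(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

# Assumption: a 7-billion-parameter model, used only as an illustration.
fp16_gib = weight_memory_gib(7e9, bits=16)
int4_gib = weight_memory_gib(7e9, bits=4, group_size=128)

print(f"scale={scale:.6f}  max round-trip error={max_err:.6f}")
print(f"FP16: {fp16_gib:.2f} GiB  INT4 (g=128): {int4_gib:.2f} GiB  "
      f"ratio: {fp16_gib / int4_gib:.2f}x")
```

Under these assumptions each weight costs 4 + 16/128 = 4.125 bits, so the saving over FP16 is 16 / 4.125, a little under 4x. The round-trip error of any single weight is bounded by half the group scale, which is why a few outlier weights in a group (a large absmax) hurt accuracy for every other weight in that group.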