---
id: wiki-2026-0508-ultra-efficiency
title: Ultra Efficiency
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Model Compression, Efficient Inference, BitNet, Sparse Models]
duplicate_of: none
source_trust_level: A
confidence_score: 0.88
verification_status: applied
tags: [efficiency, quantization, compression, sparse, distillation]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: bitsandbytes-llama_cpp-mlx
---

# Ultra Efficiency

## One-line summary

> **"A spectrum of techniques that cut compute, memory, and energy to the extreme while preserving model quality."** Stacking 1.58-bit BitNet b1.58 (Microsoft, 2024), GPTQ/AWQ quantization, MoE sparse activation, structured pruning, and distillation makes LLM inference feasible on a phone. Mainstream as of 2026.

## Core concepts

### Four pillars

1. **Quantization**: FP16 → INT8 → INT4 → 1.58-bit (BitNet b1.58).
2. **Sparsity**: structured (NVIDIA 2:4), unstructured pruning, MoE.
3. **Distillation**: teacher → student (Phi-3, Llama 3.2, Gemma 2).
4. **Architecture**: linear-time sequence models (Mamba, RWKV), Mixture of Experts.

### Quantization spectrum

| Format | Size (vs FP16) | Quality loss (approx.) | Tool |
|---|---|---|---|
| FP16 | 1x | baseline | - |
| INT8 | 0.5x | ~1% | bitsandbytes |
| INT4 (GPTQ/AWQ) | 0.25x | 2-3% | AutoGPTQ, AutoAWQ |
| INT4 (NF4 + double quant, QLoRA) | 0.25x | <2% | bitsandbytes |
| 1.58-bit (BitNet) | ~0.1x | ~equivalent at scale | bitnet.cpp |
| 1-2 bit (HQQ, AQLM) | ~0.06-0.13x | 3-5% | research |

### Applications

1. On-device LLMs (iPhone Neural Engine, Snapdragon NPU).
2. Cloud inference cost reduction (vLLM + AWQ: weights ~4x smaller than FP16, which leaves memory for larger batches).
3. Edge inference (Llama 3.2 1B / 3B on a Raspberry Pi).
4. Long context: decoding becomes memory-bound, so quantize the KV cache.
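
The size ratios in the quantization spectrum and the long-context item above come down to simple arithmetic. The sketch below is a rough estimate only: it ignores quantization scales, zero-points, activations, and runtime overhead. The helper names `weight_gib` and `kv_cache_gib` are hypothetical, and the 32-layer / 8-KV-head / 128-dim shape is an assumed Llama-3.1-8B-style configuration, not something taken from this note.

```python
def weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB (ignores scales and zero-points)."""
    return n_params * bits_per_weight / 8 / 2**30


def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 seq_len: int, bytes_per_elem: int, batch: int = 1) -> float:
    """Approximate KV cache size in GiB for one full-length sequence per batch item."""
    # Factor 2: one K tensor and one V tensor per layer.
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch
    return total_bytes / 2**30


if __name__ == "__main__":
    # Weight footprint of an 8B-parameter model at each bit width.
    for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4), ("1.58-bit", 1.58)]:
        print(f"{name:>8}: {weight_gib(8e9, bits):6.2f} GiB")

    # KV cache at 128K tokens (assumed shape: 32 layers, 8 KV heads, head_dim 128).
    for name, nbytes in [("FP16", 2), ("FP8", 1)]:
        print(f"KV {name}: {kv_cache_gib(32, 8, 128, 128_000, nbytes):6.3f} GiB")
```

Under these assumptions the FP16 KV cache at 128K tokens (about 15.6 GiB) is larger than the INT4 weights of the same model (about 3.7 GiB), which is why quantizing only the weights leaves long-context serving memory-bound.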

## 💻 Patterns

### Pattern 1: AWQ quantization (for vLLM serving)

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-3.1-70B-Instruct"
quant_path = "llama-3.1-70b-awq"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Calibration runs here; AutoAWQ falls back to its default calibration set
# unless calib_data is passed explicitly.
model.quantize(tokenizer, quant_config=quant_config)

# Save weights and tokenizer together so the output directory loads on its own.
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```

### Pattern 2: bitsandbytes 4-bit load (QLoRA-ready)

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in bf16
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_quant_type="nf4",
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=bnb,
    device_map="auto",
)
```

### Pattern 3: llama.cpp GGUF (CPU + Metal)

```bash
# Quantize to Q4_K_M (4-bit, balanced)
./llama-quantize model.gguf model-Q4_K_M.gguf Q4_K_M

# Run with Metal acceleration (-ngl 99 offloads all layers to the GPU)
./llama-cli -m model-Q4_K_M.gguf -p "Hello" -n 256 -ngl 99
```

### Pattern 4: MLX (Apple Silicon, 2026 default for Mac)

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("mlx-community/Llama-3.1-8B-Instruct-4bit")

# Recent mlx_lm releases take sampling settings through a sampler object;
# older releases accepted temp=0.7 directly on generate().
response = generate(
    model,
    tokenizer,
    prompt="Explain entropy",
    max_tokens=256,
    sampler=make_sampler(temp=0.7),
)
```

### Pattern 5: Distillation loop (student-teacher)

```python
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.7):
    # Flatten (batch, seq, vocab) -> (tokens, vocab) so both terms average per token
    vocab = student_logits.size(-1)
    s = student_logits.reshape(-1, vocab)
    t = teacher_logits.reshape(-1, vocab)
    # Soft targets: KL(teacher || student) at temperature T; T**2 keeps the gradient scale comparable
    soft = F.kl_div(
        F.log_softmax(s / T, dim=-1),
        F.softmax(t / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)
    # Hard targets: ordinary cross-entropy against the ground-truth labels
    hard = F.cross_entropy(s, labels.reshape(-1))
    return alpha * soft + (1 - alpha) * hard
```

### Pattern 6: KV cache quantization (long context)

```python
# vLLM with an FP8 KV cache: roughly half the KV memory of FP16,
# so about 2x the context fits on the same GPU
from vllm import LLM

llm = LLM(
    model="llama-3.1-70b-awq",  # AWQ checkpoint from Pattern 1; quantization="awq" needs pre-quantized weights
    quantization="awq",
    kv_cache_dtype="fp8",
    max_model_len=128000,
)
```

### Pattern 7: Structured sparsity (2:4 NVIDIA)

```python
# 2:4 structured sparsity on Ampere / Hopper GPUs: up to ~2x matmul throughput
import torch
from torch import nn
from torch.sparse import to_sparse_semi_structured

layer = nn.Linear(4096, 4096).half().cuda().eval()  # hypothetical layer; must be fp16/bf16/int8 on CUDA
# Magnitude-prune to 2:4 first: keep the 2 largest of every 4 weights along the input dim
groups = layer.weight.detach().abs().reshape(-1, 4)
mask = torch.zeros_like(groups, dtype=torch.bool).scatter_(1, groups.topk(2, dim=1).indices, True)
pruned = layer.weight.detach().masked_fill(~mask.reshape_as(layer.weight), 0)
layer.weight = nn.Parameter(to_sparse_semi_structured(pruned))  # compresses only; does not choose the mask
```

## Decision criteria

| Situation | Technique |
|---|---|
| Cloud GPU, throughput-bound | AWQ INT4 + vLLM |
| Apple Silicon | MLX 4-bit |
| CPU-only / edge | llama.cpp GGUF Q4_K_M |
| Custom finetune on 24GB | QLoRA (NF4 base + LoRA adapters) |
| Phone / NPU | BitNet b1.58 / Llama 3.2 1B |
| Long context 128K+ | KV cache FP8 |

**Defaults**: AWQ INT4 for serving, MLX 4-bit for Mac dev, GGUF Q4_K_M for portable.

## 🔗 Graph

- Parent: [[Model Compression]]
- Variants: [[LLM_Optimization_and_Deployment_Strategies|Quantization]] · [[LLM_Optimization_and_Deployment_Strategies|Distillation]] · [[Mixture of Experts]]
- Applications: [[LLM_Optimization_and_Deployment_Strategies|vLLM]] · [[QLoRA]]
- Adjacent: [[BitNet]] · [[Mamba]]

## 🤖 LLM usage

**When to use**: tight cost / latency / memory budgets, on-device deployment, long context.

**When not to use**: training a new SOTA frontier model (use full precision).

## ❌ Anti-patterns

- **Quantizing without calibration**: skipping the GPTQ/AWQ calibration set leads to a quality cliff.
- **Pure 1-bit on small models**: BitNet's quality parity is a scale effect. Below 1B parameters, expect quality loss.
- **Distilling across mismatched tokenizers**: if the teacher and student vocabularies differ, their logits cannot be aligned token by token.
- **Ignoring the KV cache**: quantizing only the model while the KV cache stays in FP16 leaves long-context inference just as memory-bound.
- **Over-quantizing attention scores**: losing softmax precision produces garbage output.

## 🧪 Verification / duplicates

- Verified against the BitNet b1.58 paper (Microsoft, 2024), the AWQ paper (MIT, 2023), the vLLM docs, and the Apple MLX docs (2024-2026).
- Trust level A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: quantization spectrum + 2026 stack (BitNet, MLX, vLLM AWQ) |