id, title, category, status, canonical_id, aliases, duplicate_of, source_trust_level, confidence_score, verification_status, tags, raw_sources, last_reinforced, github_commit, tech_stack
id
title
category
status
canonical_id
aliases
duplicate_of
source_trust_level
confidence_score
verification_status
tags
raw_sources
last_reinforced
github_commit
tech_stack
wiki-2026-0508-ultra-efficiency
Ultra Efficiency
10_Wiki/Topics
verified
self
Model Compression
Efficient Inference
BitNet
Sparse Models
none
A
0.88
applied
efficiency
quantization
compression
sparse
distillation
2026-05-10
pending
language
framework
python
bitsandbytes-llama_cpp-mlx
Ultra Efficiency
매 한 줄
"매 model이 quality는 유지하며 compute / memory / energy 를 극단적으로 줄이는 spectrum" . 매 1-bit BitNet (Microsoft 2024), GPTQ/AWQ quantization, MoE sparse activation, structured pruning, distillation 의 stack 으로 frontier models 가 phone 에서 inference 가능. 매 2026 mainstream.
매 핵심
매 four pillars
Quantization : FP16 → INT8 → INT4 → 1.58-bit (BitNet b1.58).
Sparsity : structured (2:4 NVIDIA), unstructured pruning, MoE.
Distillation : teacher → student (Phi-3, Llama 3.2, Gemma 2).
Architecture : linear attention (Mamba, RWKV), Mixture of Experts.
매 quantization spectrum
Format
Size
Quality loss
Tool
FP16
1x
baseline
-
INT8
0.5x
~1%
bitsandbytes
INT4 (GPTQ/AWQ)
0.25x
2-3%
autogptq, AWQ
INT4 (NF4 + double quant, QLoRA)
0.25x
<2%
bitsandbytes
1.58-bit (BitNet)
~0.1x
~equiv at scale
bitnet.cpp
1-bit (HQQ, AQLM)
~0.06x
3-5%
research
매 응용
On-device LLM (iPhone Neural Engine, Snapdragon NPU).
Cost reduction in cloud inference (vLLM + AWQ, 4x throughput).
Edge inference (Llama 3.2 1B / 3B on RPi).
Long-context (memory-bound → quantize KV cache).
💻 패턴
Pattern 1: AWQ quantization (vLLM)
Pattern 2: bitsandbytes 4-bit load (QLoRA-ready)
Pattern 3: llama.cpp GGUF (CPU + Metal)
Pattern 4: MLX (Apple Silicon, 2026 default for Mac)
Pattern 5: Distillation loop (student-teacher)
Pattern 6: KV cache quantization (long context)
Pattern 7: Structured sparsity (2:4 NVIDIA)
매 결정 기준
상황
매 technique
Cloud GPU, throughput-bound
AWQ INT4 + vLLM
Apple Silicon
MLX 4-bit
CPU-only / edge
llama.cpp GGUF Q4_K_M
Custom finetune on 24GB
QLoRA NF4 + LoRA
Phone / NPU
BitNet b1.58 / Llama 3.2 1B
Long context 128K+
KV cache FP8
기본값 : AWQ INT4 for serving, MLX 4-bit for Mac dev, GGUF Q4_K_M for portable.
🔗 Graph
🤖 LLM 활용
언제 : cost / latency / memory budget tight. on-device deployment. long context.
언제 X : training new SOTA frontier (use full precision).
❌ 안티패턴
Quantize without calibration : GPTQ/AWQ 의 calibration set 누락 → quality cliff.
Pure 1-bit on small models : BitNet은 scale 효과. <1B 에선 quality loss.
Distill across architecture mismatch : tokenizer 다르면 logit alignment 불가.
Ignoring KV cache : model 만 quantize, KV cache fp16 → memory bound 그대로.
Over-quantize attention scores : softmax precision 손실 → garbage output.
🧪 검증 / 중복
Verified (BitNet b1.58 paper Microsoft 2024, AWQ MIT 2023, vLLM docs, MLX docs Apple 2024-2026).
신뢰도 A.
🕓 Changelog
날짜
변경
2026-05-08
Phase 1
2026-05-10
Manual cleanup — quantization spectrum + 2026 stack (BitNet, MLX, vLLM AWQ)