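---
id: wiki-2026-0508-mobile-ai-optimization
title: Mobile AI Optimization
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [On-device AI, Mobile Inference, Edge ML, Mobile AI 최적화]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [mobile, edge, inference, coreml, tflite, mlx, npu, quantization, distillation]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: { language: swift-kotlin-python, framework: coreml-tflite-mlx }
---

# Mobile AI Optimization

## One-line summary

> **"Every watt, every megabyte, every millisecond."** Mobile AI optimization is the set of techniques that shrink model size, memory, power, and latency at the same time so that the NPU/GPU/CPU can run a model in real time. Its core axes are quantization, distillation, NPU compilation, and runtime caching.

## Key points

### Target hardware

- **Apple Neural Engine (ANE)**: A12+/M-series, INT8/FP16, accessed through Core ML.
- **Apple Silicon GPU + MLX**: M1+ unified memory, well suited to FP16/BF16 LLMs.
- **Qualcomm Hexagon NPU (SNPE/QNN)**: Snapdragon 8 Gen3+, INT4/INT8 acceleration.
- **Google Tensor TPU / NNAPI**: Pixel series, edge TPU.
- **MediaTek APU**, **Samsung Exynos NPU**: each with its own SDK.

### Optimization axes

1. **Model compression**: pruning, quantization (FP16/INT8/INT4), low-rank factorization.
2. **Knowledge distillation**: large teacher → small student.
3. **Efficient architectures (including architecture search)**: MobileNet, EfficientNet, MobileViT, Phi-3-mini, Gemini Nano.
4. **Compilers and runtimes**: Core ML, TFLite, ONNX Runtime Mobile, ExecuTorch, MLX, MLC-LLM.
5. **Runtime techniques**: graph fusion, weight sharing, KV cache, speculative decoding (LLM).
6. **System**: avoid thermal throttling, limit background-mode work, off-load to the ANE/NPU.

### LLM-on-device trends

- 1B to 7B models with 4-bit quantization run on iPhone and Android.
- Examples: Apple Foundation Models (~3B), Gemini Nano, Phi-3-mini, Llama-3.2-1B/3B.
- The KV cache dominates memory; mitigate it with paged or sliding-window caches.

## 💻 Patterns

### 1. PyTorch → CoreML

```python
import coremltools as ct
import coremltools.optimize.coreml as cto
import torch
import torchvision

# Stand-in network; substitute your own trained nn.Module.
model = torchvision.models.mobilenet_v2(weights=None).eval()
example = torch.randn(1, 3, 224, 224)
ts = torch.jit.trace(model, example)

mlmodel = ct.convert(
    ts,
    inputs=[ct.ImageType(shape=example.shape, scale=1 / 255.0)],
    compute_units=ct.ComputeUnit.CPU_AND_NE,
    minimum_deployment_target=ct.target.iOS17,  # yields an ML Program (.mlpackage)
)

# ML Programs are quantized via coremltools.optimize; the older
# neural_network.quantization_utils only handles NeuralNetwork models.
config = cto.OptimizationConfig(
    global_config=cto.OpLinearQuantizerConfig(mode="linear_symmetric")
)
mlmodel = cto.linear_quantize_weights(mlmodel, config=config)  # 8-bit weights
mlmodel.save("Model.mlpackage")
```

### 2. TFLite INT8 PTQ

```python
import numpy as np
import tensorflow as tf

def representative_dataset():
    # Random data only keeps the snippet self-contained;
    # calibrate with ~100 real, preprocessed samples.
    for _ in range(100):
        yield [np.random.randn(1, 224, 224, 3).astype(np.float32)]

conv = tf.lite.TFLiteConverter.from_saved_model("model")
conv.optimizations = [tf.lite.Optimize.DEFAULT]
conv.representative_dataset = representative_dataset
conv.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]  # full-integer kernels only
conv.inference_input_type = tf.int8
conv.inference_output_type = tf.int8

with open("model_int8.tflite", "wb") as f:
    f.write(conv.convert())
```

### 3. MLX LLM (Apple Silicon)

```python
from mlx_lm import load, generate

# Downloads a 4-bit community checkpoint from the Hugging Face Hub on first use.
model, tok = load("mlx-community/Llama-3.2-3B-Instruct-4bit")
print(generate(model, tok, prompt="Hello", max_tokens=128))
```

### 4. Knowledge distillation

```python
import torch.nn.functional as F

# s: student logits, t: teacher logits, y: hard labels
T = 4.0  # softmax temperature
loss_kd = (T * T) * F.kl_div(F.log_softmax(s / T, -1), F.softmax(t.detach() / T, -1), reduction="batchmean")
loss = 0.7 * loss_kd + 0.3 * F.cross_entropy(s, y)  # soft-target term + hard-label term
```

### 5. Structured pruning

```python
import torch
import torch.nn.utils.prune as P

layer = torch.nn.Conv2d(64, 128, kernel_size=3)  # stand-in layer
# Zero the 30% of output channels (dim=0) with the smallest L2 norm.
P.ln_structured(layer, name="weight", amount=0.3, n=2, dim=0)
P.remove(layer, "weight")  # bake the mask into the weight tensor
# The tensor keeps its shape; zeroed channels must still be removed physically to gain speed.
```

### 6. ONNX Runtime Mobile (Android)

```kotlin
import ai.onnxruntime.*

val env = OrtEnvironment.getEnvironment()
val opts = OrtSession.SessionOptions().apply { addNnapi() } // unsupported ops fall back to CPU
val session = env.createSession(modelBytes, opts)           // modelBytes: ByteArray of the model file
// "input" must match the model's input name; tensors and results hold native memory, so close them.
val out = OnnxTensor.createTensor(env, input).use { t ->
    session.run(mapOf("input" to t)).use { it[0].value }
}
```

### 7. KV cache + sliding window for LLM

```python
# Assumes a Hugging Face transformers causal LM `model` and tokenized `input_ids`.
# Reusing past_key_values means each step processes only the new token instead of
# the whole prefix (attention over the cached K/V still grows with sequence length).
# Bounding the cache with a sliding window is model/runtime specific and not shown here.
past = None
for _ in range(max_new_tokens):
    out = model(input_ids, past_key_values=past, use_cache=True)
    past = out.past_key_values
    input_ids = out.logits[:, -1:].argmax(-1)  # greedy pick; feed only the new token
```

### 8. Thermal-aware scheduling

```swift
import CoreML

let thermal = ProcessInfo.processInfo.thermalState
// Policy from this note: use the ANE only while the thermal state is nominal.
let units: MLComputeUnits = (thermal == .nominal) ? .cpuAndNeuralEngine : .cpuOnly
```

### 9. Asset slimming

- Weight compression (zstd) with decompression at launch, mmap, partial loading.
- Multi-target export (small/medium/large) with adaptive download.

### 10. Profiling

```bash
xcrun coremlcompiler --target iphoneos17.0 ...
# Xcode Instruments → Core ML / Neural Engine / GPU / Energy
```

## Decision criteria

| Goal | Preferred techniques |
|---|---|
| Latency ↓ | NPU off-load + INT8 + graph fusion |
| Model size ↓ | INT4 quant + pruning + distillation |
| Power ↓ | NPU only + lower clock frequency + thermal-aware scheduling |
| Accuracy preservation | QAT over PTQ, distillation |
| Fast iteration | TFLite/Core ML PTQ, then measure immediately |
| LLM-on-device | 4-bit + KV cache + small context |

**Default**: PTQ INT8 + NPU compilation; switch to QAT if the accuracy drop exceeds 1%; for 6B+ LLMs, use 4-bit + MLX/MLC.

## 🔗 Graph

- Parent: [[Edge-AI]], [[LLM_Optimization_and_Deployment_Strategies|Model-Compression]]
- Variants: [[LLM_Optimization_and_Deployment_Strategies|Quantization]], [[LLM_Optimization_and_Deployment_Strategies|Knowledge-Distillation]], [[NAS]]
- Adjacent: [[ONNX-Runtime]]

## 🤖 LLM usage

**When to use**: pipeline boilerplate, quantization config suggestions, model size estimates.
**When not to use**: on-device latency and power numbers (these must be measured on real hardware).

## ❌ Anti-patterns

- Deploying after INT8 PTQ having checked only accuracy, with no latency or memory measurement.
- Including an op the NPU does not support, which forces a CPU fallback on every step; once the graph is fragmented, the NPU advantage disappears.
- Running a 32k LLM context on-device: the KV cache exhausts RAM.
- Heavy inference in the background: the OS kills the process or the device thermal-throttles.
- Testing on a single device: Snapdragon, Tensor, and Apple silicon all give different results.

## 🧪 Verification / duplicates

- Verified. Trust level A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup |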
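
## 🧮 KV cache sizing sketch

The note says the KV cache dominates on-device LLM memory and lists a 32k context as an anti-pattern. The sketch below makes that claim concrete with simple arithmetic. The function name `kv_cache_bytes` and the layer/head numbers are illustrative assumptions (a hypothetical 3B-class model with grouped-query attention and an FP16 cache), not values taken from this note or from any model card.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Approximate KV-cache size in bytes for a single sequence.

    The leading factor of 2 accounts for storing both keys and values;
    bytes_per_elem=2 corresponds to an FP16/BF16 cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


# Hypothetical 3B-class configuration (assumed for illustration only).
cfg = dict(n_layers=28, n_kv_heads=8, head_dim=128)

for n_ctx in (4096, 32768):
    gib = kv_cache_bytes(n_ctx=n_ctx, **cfg) / 2**30
    print(f"ctx={n_ctx}: {gib:.2f} GiB")
```

Under these assumptions the cache takes 448 MiB at a 4k context and 3.5 GiB at 32k, which already rivals or exceeds the size of 4-bit weights for a model of this class. The real figure depends on the model's actual layer count, KV-head count, and cache precision, so treat this as a budgeting aid and confirm on the device.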