---
id: wiki-2026-0508-mobile-ai-optimization
title: Mobile AI Optimization
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [On-device AI, Mobile Inference, Edge ML, Mobile AI 최적화]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [mobile, edge, inference, coreml, tflite, mlx, npu, quantization, distillation]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: { language: swift-kotlin-python, framework: coreml-tflite-mlx }
---

# Mobile AI Optimization

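## One-liner
> **"Every watt, every megabyte, every millisecond."** Mobile AI optimization is the set of techniques that shrink model size, memory footprint, power draw, and latency together so that an NPU, GPU, or CPU can run the model in real time. The core axes are quantization, distillation, NPU compilation, and runtime caching.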

## Core concepts
### Target hardware
- **Apple Neural Engine (ANE)**: A12+/M-series, INT8/FP16, accessed through CoreML.
- **Apple Silicon GPU + MLX**: M1+ with unified memory, FP16/BF16, well suited to LLMs.
- **Qualcomm Hexagon NPU (SNPE/QNN)**: Snapdragon 8 Gen3+, INT4/INT8 acceleration.
- **Google Tensor TPU / NNAPI**: Pixel series, edge TPU.
- **MediaTek APU**, **Samsung Exynos NPU**: each ships its own SDK.

### Optimization axes
1. **Model compression**: pruning, quantization (FP16/INT8/INT4), low-rank factorization.
2. **Knowledge distillation**: a large teacher trains a small student.
3. **Architecture search and efficient architectures**: MobileNet, EfficientNet, MobileViT, Phi-3-mini, Gemini Nano.
4. **Compilers and converters**: CoreML, TFLite, ONNX Runtime Mobile, ExecuTorch, MLX, MLC-LLM.
5. **Runtime**: graph fusion, weight sharing, KV cache, speculative decoding (LLM).
6. **System**: avoiding thermal throttling, respecting background-mode limits, offloading to the ANE/NPU.
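
To make the quantization axis concrete, here is a minimal pure-Python sketch of symmetric per-tensor INT8 quantization. The function names and sample weights are illustrative assumptions; real toolchains (CoreML, TFLite) typically quantize per channel and calibrate activations as well.

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8: map [-max|w|, +max|w|] onto [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the INT8 range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the integer codes."""
    return [v * scale for v in q]


weights = [0.50, -1.27, 0.031, 0.0, 1.0]  # hypothetical FP32 weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_err)
```

Each weight is stored in one byte instead of four, and the rounding error per weight is bounded by half a quantization step (`scale / 2`).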

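### LLM-on-device trends
- 1B–7B models with 4-bit quantization run on iPhone and Android devices.
- Examples: Apple Foundation Models (~3B), Gemini Nano, Phi-3-mini, Llama-3.2-1B/3B.
- The KV cache dominates memory at long context lengths; paged and sliding-window caches are the usual mitigations.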
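
A rough back-of-the-envelope estimate shows why the KV cache dominates. The sketch below assumes an FP16 cache, batch size 1, and a hypothetical 3B-class shape (28 layers, 8 KV heads, head dimension 128); these numbers are illustrative and not taken from any model card.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes for keys + values across all layers, batch size 1."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def weight_bytes(n_params, bits_per_weight):
    """Weight storage, ignoring quantization scales and zero points."""
    return n_params * bits_per_weight // 8


# Hypothetical 3B-class shape (assumption for illustration only).
cfg = dict(n_layers=28, n_kv_heads=8, head_dim=128)
GIB = 1024 ** 3

for ctx in (2048, 8192, 32768):
    print(ctx, round(kv_cache_bytes(seq_len=ctx, **cfg) / GIB, 3), "GiB")
print("4-bit weights:", round(weight_bytes(3_000_000_000, 4) / GIB, 2), "GiB")
```

Under these assumptions the cache costs 112 KiB per token, so a 32k context needs 3.5 GiB, more than twice the roughly 1.4 GiB taken by the 4-bit weights.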

## 💻 Patterns
### 1. PyTorch → CoreML
```python
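import coremltools as ct
import coremltools.optimize.coreml as cto
import torch

# `model` is your trained torch.nn.Module (not defined here).
model.eval()
example = torch.randn(1, 3, 224, 224)
ts = torch.jit.trace(model, example)
mlmodel = ct.convert(
    ts,
    inputs=[ct.ImageType(shape=example.shape, scale=1 / 255.0)],
    compute_units=ct.ComputeUnit.CPU_AND_NE,
    minimum_deployment_target=ct.target.iOS17,
)
# An iOS17 target yields an ML Program, which the legacy neural_network
# quantize_weights utility rejects; use the optimize API (coremltools 7+).
config = cto.OptimizationConfig(
    global_config=cto.OpLinearQuantizerConfig(mode="linear_symmetric")
)
mlmodel = cto.linear_quantize_weights(mlmodel, config=config)  # 8-bit weights
mlmodel.save("Model.mlpackage")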
```

### 2. TFLite INT8 PTQ
```python
import numpy as np
import tensorflow as tf

def representative_dataset():
    # Random data keeps the sketch short; calibrate with real preprocessed samples.
    for _ in range(100):
        yield [np.random.randn(1, 224, 224, 3).astype(np.float32)]

conv = tf.lite.TFLiteConverter.from_saved_model("model")
conv.optimizations = [tf.lite.Optimize.DEFAULT]
conv.representative_dataset = representative_dataset
conv.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
conv.inference_input_type = tf.int8
conv.inference_output_type = tf.int8
with open("model_int8.tflite", "wb") as f:
    f.write(conv.convert())
```

### 3. MLX LLM (Apple Silicon)
```python
from mlx_lm import load, generate

# Loads a 4-bit checkpoint from a local path or the Hugging Face Hub.
model, tok = load("mlx-community/Llama-3.2-3B-Instruct-4bit")
print(generate(model, tok, prompt="Hello", max_tokens=128))
```

### 4. Knowledge distillation
```python
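import torch.nn.functional as F

T = 4.0  # temperature: softens both distributions
# s = student logits, t = teacher logits (computed under torch.no_grad()), y = labels
loss_kd = (T*T) * F.kl_div(F.log_softmax(s/T,-1), F.softmax(t/T,-1), reduction="batchmean")
loss = 0.7*loss_kd + 0.3*F.cross_entropy(s, y)  # the 0.7/0.3 mix is a tunable choice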
```
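
To see what the distillation term computes, the following pure-Python sketch evaluates the same quantity, T² · KL(teacher ‖ student) over temperature-softened distributions, for a single example. The logits are made-up values and the helper names are assumptions for illustration.

```python
import math


def softmax(logits, T=1.0):
    """Softmax of logits / T, shifted by the max for numerical stability."""
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, T=4.0):
    """T^2 * KL(teacher_T || student_T) for a single example."""
    p = softmax(teacher_logits, T)  # soft targets from the teacher
    q = softmax(student_logits, T)  # student prediction at the same temperature
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return T * T * kl


teacher = [6.0, 2.0, 1.0]  # hypothetical logits
student = [3.0, 2.5, 0.5]

print(softmax(teacher, T=1.0))    # sharp distribution
print(softmax(teacher, T=4.0))    # softened: more mass on the non-argmax classes
print(kd_loss(student, teacher))  # positive while the student disagrees
print(kd_loss(teacher, teacher))  # zero when the logits match
```

Raising T exposes the teacher's relative ranking of the wrong classes, and the T² factor keeps the gradient magnitude of the soft term comparable to the hard cross-entropy term.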

### 5. Structured pruning
```python
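import torch.nn.utils.prune as P
# Zero the 30% of output channels (dim=0) with the smallest L2 norm (n=2).
P.ln_structured(layer, name="weight", amount=0.3, n=2, dim=0)
# Make the mask permanent; the tensor keeps its shape, so size/latency gains
# need the zeroed channels physically removed afterwards.
P.remove(layer, "weight")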
```
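
The ranking step behind L2 structured pruning can be sketched in pure Python. Unlike the mask-based call above, this hypothetical helper drops the low-norm rows outright, which is what actually shrinks the layer; the weight matrix is made up for illustration.

```python
import math


def prune_channels(weight, amount):
    """weight: list of output-channel rows. Returns (kept_rows, kept_indices)."""
    norms = [math.sqrt(sum(w * w for w in row)) for row in weight]
    n_prune = int(round(amount * len(weight)))
    # Indices of the n_prune rows with the smallest L2 norm.
    drop = set(sorted(range(len(weight)), key=lambda i: norms[i])[:n_prune])
    kept = [i for i in range(len(weight)) if i not in drop]
    return [weight[i] for i in kept], kept


weight = [  # hypothetical 4 output channels x 3 inputs
    [0.9, -0.8, 0.7],
    [0.01, 0.02, -0.01],
    [0.5, 0.4, -0.6],
    [0.03, -0.02, 0.0],
]
smaller, kept = prune_channels(weight, amount=0.5)
print(kept, len(smaller))
```

In a real network the next layer's matching input channels must be removed as well, and the model usually needs fine-tuning afterwards to recover accuracy.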

### 6. ONNX Runtime Mobile (Android)
```kotlin
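import ai.onnxruntime.*

val env = OrtEnvironment.getEnvironment()
// Enable the NNAPI execution provider; ops it cannot handle fall back to the CPU.
val opts = OrtSession.SessionOptions().apply { addNnapi() }
val session = env.createSession(modelBytes, opts)  // modelBytes: ByteArray holding the model
// "input" must match the model's input name.
val out = session.run(mapOf("input" to OnnxTensor.createTensor(env, input)))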
```

### 7. KV cache + sliding window for LLM
```python
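# Reuse past_key_values in the generate loop so K/V for the prefix are not recomputed:
# each step feeds only the newest token, though attention still reads every cached position.
out = model(input_ids, past_key_values=past, use_cache=True)
past = out.past_key_values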
```
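
The snippet above covers cache reuse only. For the sliding-window half, here is a minimal sketch of a cache that keeps K/V entries for the most recent `window` tokens. The class name is hypothetical and strings stand in for per-token tensors; whether a given model tolerates dropped history depends on how it was trained.

```python
from collections import deque


class SlidingWindowKVCache:
    """Keeps K/V entries for only the most recent `window` tokens."""

    def __init__(self, window):
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, k, v):
        # deque(maxlen=...) evicts the oldest entry automatically when full.
        self.keys.append(k)
        self.values.append(v)

    def __len__(self):
        return len(self.keys)


cache = SlidingWindowKVCache(window=4)
for pos in range(10):
    cache.append(f"k{pos}", f"v{pos}")  # stand-ins for per-token K/V tensors

print(len(cache), list(cache.keys))
```

Memory stays constant at `window` entries per layer regardless of how many tokens are generated, at the cost of losing attention to anything older than the window.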

### 8. Thermal-aware scheduling
```swift
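import CoreML
// Under thermal pressure fall back to CPU only; assign `units` to MLModelConfiguration.computeUnits.
let thermal = ProcessInfo.processInfo.thermalState
let units: MLComputeUnits = (thermal == .nominal) ? .cpuAndNeuralEngine : .cpuOnly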
```

### 9. Asset slimming
- Compress weights (zstd) and decompress on launch; use mmap and partial loading to avoid reading the whole file.
- Export multiple targets (small/medium/large) and download the variant that fits the device.
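
As a sketch of the mmap and partial-load idea, the following reads a slice of a weights file without loading the rest. The file layout (raw little-endian float32 values) and the `read_slice` helper are assumptions for illustration; note that mmap operates on raw bytes, so a zstd-compressed file has to be decompressed to disk first.

```python
import mmap
import os
import struct
import tempfile

# Write a toy "weights" file: 1,000 little-endian float32 values (0.0, 1.0, 2.0, ...).
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
with open(path, "wb") as f:
    f.write(struct.pack("<1000f", *[float(i) for i in range(1000)]))


def read_slice(path, start, count):
    """Read `count` float32 values starting at element `start` via mmap."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Only the pages backing this byte range need to be faulted in.
            raw = mm[start * 4:(start + count) * 4]
    return list(struct.unpack(f"<{count}f", raw))


print(read_slice(path, 500, 3))
```

The operating system pages in only the touched region, which keeps launch-time memory low when a model's tensors are loaded on demand.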

### 10. Profiling
```bash
xcrun coremlcompiler --target iphoneos17.0 ...
# Xcode Instruments → Core ML / Neural Engine / GPU / Energy
```

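## Decision criteria
| Goal | Preferred techniques |
|---|---|
| Lower latency | NPU offload + INT8 + graph fusion |
| Smaller model | INT4 quantization + pruning + distillation |
| Lower power | NPU-only execution + lower clock frequency + thermal-aware scheduling |
| Preserve accuracy | QAT over PTQ, plus distillation |
| Fast iteration | TFLite/CoreML PTQ, then measure immediately |
| LLM on-device | 4-bit weights + KV cache + short context |

**Default**: PTQ INT8 + NPU compilation; switch to QAT if the accuracy drop exceeds 1%; for LLMs of 6B+ parameters use 4-bit weights with MLX/MLC.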

## 🔗 Graph
- Parent: [[Edge-AI]], [[LLM_Optimization_and_Deployment_Strategies|Model-Compression]]
- Variants: [[LLM_Optimization_and_Deployment_Strategies|Quantization]], [[LLM_Optimization_and_Deployment_Strategies|Knowledge-Distillation]], [[NAS]]
- Adjacent: [[ONNX-Runtime]]

## 🤖 LLM usage
**When to use**: pipeline boilerplate, quantization config suggestions, model size estimates.
**When not to use**: on-device latency and power numbers, which must be measured on real hardware.

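## ❌ Anti-patterns
- Deploying after INT8 PTQ with only an accuracy check, skipping latency and memory measurements.
- Leaving NPU-unsupported ops in the graph, so every step falls back to the CPU; once the graph is partitioned, the NPU gain disappears.
- Running a 32k LLM context on-device: the KV cache exhausts RAM.
- Heavy inference in the background: the OS kills the process or thermal throttling kicks in.
- Testing on a single device: Snapdragon, Tensor, and Apple chips all behave differently.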

## 🧪 Verification / duplicates
- Verified. Trust level A.

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup |