---
id: wiki-2026-0508-mobile-ai-optimization
title: Mobile AI Optimization
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [On-device AI, Mobile Inference, Edge ML, Mobile AI 최적화]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [mobile, edge, inference, coreml, tflite, mlx, npu, quantization, distillation]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: { language: swift-kotlin-python, framework: coreml-tflite-mlx }
---
# Mobile AI Optimization
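## One-line summary
> **"Every watt, every megabyte, every millisecond."** Mobile AI optimization is the set of techniques that shrink model size, memory, power, and latency together so that an NPU, GPU, or CPU can run the model in real time. Its core axes are quantization, distillation, NPU compilation, and runtime caching.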
## Key points
### Target hardware
- **Apple Neural Engine (ANE)**: A12+/M-series, INT8/FP16, accessed through CoreML.
- **Apple Silicon GPU + MLX**: M1+ unified memory, FP16/BF16, well suited to LLMs.
- **Qualcomm Hexagon NPU (SNPE/QNN)**: Snapdragon 8 Gen3+, INT4/INT8 acceleration.
- **Google Tensor TPU / NNAPI**: Pixel series, edge TPU.
- **MediaTek APU**, **Samsung Exynos NPU**: each ships its own SDK.
### Optimization axes
1. **Model compression**: pruning, quantization (FP16/INT8/INT4), low-rank factorization.
2. **Knowledge distillation**: large teacher → small student.
3. **Architecture search**: MobileNet, EfficientNet, MobileViT, Phi-3-mini, Gemma-Nano.
4. **Compilers**: CoreML, TFLite, ONNX Runtime Mobile, ExecuTorch, MLX, MLC-LLM.
5. **Runtime**: graph fusion, weight sharing, KV cache, speculative decoding (LLM).
6. **System**: avoiding thermal throttling, background-mode limits, ANE/NPU offload.
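
Quantization is the compression step most of the patterns in this note rely on, so a toy version may help make it concrete. The sketch below shows symmetric per-tensor INT8 quantization in plain Python. The function names and sample weights are illustrative assumptions; real toolchains (CoreML, TFLite) do this per channel and with calibration data.

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization: returns (int values, scale)."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor maps [-max_abs, max_abs] onto [-127, 127].
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map integer codes back to approximate float weights."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.0, 0.9]  # hypothetical float weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Rounding error per weight is bounded by half a quantization step.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_err)
```

Storage drops from 4 bytes to 1 byte per weight, and the worst-case error is half of `scale`, which is why outlier weights (a large `max_abs`) hurt accuracy.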
### LLM-on-device trends
- 1B to 7B models + 4-bit quantization → runnable on iPhone/Android.
- Apple Foundation Models (~3B), Gemini Nano, Phi-3-mini, Llama-3.2-1B/3B.
- The KV cache dominates memory; mitigate with paged or sliding-window caches.
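
To see why the KV cache dominates, a back-of-the-envelope estimate is enough. The sketch below uses the standard size formula (keys plus values, per layer, per KV head, per position). The model shape is an assumed, illustrative one, roughly in the range of a 3B model with grouped-query attention; it is not taken from this note.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """KV cache size: 2 tensors (K and V) per layer, FP16 by default."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem


def weight_bytes(params, bits):
    """Weight storage for `params` parameters at `bits` bits each."""
    return params * bits // 8


# Assumed shape for illustration: 28 layers, 8 KV heads, head_dim 128.
LAYERS, KV_HEADS, HEAD_DIM = 28, 8, 128
GIB = 1024 ** 3

weights_4bit = weight_bytes(3_000_000_000, 4)
for ctx in (4096, 32768):
    cache = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, ctx)
    print(f"ctx={ctx}: KV cache {cache / GIB:.2f} GiB, "
          f"4-bit weights {weights_4bit / GIB:.2f} GiB")
```

Under these assumptions the cache at a 32k context is larger than the 4-bit weights themselves, which is the motivation for short contexts and windowed caches on device.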
## 💻 Patterns
### 1. PyTorch → CoreML
```python
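import coremltools as ct
import coremltools.optimize.coreml as cto
import torch

# `model` is assumed to be a trained torch.nn.Module loaded elsewhere.
model.eval()
example = torch.randn(1, 3, 224, 224)
ts = torch.jit.trace(model, example)
mlmodel = ct.convert(
    ts,
    inputs=[ct.ImageType(shape=example.shape, scale=1 / 255.0)],
    compute_units=ct.ComputeUnit.CPU_AND_NE,
    minimum_deployment_target=ct.target.iOS17,  # yields an ML Program
)
# ML Programs are quantized via coremltools.optimize; the older
# neural_network.quantization_utils only handles NeuralNetwork models.
config = cto.OptimizationConfig(
    global_config=cto.OpLinearQuantizerConfig(mode="linear_symmetric")
)
mlmodel = cto.linear_quantize_weights(mlmodel, config=config)  # 8-bit weights
mlmodel.save("Model.mlpackage")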
```
### 2. TFLite INT8 PTQ
```python
import numpy as np
import tensorflow as tf

def representative_dataset():
    # Random data is a placeholder; calibrate with ~100 real preprocessed samples.
    for _ in range(100):
        yield [np.random.randn(1, 224, 224, 3).astype(np.float32)]

conv = tf.lite.TFLiteConverter.from_saved_model("model")
conv.optimizations = [tf.lite.Optimize.DEFAULT]
conv.representative_dataset = representative_dataset
conv.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
conv.inference_input_type = tf.int8
conv.inference_output_type = tf.int8
with open("model_int8.tflite", "wb") as f:
    f.write(conv.convert())
```
### 3. MLX LLM (Apple Silicon)
```python
from mlx_lm import load, generate
model, tok = load("mlx-community/Llama-3.2-3B-Instruct-4bit")
print(generate(model, tok, prompt="Hello", max_tokens=128))
```
### 4. Knowledge distillation
```python
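import torch.nn.functional as F

# s, t: student/teacher logits [batch, classes]; y: integer class labels.
T = 4.0  # temperature; the T*T factor keeps gradient scale comparable across T
loss_kd = (T * T) * F.kl_div(F.log_softmax(s / T, -1), F.softmax(t / T, -1), reduction="batchmean")
loss = 0.7 * loss_kd + 0.3 * F.cross_entropy(s, y)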
```
### 5. Structured pruning
```python
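import torch.nn.utils.prune as P

# `layer` is assumed to be an nn.Conv2d or nn.Linear module.
P.ln_structured(layer, name="weight", amount=0.3, n=2, dim=0)  # zero 30% of output channels by L2 norm
P.remove(layer, "weight")  # bake the mask in; the tensor shape itself is unchanged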
```
### 6. ONNX Runtime Mobile (Android)
```kotlin
val env = OrtEnvironment.getEnvironment()
val opts = OrtSession.SessionOptions().apply { addNnapi() }
val session = env.createSession(modelBytes, opts)
val out = session.run(mapOf("input" to OnnxTensor.createTensor(env, input)))
```
### 7. KV cache + sliding window for LLM
```python
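# Reuse past_key_values inside the generation loop: each step feeds only the
# newest token instead of re-running the whole prefix (O(N) -> O(1) tokens per step).
# `model`, `input_ids`, `past` follow the Hugging Face transformers-style interface.
out = model(input_ids, past_key_values=past, use_cache=True)
past = out.past_key_values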
```
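
The snippet above covers cache reuse but not the sliding-window half of this pattern. A minimal sketch of the idea, assuming a toy cache that stores one entry per token (real runtimes hold per-layer tensors and usually keep a few initial tokens as well): a bounded buffer evicts the oldest entries so memory stays constant regardless of how long generation runs. The class name and string placeholders are hypothetical.

```python
from collections import deque


class SlidingWindowKVCache:
    """Keeps only the most recent `window` key/value entries."""

    def __init__(self, window):
        # deque(maxlen=...) drops the oldest item automatically when full.
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

    def __len__(self):
        return len(self.keys)


cache = SlidingWindowKVCache(window=4)
for step in range(10):
    # Stand-ins for the per-token key/value tensors a model would produce.
    cache.append(f"k{step}", f"v{step}")

print(len(cache), list(cache.keys))
```

The trade-off is that tokens outside the window are no longer attended to, so this suits chat-style workloads better than long-document recall.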
### 8. Thermal-aware scheduling
```swift
let thermal = ProcessInfo.processInfo.thermalState
let units: MLComputeUnits = (thermal == .nominal) ? .cpuAndNeuralEngine : .cpuOnly
```
### 9. Asset slimming
- Weight compression (zstd) with decompression on launch, mmap, partial load.
- Multi-target export (small/medium/large) with adaptive download.
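
As a rough illustration of the mmap and partial-load idea, the sketch below maps a weight file and reads only one slice of it, so untouched pages never have to be resident. The file layout (a flat array of little-endian float32 values) and the helper name `load_slice` are assumptions for the example, not a real model format.

```python
import mmap
import os
import struct
import tempfile

# Write a toy "weight file": 1000 float32 values, little-endian.
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
with open(path, "wb") as f:
    f.write(struct.pack("<1000f", *[float(i) for i in range(1000)]))


def load_slice(path, start, count, itemsize=4):
    """Read `count` float32 values starting at element `start` via mmap."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Only the pages backing this byte range need to be faulted in.
            raw = mm[start * itemsize:(start + count) * itemsize]
    return list(struct.unpack(f"<{count}f", raw))


chunk = load_slice(path, start=500, count=3)
print(chunk)
```

Compressed weights cannot be mapped this way directly, so a compress-then-mmap pipeline typically decompresses once to a cache file and maps that file afterwards.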
### 10. Profiling
```bash
xcrun coremlcompiler --target iphoneos17.0 ...
# Xcode Instruments → Core ML / Neural Engine / GPU / Energy
```
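## Decision criteria
| Goal | Preferred techniques |
|---|---|
| Lower latency | NPU offload + INT8 + graph fusion |
| Smaller model | INT4 quantization + pruning + distillation |
| Lower power | NPU-only execution + lower clock frequency + thermal-aware scheduling |
| Preserve accuracy | QAT over PTQ, distillation |
| Fast iteration | TFLite/CoreML PTQ, then measure right away |
| LLM on device | 4-bit weights + KV cache + small context |

**Default**: INT8 PTQ + NPU compilation; switch to QAT if the accuracy drop exceeds 1%; for LLMs of 6B parameters or more, use 4-bit weights with MLX/MLC.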
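
The default rule above can be written down as a small decision function. This is only a sketch of that one rule: the function name, argument names, and return strings are hypothetical, and the thresholds are the ones stated in this note.

```python
def pick_strategy(accuracy_drop_pct, llm_params_b=None):
    """Return the default optimization recipe for a model.

    accuracy_drop_pct: accuracy lost after INT8 PTQ, in percentage points.
    llm_params_b: LLM size in billions of parameters, or None for non-LLMs.
    """
    if llm_params_b is not None and llm_params_b >= 6:
        return "4-bit weights + MLX/MLC"
    if accuracy_drop_pct > 1.0:
        # PTQ lost too much accuracy, so fall back to quantization-aware training.
        return "QAT + NPU compile"
    return "PTQ INT8 + NPU compile"


print(pick_strategy(0.4))                  # small vision model, PTQ is fine
print(pick_strategy(2.5))                  # PTQ too lossy
print(pick_strategy(0.0, llm_params_b=7))  # large on-device LLM
```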
## 🔗 Graph
- Parent: [[Edge-AI]], [[LLM_Optimization_and_Deployment_Strategies|Model-Compression]]
- Variants: [[LLM_Optimization_and_Deployment_Strategies|Quantization]], [[LLM_Optimization_and_Deployment_Strategies|Knowledge-Distillation]], [[NAS]]
- Adjacent: [[ONNX-Runtime]]
## 🤖 LLM usage
**When to use**: pipeline boilerplate, quantization config suggestions, model size estimates.
**When not to use**: real-device latency and power figures (these must be measured on hardware).
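## ❌ Anti-patterns
- Deploying after INT8 PTQ having checked only accuracy, with no latency or memory measurement.
- An op the NPU does not support forces a CPU fallback on every step; once the graph is split into fragments, the NPU benefit disappears.
- Running a 32k LLM context on device: the KV cache exhausts RAM.
- Heavy inference in the background: the OS kills the process or the device thermal-throttles.
- Testing on a single device: Snapdragon, Tensor, and Apple hardware all give different results.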
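
The CPU-fallback anti-pattern is easy to check for before deployment. The sketch below walks a linear op sequence and groups it into contiguous NPU and CPU segments; every boundary between segments is a device hand-off. The op names and the supported-op set are hypothetical, and real delegates partition a graph rather than a flat list.

```python
def partition_ops(ops, npu_supported):
    """Group a linear op sequence into contiguous [device, op_count] segments."""
    segments = []
    for op in ops:
        device = "NPU" if op in npu_supported else "CPU"
        if not segments or segments[-1][0] != device:
            segments.append([device, 0])  # device changed: start a new segment
        segments[-1][1] += 1
    return segments


# Hypothetical model: `custom_norm` is not supported by the NPU delegate.
ops = ["conv", "relu", "conv", "custom_norm", "conv", "relu", "custom_norm", "softmax"]
supported = {"conv", "relu", "softmax"}

segments = partition_ops(ops, supported)
handoffs = len(segments) - 1
print(segments, "hand-offs:", handoffs)
```

Here two unsupported ops out of eight already produce four hand-offs per inference, so replacing or fusing them is often worth more than further quantization.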
## 🧪 Verification / duplicates
- Verified. Trust level A.
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup |