id: wiki-2026-0508-mobile-ai-optimization
title: Mobile AI Optimization
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: On-device AI, Mobile Inference, Edge ML, Mobile AI 최적화
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: mobile, edge, inference, coreml, tflite, mlx, npu, quantization, distillation
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language: swift-kotlin-python; framework: coreml-tflite-mlx

Mobile AI Optimization

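In one line

"Every watt, every megabyte, every millisecond." Mobile AI optimization is the set of techniques that shrink model size, memory, power, and latency together so that an NPU, GPU, or CPU can run the model in real time. Its main axes are quantization, distillation, NPU compilation, and runtime caching.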

Key points

Target hardware

  • Apple Neural Engine (ANE): A12+/M-series, INT8/FP16, accessed through CoreML.
  • Apple Silicon GPU + MLX: M1+ unified memory, well suited to FP16/BF16 LLMs.
  • Qualcomm Hexagon NPU (SNPE/QNN): Snapdragon 8 Gen 3 and later, INT4/INT8 acceleration.
  • Google Tensor TPU / NNAPI: Pixel series, edge TPU.
  • MediaTek APU, Samsung Exynos NPU: each ships its own SDK.

Optimization axes

  1. Model compression: pruning, quantization (FP16/INT8/INT4), low-rank factorization.
  2. Knowledge distillation: a large teacher trains a small student.
  3. Efficient architectures and architecture search: MobileNet, EfficientNet, MobileViT, Phi-3-mini, Gemini Nano.
  4. Compilers and runtimes: CoreML, TFLite, ONNX Runtime Mobile, ExecuTorch, MLX, MLC-LLM.
  5. Runtime techniques: graph fusion, weight sharing, KV cache, speculative decoding (LLM).
  6. System: avoiding thermal throttling, respecting background-mode limits, ANE/NPU offload.
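
To make the quantization axis concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. The weight values and the helper names `quantize_int8` and `dequantize` are illustrative assumptions; the converters listed above apply their own, more elaborate schemes (per-channel scales, calibration, and so on).

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization: w is approximated by scale * q."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; guard against an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the signed 8-bit range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the integer codes back to approximate floating-point weights."""
    return [scale * v for v in q]


weights = [0.52, -1.27, 0.003, 0.9, -0.41]  # hypothetical FP32 weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print("codes:", q)
print("scale:", scale, "max abs error:", max_err)
```

Each weight is stored in 1 byte instead of 4, and the rounding error per weight is bounded by half a quantization step (`scale / 2`). Small weights such as `0.003` collapse to zero, which is one reason accuracy has to be re-checked after quantization.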

LLM-on-device trends

  • 1B to 7B models + 4-bit quantization → runnable on iPhone/Android.
  • Apple Foundation Models (~3B), Gemini Nano, Phi-3-mini, Llama-3.2-1B/3B.
  • The KV cache dominates memory; mitigate it with paged or sliding-window caching.
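
The memory claims above can be checked with back-of-the-envelope arithmetic. The sketch below estimates weight storage and KV cache size; the layer count, KV-head count, and head dimension are assumed, illustrative values for a roughly 3B-parameter model, not figures taken from any specific model card, and the estimate ignores quantization metadata and activation memory.

```python
GIB = 1024 ** 3


def weight_bytes(n_params, bits_per_weight):
    """Raw weight storage, ignoring per-group scales and other metadata."""
    return n_params * bits_per_weight // 8


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len,
                   bytes_per_elem=2, batch=1):
    """KV cache size: keys and values (factor 2) for every layer and position."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem * batch


# Assumed shape for illustration only.
cfg = dict(n_layers=28, n_kv_heads=8, head_dim=128)

w = weight_bytes(3_000_000_000, bits_per_weight=4)
print(f"4-bit weights: {w / GIB:.2f} GiB")
for ctx in (2048, 8192, 32768):
    kv = kv_cache_bytes(context_len=ctx, **cfg)  # FP16 cache entries
    print(f"ctx={ctx:>6}: KV cache {kv / GIB:.2f} GiB")
```

Under these assumptions the FP16 KV cache at a 32k context is larger than the 4-bit weights themselves, which is why context length is the first thing to cap on a phone.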

💻 Patterns

1. PyTorch → CoreML

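import coremltools as ct
import coremltools.optimize.coreml as cto
import torch

# `model` is an existing torch.nn.Module; trace it with a representative input.
model.eval()
example = torch.randn(1, 3, 224, 224)
ts = torch.jit.trace(model, example)
mlmodel = ct.convert(
    ts,
    inputs=[ct.ImageType(shape=example.shape, scale=1 / 255.0)],
    compute_units=ct.ComputeUnit.CPU_AND_NE,
    minimum_deployment_target=ct.target.iOS17,  # yields an ML Program (.mlpackage)
)
# ML Programs are quantized through coremltools.optimize.coreml;
# neural_network.quantization_utils only handles the legacy neuralnetwork format.
config = cto.OptimizationConfig(
    global_config=cto.OpLinearQuantizerConfig(mode="linear_symmetric")  # 8-bit by default
)
mlmodel = cto.linear_quantize_weights(mlmodel, config=config)
mlmodel.save("Model.mlpackage")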

2. TFLite INT8 PTQ

import numpy as np
import tensorflow as tf

def representative_dataset():
    # Swap the random tensors for ~100 real calibration samples; random data skews INT8 ranges.
    for _ in range(100):
        yield [np.random.randn(1, 224, 224, 3).astype(np.float32)]

conv = tf.lite.TFLiteConverter.from_saved_model("model")
conv.optimizations = [tf.lite.Optimize.DEFAULT]
conv.representative_dataset = representative_dataset
conv.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]  # fail if an op has no INT8 kernel
conv.inference_input_type = tf.int8
conv.inference_output_type = tf.int8
with open("model_int8.tflite", "wb") as f:
    f.write(conv.convert())

3. MLX LLM (Apple Silicon)

from mlx_lm import load, generate

# Downloads (or reuses) the 4-bit community checkpoint together with its tokenizer.
model, tok = load("mlx-community/Llama-3.2-3B-Instruct-4bit")
print(generate(model, tok, prompt="Hello", max_tokens=128))

4. Knowledge distillation

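import torch.nn.functional as F

# s: student logits, t: teacher logits (detached), y: ground-truth labels
T = 4.0  # temperature; the T*T factor keeps the soft-target gradient scale comparable across T
loss_kd = (T * T) * F.kl_div(F.log_softmax(s / T, dim=-1), F.softmax(t / T, dim=-1), reduction="batchmean")
loss = 0.7 * loss_kd + 0.3 * F.cross_entropy(s, y)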

5. Structured pruning

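import torch.nn.utils.prune as P

# Zero out the 30% of output channels (dim=0) with the smallest L2 norm.
P.ln_structured(layer, name="weight", amount=0.3, n=2, dim=0)
# Make the mask permanent. The tensor keeps its shape, so slice the channels out separately to shrink it.
P.remove(layer, "weight")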

6. ONNX Runtime Mobile (Android)

import ai.onnxruntime.*

val env = OrtEnvironment.getEnvironment()
// Route supported ops to NNAPI; unsupported ops fall back to the CPU provider.
val opts = OrtSession.SessionOptions().apply { addNnapi() }
val session = env.createSession(modelBytes, opts)
val out = session.run(mapOf("input" to OnnxTensor.createTensor(env, input)))

7. KV cache + sliding window for LLM

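# Reuse past_key_values in the generation loop: each step feeds only the new token
# instead of re-encoding the whole prefix (attention still reads every cached key).
out = model(input_ids, past_key_values=past, use_cache=True)  # past is None on the first step
past = out.past_key_values
input_ids = out.logits[:, -1:].argmax(-1)  # greedy next token, fed alone on the next step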
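
The snippet above reuses the cache but does not bound it. A sliding window can be sketched as a fixed-capacity queue per layer; the class `SlidingWindowKVCache` below is hypothetical, and strings stand in for the key/value tensors. Whether dropping old positions is acceptable depends on the model, since it only works cleanly for models trained or adapted for windowed attention.

```python
from collections import deque


class SlidingWindowKVCache:
    """Keeps at most `window` (key, value) pairs per layer; the oldest are evicted."""

    def __init__(self, n_layers, window):
        self.window = window
        # deque(maxlen=...) drops the oldest entry automatically when full.
        self.layers = [deque(maxlen=window) for _ in range(n_layers)]

    def append(self, layer, key, value):
        self.layers[layer].append((key, value))

    def get(self, layer):
        keys = [k for k, _ in self.layers[layer]]
        values = [v for _, v in self.layers[layer]]
        return keys, values


cache = SlidingWindowKVCache(n_layers=2, window=4)
for pos in range(10):            # simulate decoding 10 tokens
    for layer in range(2):
        cache.append(layer, f"k{pos}", f"v{pos}")

keys, values = cache.get(0)
print(keys)    # only the 4 most recent positions remain
```

Memory stays constant at `window` entries per layer regardless of how many tokens are generated, at the cost of losing attention to anything older than the window.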

8. Thermal-aware scheduling

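import CoreML
// Use the Neural Engine only while the thermal state is nominal; otherwise fall back to CPU-only.
let thermal = ProcessInfo.processInfo.thermalState
let units: MLComputeUnits = (thermal == .nominal) ? .cpuAndNeuralEngine : .cpuOnly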

9. Asset slimming

  • Compress weights (zstd) and decompress at launch; use mmap and partial loading.
  • Multi-target export (small/medium/large) with adaptive download.
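
As a rough illustration of mmap-based partial loading, the sketch below writes a tiny weight file and then reads a single tensor from it without loading the rest. The container layout (raw little-endian float32 tensors stored back to back, with a separate name-to-offset index) and the helper `load_tensor` are assumptions made for this example, not an existing file format.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical container: tensors stored back to back as little-endian float32,
# with an index mapping name -> (byte offset, element count).
tensors = {"embed": [0.1, 0.2, 0.3, 0.4], "head": [1.0, 2.0]}
index, blob = {}, bytearray()
for name, values in tensors.items():
    index[name] = (len(blob), len(values))
    blob += struct.pack(f"<{len(values)}f", *values)

path = os.path.join(tempfile.mkdtemp(), "weights.bin")
with open(path, "wb") as f:
    f.write(blob)


def load_tensor(path, index, name):
    """Map the file and decode only the requested tensor's byte range."""
    offset, count = index[name]
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # The OS pages in only the region that is actually touched.
        return list(struct.unpack_from(f"<{count}f", mm, offset))


head = load_tensor(path, index, "head")
print(head)
```

The same idea scales to real checkpoints: keep the index small and resident, and let the operating system page in only the layers an inference pass touches.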

10. Profiling

xcrun coremlcompiler --target iphoneos17.0 ...
# Xcode Instruments → Core ML / Neural Engine / GPU / Energy
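
Alongside the platform profilers, a small timing harness helps compare model variants during development. The sketch below is a minimal host-side benchmark with warm-up runs and percentile reporting; `fake_inference` is a hypothetical stand-in for a real model call, and numbers gathered this way do not replace on-device measurement.

```python
import statistics
import time


def benchmark(fn, warmup=5, runs=50):
    """Time `fn` after a warm-up phase and report latency percentiles in ms."""
    for _ in range(warmup):          # warm caches, JITs, lazy initialisation
        fn()
    samples_ms = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - t0) * 1e3)
    samples_ms.sort()
    p95_index = min(len(samples_ms) - 1, int(0.95 * len(samples_ms)))
    return {
        "p50_ms": statistics.median(samples_ms),
        "p95_ms": samples_ms[p95_index],
        "max_ms": samples_ms[-1],
    }


def fake_inference():
    # Hypothetical placeholder workload; replace with the real model invocation.
    sum(i * i for i in range(10_000))


stats = benchmark(fake_inference)
print(stats)
```

Reporting p95 and max alongside the median matters on mobile, where thermal throttling and scheduler noise show up in the tail rather than in the average.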

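Decision criteria

| Goal | Preferred techniques |
|---|---|
| Lower latency | NPU offload + INT8 + graph fusion |
| Smaller model | INT4 quantization + pruning + distillation |
| Lower power | NPU-only execution + lower clock frequency + thermal-aware scheduling |
| Preserve accuracy | QAT over PTQ, distillation |
| Fast iteration | TFLite/CoreML PTQ, then measure immediately |
| LLM on device | 4-bit weights + KV cache + small context |

Default: INT8 PTQ + NPU compilation; switch to QAT when the accuracy drop exceeds 1%; for 6B+ LLMs use 4-bit weights + MLX/MLC.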
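
The default rule can be written down as a small helper, which is handy when a build script has to pick a recipe per model. The function name `pick_recipe`, its parameters, and the reading of "1%" as an absolute accuracy drop of 0.01 are assumptions made for this sketch.

```python
def pick_recipe(is_llm, n_params, ptq_accuracy_drop=None):
    """Return a starting optimization recipe following the default rule above.

    ptq_accuracy_drop is the measured absolute accuracy loss after INT8 PTQ
    (0.01 means one percentage point), or None if not measured yet.
    """
    if is_llm and n_params >= 6_000_000_000:
        return "4-bit weights + MLX/MLC"
    if ptq_accuracy_drop is not None and ptq_accuracy_drop > 0.01:
        return "QAT INT8 + NPU compile"
    return "PTQ INT8 + NPU compile"


print(pick_recipe(is_llm=False, n_params=5_000_000))
print(pick_recipe(is_llm=False, n_params=5_000_000, ptq_accuracy_drop=0.03))
print(pick_recipe(is_llm=True, n_params=7_000_000_000))
```

The returned recipe is only a starting point; the result still has to be measured on the target devices.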

🔗 Graph

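🤖 Using LLMs

When to use: pipeline boilerplate, quantization config recommendations, model size estimates. When not to: real-device latency and power measurement (these must be measured on hardware).

Anti-patterns

  • Deploying after INT8 PTQ having checked only accuracy, with no latency or memory measurement.
  • An op the NPU does not support slips into the graph and forces a CPU fallback on every step; once the graph is fragmented, the NPU gain disappears.
  • Running a 32k LLM context on device: the KV cache exhausts RAM.
  • Heavy inference in the background: the OS kills the process or thermal throttling kicks in.
  • Testing on a single device: Snapdragon, Tensor, and Apple silicon all produce different results.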
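
The CPU-fallback anti-pattern is easy to spot if the op list is split by backend before deployment. The sketch below treats the model as a linear op sequence and counts the backend switches; the op names and the `npu_supported` set are hypothetical, and real delegates partition a full graph rather than a flat list.

```python
from itertools import groupby


def partition(ops, npu_supported):
    """Split an op sequence into contiguous runs executed on the same backend."""
    def backend(op):
        return "NPU" if op in npu_supported else "CPU"
    return [(b, list(run)) for b, run in groupby(ops, key=backend)]


npu_supported = {"conv", "relu", "add", "matmul"}   # hypothetical support list
ops = ["conv", "relu", "custom_norm", "conv", "relu", "custom_norm", "matmul"]

segments = partition(ops, npu_supported)
transfers = len(segments) - 1   # each boundary is a device hand-off
for name, run in segments:
    print(name, run)
print("backend switches per inference:", transfers)
```

Two unsupported ops are enough here to turn one NPU graph into five segments with four hand-offs per inference, each of which adds synchronization and memory-copy overhead.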

🧪 Verification / Duplicates

  • Verified. Trust level A.

🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup |