---
id: wiki-2026-0508-이미지-생성-최적화-image-generation-opti
title: Image Generation Optimization
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Image Gen Optimization, Diffusion Inference Optimization, 이미지 생성 가속]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [ai, image-generation, optimization, inference, diffusion]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: diffusers-tensorrt
---
# Image Generation Optimization
## One-line summary
> **"Resolve the latency × cost × quality trilemma on all three fronts at once through step reduction, quantization, and compilation."** Production image generation in 2026 combines distillation (4-step Schnell, Lightning, LCM), quantization (FP8/INT4), graph compilation (TensorRT, torch.compile), and batch fusion to compress a 50-step, 30 s generation into a 4-step, 0.5 s one. The quality loss measured by perceptual evaluation is < 5%.
## Core concepts
### Optimization axes
- **Steps**: 50 → 4 (distillation).
- **Precision**: FP32 → FP16 → FP8 → INT4.
- **Compilation**: eager → torch.compile → TensorRT.
- **Caching**: KV cache, prompt embed cache, latent cache.
- **Resolution**: 1024 → progressive (256→512→1024).
- **Batching**: dynamic batching, continuous batching.
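
To make the precision axis concrete, the sketch below estimates weight-only memory at each precision listed above. `weight_memory_gib` is a hypothetical helper, and the 2.6e9 parameter count is an illustrative assumption rather than a measured value; activations, text encoders, the VAE, and quantization scale tensors are all ignored.

```python
# Bytes per parameter for each precision on the axis above.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "fp8": 1.0, "int4": 0.5}

def weight_memory_gib(n_params: float, precision: str) -> float:
    """Weight-only footprint in GiB; ignores activations and quantization scales."""
    return n_params * BYTES_PER_PARAM[precision] / 2**30

n_params = 2.6e9  # assumption: illustrative denoiser size, not a measurement
for precision in BYTES_PER_PARAM:
    print(f"{precision}: {weight_memory_gib(n_params, precision):.2f} GiB")
```

Each step down the precision ladder halves the weight footprint, which is why quantization is the first lever when memory is the constraint.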
### Distillation techniques
- **LCM**: Latent Consistency Model, 4-step.
- **SDXL Lightning**: 1/2/4/8-step variants.
- **Hyper-SD**: 1-step possible.
- **FLUX Schnell**: 4-step out-of-box.
- **DMD2**: distribution matching, single-step quality.
### Applications
1. Sub-second UX for real-time generation (Krea, Magnific).
2. On-device mobile generation (Core ML, MLC).
3. Maximum throughput for mass batch rendering.
## 💻 Patterns
### Step reduction (LCM-LoRA)
```python
from diffusers import StableDiffusionXLPipeline, LCMScheduler
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
# LCM-LoRA needs the LCM scheduler plus the distilled LoRA weights
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")

prompt = "a photo of a red fox in the snow"  # placeholder prompt
# 4-step generation; guidance_scale=1.0 turns classifier-free guidance off
img = pipe(prompt, num_inference_steps=4, guidance_scale=1.0).images[0]
# 50-step (3.5s) → 4-step (0.4s) on A100
```
### torch.compile
```python
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
pipe.vae.decode = torch.compile(pipe.vae.decode, mode="reduce-overhead")
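# warmup: the first call triggers compilation and is much slower than later calls
_ = pipe("warmup", num_inference_steps=4)
# 1.4-2x speedup after warmup (changing input shapes can trigger recompilation)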
```
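
Because the first compiled call includes compilation time, a speedup figure only means something when warmup is excluded from the measurement. The following is a minimal, framework-agnostic timing sketch; `benchmark` is a hypothetical helper, not part of diffusers or PyTorch. On a GPU, pass `torch.cuda.synchronize` as `sync` so that asynchronously queued kernels finish before the clock is read.

```python
import statistics
import time
from typing import Callable, Optional

def benchmark(fn: Callable[[], object], warmup: int = 2, iters: int = 5,
              sync: Optional[Callable[[], None]] = None) -> dict:
    """Time fn() after discarding warmup calls; latencies are in seconds."""
    for _ in range(warmup):
        fn()                      # compilation / autotuning happens here
    if sync:
        sync()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        if sync:
            sync()                # wait for queued GPU work before stopping the clock
        samples.append(time.perf_counter() - start)
    return {"median_s": statistics.median(samples), "min_s": min(samples), "n": len(samples)}
```

A typical call would look like `benchmark(lambda: pipe(prompt, num_inference_steps=4), sync=torch.cuda.synchronize)`, run once before and once after compiling.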
### TensorRT (production)
```python
# Export → TensorRT engine
import torch
from polygraphy.backend.common import BytesFromPath
from polygraphy.backend.trt import EngineFromBytes, TrtRunner

# 1. ONNX export (dummy_inputs: example tensors for the UNet inputs;
#    dynamic_axes is elided in the original note)
torch.onnx.export(pipe.unet, dummy_inputs, "unet.onnx",
                  opset_version=17, dynamic_axes={...})
# 2. trtexec build
# trtexec --onnx=unet.onnx --saveEngine=unet.plan --fp16 --memPoolSize=workspace:8192
# 3. Runtime: load the FP16 engine built in step 2 instead of rebuilding from ONNX
with TrtRunner(EngineFromBytes(BytesFromPath("unet.plan"))) as r:
    # x, t, h: arrays matching the engine's input names and shapes
    out = r.infer({"sample": x, "timestep": t, "encoder_hidden_states": h})
# 2-3x faster than torch.compile
```
### FP8 quantization (Hopper / Ada)
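```python
# Assumes a transformer-based pipeline (pipe.transformer, e.g. FLUX);
# for SDXL the denoiser lives at pipe.unet instead.
from optimum.quanto import quantize, qfloat8, freeze

quantize(pipe.transformer, weights=qfloat8, activations=qfloat8)
# Activation quantization should be calibrated on sample prompts before freezing
freeze(pipe.transformer)  # replace float weights with their quantized versions
# memory: 24GB → 13GB; latency: 1.3x faster on H100
```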
### Prompt embed cache
```python
import hashlib
from pathlib import Path

import torch

class EmbedCache:
    """Disk cache for text-encoder outputs, keyed by the SHA-256 of the prompt."""
    def __init__(self, cache_dir="./.embed_cache"):
        # The key covers the prompt only, so use one cache_dir per model
        self.dir = Path(cache_dir)
        self.dir.mkdir(parents=True, exist_ok=True)

    def get_or_compute(self, prompt, encoder_fn):
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        p = self.dir / f"{key}.pt"
        if p.exists():
            return torch.load(p)
        emb = encoder_fn(prompt)  # a tensor or a tuple of tensors
        torch.save(emb, p)
        return emb

cache = EmbedCache()
emb = cache.get_or_compute(prompt, pipe.encode_prompt)
# repeat prompt: skip text encoder entirely
```
### Continuous batching (server)
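```python
# vLLM-style continuous batching for diffusion (sdxl-batched-server)
# Note: as written this is time-window dynamic batching; requests are grouped
# once per tick rather than joining a batch mid-denoising.
import asyncio
from collections import deque

class BatchedServer:
    def __init__(self, max_batch=8, wait_ms=20):
        self.q = deque()
        self.max_batch = max_batch
        self.wait_ms = wait_ms

    async def submit(self, prompt):
        fut = asyncio.get_running_loop().create_future()
        self.q.append((prompt, fut))
        return await fut

    async def loop(self):
        while True:
            await asyncio.sleep(self.wait_ms / 1000)  # collection window
            if not self.q:
                continue
            batch = [self.q.popleft() for _ in range(min(len(self.q), self.max_batch))]
            prompts = [p for p, _ in batch]
            try:
                # run the blocking pipeline in a worker thread so submit() stays responsive
                imgs = (await asyncio.to_thread(pipe, prompts)).images
            except Exception as exc:
                for _, fut in batch:
                    fut.set_exception(exc)  # do not leave callers waiting forever
                continue
            for (_, fut), img in zip(batch, imgs):
                fut.set_result(img)
```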
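
When tuning `max_batch` and `wait_ms` for a server like the one above, a rough steady-state estimate of the batch size helps. The sketch below assumes evenly spaced arrivals and a batch run time that does not grow with batch size; both are simplifications. `expected_batch_size` and `service_ms` (the time one batch takes to run) are hypothetical names introduced here.

```python
def expected_batch_size(rps: float, wait_ms: float, service_ms: float, max_batch: int) -> float:
    """Requests that arrive during one wait window plus one batch run, capped at max_batch."""
    arrivals = rps * (wait_ms + service_ms) / 1000.0
    # A batch only forms once at least one request exists, and never exceeds max_batch.
    return min(float(max_batch), max(1.0, arrivals))

print(expected_batch_size(rps=10, wait_ms=20, service_ms=400, max_batch=8))
```

Under these assumptions, most of the batching at moderate load comes from requests that queue up while the previous batch is running, not from the 20 ms wait window itself.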
### Progressive resolution
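```python
# Cascade: 256 → 512 → 1024
from diffusers import AutoPipelineForImage2Image

img2img_pipe = AutoPipelineForImage2Image.from_pipe(pipe)  # reuses the loaded weights
img_lo = pipe(prompt, height=256, width=256, num_inference_steps=8).images[0]
# The SDXL img2img pipeline sizes its output from the input image,
# so upscale before each refinement pass instead of passing height/width.
img_md = img2img_pipe(prompt, image=img_lo.resize((512, 512)), strength=0.5,
                      num_inference_steps=8).images[0]
img_hi = img2img_pipe(prompt, image=img_md.resize((1024, 1024)), strength=0.3,
                      num_inference_steps=8).images[0]
# Total cost < single-pass 1024
```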
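
The cost claim for the cascade can be sanity-checked with a crude proxy: output pixels times the number of denoising steps actually executed. The sketch assumes that diffusers img2img pipelines run about `int(num_inference_steps * strength)` steps, and it ignores the superlinear cost of attention at high resolution as well as VAE and resize overhead. `effective_steps` and `pixel_steps` are hypothetical helpers.

```python
def effective_steps(num_inference_steps: int, strength: float = 1.0) -> int:
    """img2img runs only the last `strength` fraction of the schedule."""
    return min(int(num_inference_steps * strength), num_inference_steps)

def pixel_steps(size: int, num_inference_steps: int, strength: float = 1.0) -> int:
    """Crude cost proxy: output pixels x denoising steps actually executed."""
    return size * size * effective_steps(num_inference_steps, strength)

cascade = (pixel_steps(256, 8)            # text-to-image base pass, 8 steps
           + pixel_steps(512, 8, 0.5)     # 4 effective steps
           + pixel_steps(1024, 8, 0.3))   # 2 effective steps
single = pixel_steps(1024, 8)
print(cascade / single)  # 0.4375
```

By this proxy the three-stage cascade costs a bit under half of a single 8-step pass at 1024, and most of that comes from the final refinement stage.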
### MLX (Apple Silicon)
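```python
# NOTE: the mlx_diffusion module and the generate() signature are reproduced from
# this note as-is and are unverified; check them against the MLX Stable Diffusion
# package you actually install.
import mlx.core as mx
from mlx_diffusion import StableDiffusion

sd = StableDiffusion("stabilityai/sdxl-turbo", float16=True)
img = sd.generate("a cat", n_steps=4, n_images=4)
# M3 Max: 4-step 1024px in ~1.2s
```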
## Decision criteria
| Situation | Approach |
|---|---|
| latency critical | distill (4-step) + TensorRT |
| memory tight | FP8/INT4 quantize |
| Apple device | MLX |
| repeat prompts | embed cache |
| many concurrent | continuous batch |
| highest quality | full 50-step + xformers |
**Default**: 4-step LCM/Lightning + torch.compile + FP16; escalate to TensorRT above 10 RPS.
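
The table and default can be encoded as a small selection helper. This is only a sketch: the `Workload` fields, the `choose_stack` name, and the precedence between rows (Apple device first, then highest quality, then the default with adjustments) are assumptions, since the table does not say how to combine situations.

```python
from dataclasses import dataclass

@dataclass
class Workload:
    rps: float                    # sustained requests per second
    memory_tight: bool = False
    apple_silicon: bool = False
    repeat_prompts: bool = False
    max_quality: bool = False

def choose_stack(w: Workload) -> list:
    if w.apple_silicon:
        return ["MLX"]
    if w.max_quality:
        return ["full 50-step", "xformers"]
    stack = ["4-step LCM/Lightning", "FP16", "torch.compile"]  # default from the table
    if w.rps > 10:
        stack[stack.index("torch.compile")] = "TensorRT"       # escalate above 10 RPS
    if w.memory_tight:
        stack[stack.index("FP16")] = "FP8/INT4 quantization"
    if w.repeat_prompts:
        stack.append("prompt embed cache")
    return stack
```

For example, `choose_stack(Workload(rps=50, repeat_prompts=True))` swaps torch.compile for TensorRT and appends the embed cache.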
## 🔗 Graph
- Parent: [[AI 이미지 생성 (AI Image Generation)]]
- Adjacent: [[TensorRT]] · [[torch.compile]] · [[오픈소스 이미지 모델 미세 조정 및 배포]]
## 🤖 LLM usage
**When to use**: interpreting bottleneck profiles, planning kernel fusion, drafting deployment configs.
**When not to use**: writing low-level CUDA kernels; consult the Triton/CUTLASS docs directly.
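## ❌ Anti-patterns
- **Optimizing before profiling**: guessing at bottlenecks without NVTX ranges or the PyTorch profiler.
- **Over-distillation**: dropping to 1 step and hitting a quality cliff because the perceptual eval was skipped.
- **Quantizing without calibration**: relying on dynamic quantization alone, which makes quality collapse.
- **Single-process bottleneck**: a synchronous server that ignores the GIL.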
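
To keep over-distillation and uncalibrated quantization from slipping through, a regression gate on a perceptual score can run before an optimized variant ships. The sketch below assumes a higher-is-better score already computed on a fixed prompt set (which metric to use is left open), and takes its 5% threshold from the quality budget in the summary at the top. `quality_drop` and `passes_gate` are hypothetical helpers.

```python
def quality_drop(baseline: float, candidate: float) -> float:
    """Relative drop of a higher-is-better score between baseline and candidate."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return (baseline - candidate) / baseline

def passes_gate(baseline: float, candidate: float, max_drop: float = 0.05) -> bool:
    """True when the optimized variant stays within the allowed quality budget."""
    return quality_drop(baseline, candidate) <= max_drop
```

For example, `passes_gate(0.80, 0.78)` accepts a 2.5% drop, while `passes_gate(0.80, 0.70)` rejects a 12.5% drop.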
## 🧪 Verification / duplicates
- Verified (LCM paper Luo 2023, SDXL Lightning ByteDance 2024, NVIDIA TRT-LLM docs).
- Trust level: A.
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — distillation + quantize + compile stack. |