---
id: wiki-2026-0508-이미지-생성-최적화-image-generation-opti
title: 이미지 생성 최적화 (Image Generation Optimization)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Image Gen Optimization, Diffusion Inference Optimization, 이미지 생성 가속]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [ai, image-generation, optimization, inference, diffusion]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: diffusers-tensorrt
---
# Image Generation Optimization
## One-line summary
> **"Resolve the latency × cost × quality trilemma in one stack: step reduction, quantization, and compilation."** As of 2026, production image generation combines distillation (4-step Schnell, Lightning, LCM), quantization (FP8/INT4), graph compilation (TensorRT, torch.compile), and batch fusion to compress a 50-step, 30 s generation into a 4-step, 0.5 s one. The quality loss measured by perceptual evaluation is < 5%.
## Core concepts
### Optimization axes
- **Steps**: 50 → 4 (distillation).
- **Precision**: FP32 → FP16 → FP8 → INT4.
- **Compilation**: eager → torch.compile → TensorRT.
- **Caching**: KV cache, prompt embed cache, latent cache.
- **Resolution**: 1024 → progressive (256→512→1024).
- **Batching**: dynamic batching, continuous batching.
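
To make the precision axis concrete, the following is a minimal sketch that estimates weight storage at each precision from a parameter count. The function name `weight_memory_gib` and the 2.6-billion-parameter figure are illustrative assumptions, and the estimate covers weights only: activations, quantization scales, the VAE, and text encoders are ignored.

```python
# Bytes needed to store one weight at each precision on the axis above.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_memory_gib(num_params: int, precision: str) -> float:
    """Approximate weight storage in GiB (weights only, no activations or scales)."""
    return num_params * BYTES_PER_PARAM[precision] / 2**30


if __name__ == "__main__":
    params = 2_600_000_000  # assumed parameter count, for illustration only
    for precision in BYTES_PER_PARAM:
        print(f"{precision}: {weight_memory_gib(params, precision):.2f} GiB")
```

Each step down the precision ladder halves the weight footprint, which is why quantization is the first lever when memory is the constraint.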
### Distillation techniques
- **LCM**: Latent Consistency Model, 4-step.
- **SDXL Lightning**: 1/2/4/8-step variants.
- **Hyper-SD**: 1-step possible.
- **FLUX Schnell**: 4-step out-of-box.
- **DMD2**: distribution matching, single-step quality.
### Applications
1. Sub-second UX for real-time generation (Krea, Magnific).
2. On-device mobile generation (Core ML, MLC).
3. Maximum throughput for mass batch rendering.
## 💻 Patterns
### Step reduction (LCM-LoRA)
```python
from diffusers import StableDiffusionXLPipeline, LCMScheduler
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
# Swap in the LCM scheduler and load the LCM-LoRA distillation weights.
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")

prompt = "a photo of a cat"  # placeholder prompt
# 4-step generation; guidance_scale=1.0 turns off classifier-free guidance
img = pipe(prompt, num_inference_steps=4, guidance_scale=1.0).images[0]
# 50-step (3.5s) → 4-step (0.4s) on A100
```
### torch.compile
```python
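import torch

# Assumes `pipe` is an SDXL pipeline that is already loaded on the GPU.
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
pipe.vae.decode = torch.compile(pipe.vae.decode, mode="reduce-overhead")
# Warmup: the first call triggers compilation and is much slower than later calls.
_ = pipe("warmup", num_inference_steps=4)
# 1.4-2x speedup after warmup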
```
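
Because the first compiled call pays the compilation cost, warmup and steady-state latency should be measured separately. The following is a minimal timing sketch using only the standard library; `time_calls` and `median_latency` are hypothetical helper names, and `fn` is assumed to block until its result is ready (a pipeline call that returns PIL images typically does).

```python
import statistics
import time
from typing import Callable, List, Tuple


def time_calls(
    fn: Callable[[], object], warmup: int = 1, runs: int = 5
) -> Tuple[List[float], List[float]]:
    """Return (warmup_seconds, steady_seconds) for repeated calls to fn."""

    def timed() -> float:
        start = time.perf_counter()
        fn()
        return time.perf_counter() - start

    # Warmup calls absorb one-off costs such as compilation and autotuning.
    warmup_times = [timed() for _ in range(warmup)]
    steady_times = [timed() for _ in range(runs)]
    return warmup_times, steady_times


def median_latency(steady_times: List[float]) -> float:
    """Median is less sensitive to a single slow outlier than the mean."""
    return statistics.median(steady_times)
```

A call such as `time_calls(lambda: pipe("warmup", num_inference_steps=4))` would then report the compile-time hit and the steady-state latency as separate lists.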
### TensorRT (production)
```python
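# Export → TensorRT engine
import torch
from polygraphy.backend.common import BytesFromPath
from polygraphy.backend.trt import EngineFromBytes, TrtRunner

# 1. ONNX export (dummy_inputs: example (sample, timestep, encoder_hidden_states) tuple)
torch.onnx.export(pipe.unet, dummy_inputs, "unet.onnx",
                  input_names=["sample", "timestep", "encoder_hidden_states"],
                  opset_version=17, dynamic_axes={...})
# 2. trtexec build
# trtexec --onnx=unet.onnx --saveEngine=unet.plan --fp16 --memPoolSize=workspace:8192
# 3. Runtime: load the engine built in step 2 instead of rebuilding it from ONNX
with TrtRunner(EngineFromBytes(BytesFromPath("unet.plan"))) as r:
    # x, t, h: input arrays matching the names given at export time
    out = r.infer({"sample": x, "timestep": t, "encoder_hidden_states": h})
# 2-3x faster than torch.compile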
```
### FP8 quantization (Hopper / Ada)
```python
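from optimum.quanto import quantize, qfloat8, freeze

# `pipe.transformer` exists on DiT-based pipelines (e.g. FLUX); for SDXL, target `pipe.unet`.
quantize(pipe.transformer, weights=qfloat8, activations=qfloat8)
# Note: quantized activations typically need a calibration pass before freezing.
freeze(pipe.transformer)  # replace float weights with their quantized versions
# memory: 24GB → 13GB; latency: 1.3x faster on H100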
```
### Prompt embed cache
```python
import hashlib
from pathlib import Path

import torch

class EmbedCache:
    """Disk cache for prompt embeddings, keyed by a hash of the prompt text."""
    def __init__(self, cache_dir="./.embed_cache"):
        self.dir = Path(cache_dir)
        self.dir.mkdir(parents=True, exist_ok=True)

    def get_or_compute(self, prompt, encoder_fn):
        key = hashlib.sha256(prompt.encode()).hexdigest()
        p = self.dir / f"{key}.pt"
        if p.exists():
            return torch.load(p)
        emb = encoder_fn(prompt)
        torch.save(emb, p)
        return emb

cache = EmbedCache()
emb = cache.get_or_compute(prompt, pipe.encode_prompt)
# repeat prompt: skip text encoder entirely
```
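
The cache above keys on the prompt text alone, so two models sharing one cache directory would read each other's embeddings. The following is a minimal sketch of a key that also covers the model and any encoder settings; `embed_cache_key`, `model_id`, and the `settings` keyword arguments are hypothetical names, and the settings are assumed to be JSON-serializable.

```python
import hashlib
import json


def embed_cache_key(prompt: str, model_id: str, **settings) -> str:
    """Hash the prompt together with everything that changes the embedding."""
    payload = json.dumps(
        {"prompt": prompt, "model_id": model_id, "settings": settings},
        sort_keys=True,       # keyword order must not change the key
        ensure_ascii=False,   # keep non-ASCII prompts byte-stable under UTF-8
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    print(embed_cache_key("a cat", "stabilityai/stable-diffusion-xl-base-1.0", dtype="fp16"))
```

Swapping this in for the prompt-only hash trades a slightly longer key computation for protection against stale or mismatched cache hits.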
### Continuous batching (server)
```python
# vLLM-style continuous batching for diffusion (sdxl-batched-server)
# Simplified here to time-window batching: requests queued within wait_ms share one pipe() call.
from collections import deque
import asyncio

class BatchedServer:
    def __init__(self, max_batch=8, wait_ms=20):
        self.q = deque()
        self.max_batch = max_batch
        self.wait_ms = wait_ms

    async def submit(self, prompt):
        fut = asyncio.get_running_loop().create_future()
        self.q.append((prompt, fut))
        return await fut

    async def loop(self):
        while True:
            await asyncio.sleep(self.wait_ms / 1000)
            if not self.q:
                continue
            batch = [self.q.popleft() for _ in range(min(len(self.q), self.max_batch))]
            prompts = [p for p, _ in batch]
            # pipe() blocks; run it in a worker thread so the event loop keeps accepting requests
            imgs = (await asyncio.to_thread(pipe, prompts)).images
            for (_, fut), img in zip(batch, imgs):
                fut.set_result(img)
```
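
The batching loop can be exercised without a GPU by swapping the pipeline for a stub. The following is a self-contained sketch under that assumption: `MicroBatcher`, `fake_pipe`, and `demo` are hypothetical names, and the batcher groups requests per time window rather than admitting new requests between denoising steps, so it is a simplification of true continuous batching.

```python
import asyncio
from collections import deque


class MicroBatcher:
    """Collect requests for a short window, then run them as one batch."""

    def __init__(self, run_batch, max_batch=8, wait_ms=20):
        self.run_batch = run_batch  # callable: list of prompts -> list of results
        self.max_batch = max_batch
        self.wait_ms = wait_ms
        self.queue = deque()
        self.batch_sizes = []  # recorded only so the demo can inspect batching

    async def submit(self, prompt):
        fut = asyncio.get_running_loop().create_future()
        self.queue.append((prompt, fut))
        return await fut

    async def loop(self):
        while True:
            await asyncio.sleep(self.wait_ms / 1000)
            if not self.queue:
                continue
            n = min(len(self.queue), self.max_batch)
            batch = [self.queue.popleft() for _ in range(n)]
            self.batch_sizes.append(n)
            # Run the blocking call in a thread so submit() stays responsive.
            results = await asyncio.to_thread(self.run_batch, [p for p, _ in batch])
            for (_, fut), result in zip(batch, results):
                fut.set_result(result)


def fake_pipe(prompts):
    """Stand-in for pipe(prompts).images; returns one string per prompt."""
    return [f"image for {p!r}" for p in prompts]


async def demo():
    server = MicroBatcher(fake_pipe, max_batch=4, wait_ms=10)
    worker = asyncio.create_task(server.loop())
    results = await asyncio.gather(*(server.submit(f"prompt {i}") for i in range(6)))
    worker.cancel()
    return results, server.batch_sizes


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

With six concurrent requests and `max_batch=4`, the stub is called twice: once with four prompts and once with the remaining two.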
### Progressive resolution
```python
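# Cascade: 256 → 512 → 1024
from diffusers import AutoPipelineForImage2Image

# Reuse the text-to-image pipeline's weights for the img2img stages.
img2img_pipe = AutoPipelineForImage2Image.from_pipe(pipe)

img_lo = pipe(prompt, height=256, width=256, num_inference_steps=8).images[0]
# img2img takes its output size from the input image, so upscale before each refinement pass.
img_md = img2img_pipe(prompt, image=img_lo.resize((512, 512)), strength=0.5,
                      num_inference_steps=8).images[0]
img_hi = img2img_pipe(prompt, image=img_md.resize((1024, 1024)), strength=0.3,
                      num_inference_steps=8).images[0]
# Total cost < single-pass 1024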
```
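
The cost claim can be checked with back-of-the-envelope arithmetic. The sketch below assumes cost scales with pixels × denoising steps (attention scales worse than linearly, and per-stage VAE work is ignored), that img2img runs roughly `int(num_inference_steps * strength)` steps, and that the single-pass baseline also uses 8 steps. `effective_steps` and `relative_cost` are hypothetical helper names.

```python
def effective_steps(num_inference_steps: int, strength: float = 1.0) -> int:
    """Denoising steps actually run; img2img skips the first (1 - strength) share."""
    return min(int(num_inference_steps * strength), num_inference_steps)


def relative_cost(stages, baseline):
    """Cost of a cascade relative to a single pass, measured as pixels x steps.

    Each stage and the baseline is a (side_px, num_inference_steps, strength) tuple.
    """

    def cost(side, steps, strength):
        return side * side * effective_steps(steps, strength)

    return sum(cost(*stage) for stage in stages) / cost(*baseline)


# The three stages of the cascade: 256 (full), 512 (strength 0.5), 1024 (strength 0.3).
cascade = [(256, 8, 1.0), (512, 8, 0.5), (1024, 8, 0.3)]
ratio = relative_cost(cascade, baseline=(1024, 8, 1.0))
print(f"cascade / single-pass cost: {ratio:.4f}")  # prints 0.4375
```

Under these assumptions the cascade costs a bit under half of a single 8-step pass at 1024, with most of the remaining cost in the final 1024 stage.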
### MLX (Apple Silicon)
```python
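import mlx.core as mx
# NOTE: the `mlx_diffusion` module name and the `generate` signature are unverified
# assumptions; check the API of the MLX Stable Diffusion package you install.
from mlx_diffusion import StableDiffusion

sd = StableDiffusion("stabilityai/sdxl-turbo", float16=True)
img = sd.generate("a cat", n_steps=4, n_images=4)
# M3 Max: 4-step 1024px in ~1.2s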
```
## Decision criteria
| Situation | Approach |
|---|---|
| latency critical | distill (4-step) + TensorRT |
| memory tight | FP8/INT4 quantize |
| Apple device | MLX |
| repeat prompts | embed cache |
| many concurrent | continuous batch |
| highest quality | full 50-step + xformers |
**Default**: 4-step LCM/Lightning + torch.compile + FP16; escalate to TensorRT above 10 RPS.
## 🔗 Graph
- Parent: [[AI Image Generation]]
- Adjacent: [[TensorRT]] · [[torch.compile]] · [[오픈소스 이미지 모델 미세 조정 및 배포]]
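## 🤖 LLM usage
**When to use**: interpreting bottleneck profiles, planning kernel fusion, drafting deploy configs.
**When not to use**: writing low-level CUDA kernels; consult the Triton/CUTLASS docs directly.
## ❌ Anti-patterns
- **Optimizing before profiling**: guessing at bottlenecks without NVTX ranges or the PyTorch profiler.
- **Over-distillation**: pushing down to 1-step and hitting a quality cliff because perceptual evaluation was skipped.
- **Quantizing without calibration**: relying on dynamic quantization alone, which makes quality collapse.
- **Single-process bottleneck**: a synchronous server that ignores the GIL.
## 🧪 Verification / Duplicates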
- Verified (LCM paper Luo 2023, SDXL Lightning ByteDance 2024, NVIDIA TRT-LLM docs).
- Trust level: A.
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — distillation + quantize + compile stack. |