---
id: wiki-2026-0508-이미지-생성-최적화-image-generation-opti
title: 이미지 생성 최적화 (Image Generation Optimization)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Image Gen Optimization, Diffusion Inference Optimization, 이미지 생성 가속]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [ai, image-generation, optimization, inference, diffusion]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: diffusers-tensorrt
---

# Image Generation Optimization

## One-liner

> **"Tackle the latency × cost × quality trilemma with step reduction, quantization, and compilation at the same time."**

As of 2026, production image generation combines distillation (4-step FLUX Schnell, SDXL Lightning, LCM), quantization (FP8/INT4), graph compilation (TensorRT, torch.compile), and batch fusion to compress a 50-step, 30 s generation into a 4-step, 0.5 s one. The quality loss is under 5% in perceptual evaluation.

## Core concepts

### Optimization axes

- **Steps**: 50 → 4 (distillation).
- **Precision**: FP32 → FP16 → FP8 → INT4.
- **Compilation**: eager → torch.compile → TensorRT.
- **Caching**: KV cache, prompt embed cache, latent cache.
- **Resolution**: 1024 → progressive (256→512→1024).
- **Batching**: dynamic batching, continuous batching.

### Distillation techniques

- **LCM**: Latent Consistency Model, 4-step.
- **SDXL Lightning**: 1/2/4/8-step variants.
- **Hyper-SD**: 1-step possible.
- **FLUX Schnell**: 4-step out of the box.
- **DMD2**: distribution matching, single-step quality.

### Applications

1. Sub-second UX for real-time generation (Krea, Magnific).
2. On-device mobile generation (Core ML, MLC).
3. Maximum throughput for large-batch rendering.

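The headline figures in the one-liner (50-step, 30 s → 4-step, 0.5 s) can be split into the share that comes from step reduction and the share the other axes must supply. The snippet below is a back-of-envelope sketch, not a benchmark: the helper name `decompose_speedup` is hypothetical, and it assumes latency scales linearly with step count, ignoring fixed costs such as text encoding and VAE decode.

```python
def decompose_speedup(base_steps, base_latency_s, fast_steps, fast_latency_s):
    """Split an end-to-end speedup into a step-count part and a per-step part.

    Assumption (for illustration only): latency is proportional to the
    number of denoising steps, with no fixed overhead.
    """
    total = base_latency_s / fast_latency_s   # overall speedup
    from_steps = base_steps / fast_steps      # gain from fewer steps alone
    per_step = total / from_steps             # gain each remaining step must deliver
    return {"total": total, "from_steps": from_steps, "per_step": per_step}


# Figures taken from the one-liner: 50 steps in 30 s vs. 4 steps in 0.5 s
result = decompose_speedup(50, 30.0, 4, 0.5)
for name, gain in result.items():
    print(f"{name}: {gain:.1f}x")
```

Under those assumptions, step reduction accounts for 12.5x of the 60x total, so the remaining 4.8x per-step gain has to come from precision, compilation, caching, and batching combined.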
## 💻 Patterns

### Step reduction (LCM-LoRA)

```python
from diffusers import StableDiffusionXLPipeline, LCMScheduler
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
# Swap in the LCM scheduler and load the LCM-LoRA distillation weights
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")

prompt = "a photo of a cat"  # example prompt
# 4-step gen
img = pipe(prompt, num_inference_steps=4, guidance_scale=1.0).images[0]
# 50-step (3.5s) → 4-step (0.4s) on A100
```

### torch.compile

```python
# Compile the two hot paths: the UNet and the VAE decoder
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
pipe.vae.decode = torch.compile(pipe.vae.decode, mode="reduce-overhead")

# warmup (the first call pays the compilation cost)
_ = pipe("warmup", num_inference_steps=4)
# 1.4-2x speedup after warmup
```

### TensorRT (production)

```python
# Export → TensorRT engine
from polygraphy.backend.common import BytesFromPath
from polygraphy.backend.trt import EngineFromBytes, TrtRunner

# 1. ONNX export
# dummy_inputs: example tensors for sample, timestep, encoder_hidden_states
torch.onnx.export(pipe.unet, dummy_inputs, "unet.onnx",
                  opset_version=17, dynamic_axes={...})

# 2. trtexec build
# trtexec --onnx=unet.onnx --saveEngine=unet.plan --fp16 --memPoolSize=workspace:8192

# 3. Runtime: load the serialized engine built in step 2
with TrtRunner(EngineFromBytes(BytesFromPath("unet.plan"))) as r:
    # x, t, h: NumPy arrays matching the engine's input shapes
    out = r.infer({"sample": x, "timestep": t, "encoder_hidden_states": h})
# 2-3x faster than torch.compile
```

### FP8 quantization (Hopper / Ada)

```python
from optimum.quanto import quantize, qfloat8, freeze

# Assumes `pipe` is a transformer-based pipeline (one that exposes pipe.transformer)
quantize(pipe.transformer, weights=qfloat8, activations=qfloat8)
freeze(pipe.transformer)  # replace the float weights with their quantized form
# memory: 24GB → 13GB; latency: 1.3x faster on H100
```

### Prompt embed cache

```python
import hashlib
from pathlib import Path

import torch

class EmbedCache:
    def __init__(self, cache_dir="./.embed_cache"):
        self.dir = Path(cache_dir)
        self.dir.mkdir(exist_ok=True)

    def get_or_compute(self, prompt, encoder_fn):
        # Key the cache file on a hash of the prompt text
        key = hashlib.sha256(prompt.encode()).hexdigest()
        p = self.dir / f"{key}.pt"
        if p.exists():
            return torch.load(p)
        emb = encoder_fn(prompt)
        torch.save(emb, p)
        return emb

cache = EmbedCache()
emb = cache.get_or_compute(prompt, pipe.encode_prompt)
# repeat prompt: skip text encoder entirely
```

### Continuous batching (server)

```python
# vLLM-style continuous batching for diffusion (sdxl-batched-server)
# Note: this sketch batches per time window (dynamic batching); true continuous
# batching would also admit new requests between denoising steps.
import asyncio
from collections import deque

class BatchedServer:
    def __init__(self, max_batch=8, wait_ms=20):
        self.q = deque()
        self.max_batch = max_batch
        self.wait_ms = wait_ms

    async def submit(self, prompt):
        fut = asyncio.get_running_loop().create_future()
        self.q.append((prompt, fut))
        return await fut

    async def loop(self):
        while True:
            await asyncio.sleep(self.wait_ms / 1000)
            if not self.q:
                continue
            batch = [self.q.popleft() for _ in range(min(len(self.q), self.max_batch))]
            prompts = [p for p, _ in batch]
            try:
                # Run the blocking pipeline call off the event loop
                imgs = (await asyncio.to_thread(pipe, prompts)).images
            except Exception as exc:
                # Propagate the failure to every waiting caller
                for _, fut in batch:
                    if not fut.done():
                        fut.set_exception(exc)
                continue
            for (_, fut), img in zip(batch, imgs):
                if not fut.done():
                    fut.set_result(img)
```

### Progressive resolution

```python
# Cascade: 256 → 512 → 1024
# img2img_pipe: an image-to-image pipeline sharing pipe's weights,
# e.g. AutoPipelineForImage2Image.from_pipe(pipe)
img_lo = pipe(prompt, height=256, width=256, num_inference_steps=8).images[0]
# img2img takes its output size from the input image, so upscale first
img_md = img2img_pipe(prompt, image=img_lo.resize((512, 512)), strength=0.5,
                      num_inference_steps=8).images[0]
img_hi = img2img_pipe(prompt, image=img_md.resize((1024, 1024)), strength=0.3,
                      num_inference_steps=8).images[0]
# Total cost < single-pass 1024
```

### MLX (Apple Silicon)

```python
import mlx.core as mx
# Assumes an MLX Stable Diffusion wrapper exposing this interface;
# adjust the import to the package you actually use
from mlx_diffusion import StableDiffusion

sd = StableDiffusion("stabilityai/sdxl-turbo", float16=True)
img = sd.generate("a cat", n_steps=4, n_images=4)
# M3 Max: 4-step 1024px in ~1.2s
```

## Decision criteria

| Situation | Approach |
|---|---|
| latency critical | distill (4-step) + TensorRT |
| memory tight | FP8/INT4 quantize |
| Apple device | MLX |
| repeat prompts | embed cache |
| many concurrent | continuous batch |
| highest quality | full 50-step + xformers |

**Default**: 4-step LCM/Lightning + torch.compile + FP16; escalate to TensorRT above 10 RPS.

## 🔗 Graph

- Parent: [[AI 이미지 생성 (AI Image Generation)]]
- Adjacent: [[TensorRT]] · [[torch.compile]] · [[오픈소스 이미지 모델 미세 조정 및 배포]]

## 🤖 LLM usage

**When to use**: interpreting bottleneck profiles, planning kernel fusion, writing deployment configs.

**When not to use**: low-level CUDA kernel writing; consult the Triton/CUTLASS docs directly.

## ❌ Anti-patterns

- **Optimize before profile**: guessing at bottlenecks without NVTX ranges or the PyTorch profiler.
- **Over-distillation**: dropping to 1 step and hitting a quality cliff because perceptual evaluation was skipped.
- **Quantize without calib**: relying on dynamic quantization alone, which makes quality collapse.
- **Single-process bottleneck**: a synchronous server that ignores the GIL.

## 🧪 Verification / duplicates

- Verified (LCM paper Luo 2023, SDXL Lightning ByteDance 2024, NVIDIA TRT-LLM docs).
- Trust level A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: distillation + quantize + compile stack. |