Modal / Replicate / AWS Lambda / Cloudflare Workers AI
Serverless Computing for AI
In one line
"Rent GPU/CPU per request: scale to zero, accept cold starts, no capacity planning." As of 2026 the dominant platforms are Modal (Python-native, fast cold start), Replicate (model marketplace), Cloudflare Workers AI (edge inference), AWS Lambda (CPU-only, no GPU), Beam, and RunPod Serverless. Inference makes up about 90% of the workload; for training, spot or dedicated instances are more economical.
Key points
Platform comparison
Modal: Python decorators, ~2-15 s cold start, A100/H100/H200, per-second billing. Developer-friendly.
Replicate: model-first, Cog containers. Community model marketplace.
Cold-start time is dominated by model size. 7B fp16 ≈ 14 GB, and the S3 → GPU pull is slow.
Mitigation: keep-warm (min replicas=1), persistent volume, model preload, FP8/INT4 quantization.
2026 trick: snapshot-restore (CRIU-like), prefetch via NVMe-backed volume.
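The 14 GB figure and the effect of quantization and faster storage can be checked with back-of-the-envelope arithmetic. The sketch below is only an estimate: the throughput numbers in `SOURCES` are illustrative assumptions, not measurements of any platform, and the helper names are made up for this note.

```python
def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params_billion * bytes_per_param


def load_seconds(size_gb: float, throughput_gb_per_s: float) -> float:
    """Time to stream the weights at a sustained throughput."""
    return size_gb / throughput_gb_per_s


# Assumed sustained throughputs in GB/s (illustrative only).
SOURCES = {"object storage": 0.5, "NVMe-backed volume": 3.0}
# Bytes per parameter for each precision.
PRECISIONS = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}

for precision, bytes_per_param in PRECISIONS.items():
    size = weights_gb(7, bytes_per_param)
    for source, gb_per_s in SOURCES.items():
        secs = load_seconds(size, gb_per_s)
        print(f"7B {precision}: {size:.1f} GB from {source}: {secs:.1f} s")
```

Under these assumptions, both levers (fewer bytes per parameter, faster storage) shrink the weight-loading part of a cold start multiplicatively. Container boot and CUDA initialization are not modeled.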
Cost model
Per-second GPU billing. H100 at ~$3-4/hr → ~$0.001/sec.
Whether cold-start time is billed depends on the platform.
Bursty traffic: serverless wins. Steady high QPS: dedicated instances win.
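The bursty-versus-steady rule of thumb comes down to a break-even utilization. A minimal sketch, assuming the ~$0.001/sec serverless price from above and a hypothetical $2.50/hr dedicated H100 price (that number is an assumption for illustration, not a quote):

```python
SECONDS_PER_HOUR = 3600


def serverless_cost(busy_seconds: float, price_per_sec: float) -> float:
    """Pay only for seconds the GPU is actually allocated.

    Simplification: ignores billed idle time before scale-down and
    any billed cold-start time.
    """
    return busy_seconds * price_per_sec


def dedicated_cost(hours: float, price_per_hour: float) -> float:
    """Pay for every hour the instance exists, busy or idle."""
    return hours * price_per_hour


def break_even_utilization(price_per_sec: float, dedicated_per_hour: float) -> float:
    """Fraction of time the GPU must be busy before dedicated becomes cheaper."""
    return dedicated_per_hour / (price_per_sec * SECONDS_PER_HOUR)


SERVERLESS_PER_SEC = 0.001   # from the H100 estimate above
DEDICATED_PER_HOUR = 2.50    # hypothetical dedicated price

util = break_even_utilization(SERVERLESS_PER_SEC, DEDICATED_PER_HOUR)
print(f"break-even utilization: {util:.0%}")

month_hours = 720
for busy_fraction in (0.05, 0.25, 0.90):
    busy = busy_fraction * month_hours * SECONDS_PER_HOUR
    s = serverless_cost(busy, SERVERLESS_PER_SEC)
    d = dedicated_cost(month_hours, DEDICATED_PER_HOUR)
    print(f"{busy_fraction:.0%} busy: serverless ${s:,.2f} vs dedicated ${d:,.2f}")
```

Below the break-even utilization serverless is cheaper; above it the dedicated instance wins. The crossover moves with whatever prices are plugged in.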
Applications
Image gen API (FLUX, SDXL on Replicate/Modal).
Whisper batch transcription.
Embedding service (BGE, E5).
Webhook → LLM → DB pipeline.
Background job (PDF extraction, OCR).
Edge AI: Cloudflare Workers AI for low-latency global.
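As a rough illustration of the webhook → LLM → DB shape, here is a stdlib-only sketch. `fake_llm` is a hypothetical stand-in for a real model call, and the payload fields (`id`, `text`) and table layout are assumptions made up for this example. On a serverless platform the body of `handle_webhook` would be the request handler.

```python
import json
import sqlite3


def fake_llm(text: str) -> str:
    """Hypothetical stand-in for a real model call; just truncates the input."""
    return text[:40]


def handle_webhook(body: bytes, db: sqlite3.Connection, llm=fake_llm) -> int:
    """Parse the webhook payload, run the model, persist the result, return the row id."""
    payload = json.loads(body)
    result = llm(payload["text"])
    cur = db.execute(
        "INSERT INTO results (source_id, output) VALUES (?, ?)",
        (payload["id"], result),
    )
    db.commit()
    return cur.lastrowid


# In-memory database so the sketch runs anywhere.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE results (id INTEGER PRIMARY KEY, source_id TEXT, output TEXT)")

body = json.dumps({"id": "evt_1", "text": "Invoice overdue"}).encode()
row_id = handle_webhook(body, db)
print(row_id, db.execute("SELECT output FROM results WHERE id = ?", (row_id,)).fetchone())
```

Passing the model call in as a parameter keeps the handler testable without a GPU. Swapping `fake_llm` for a remote inference call would not change the handler's structure.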
💻 Patterns
Modal — image gen endpoint
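```python
import io

import modal

app = modal.App("flux-api")
image = modal.Image.debian_slim().pip_install("torch", "diffusers", "transformers")


# scaledown_window=120: keep the container warm for 120 s after the last request.
@app.cls(gpu="H100", image=image, scaledown_window=120)
class FLUX:
    @modal.enter()
    def load(self):
        # Runs once per container start, so weights load only on a cold start.
        import torch
        from diffusers import FluxPipeline

        self.pipe = FluxPipeline.from_pretrained(
            "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
        ).to("cuda")

    @modal.method()
    def generate(self, prompt: str) -> bytes:
        img = self.pipe(prompt, num_inference_steps=28).images[0]
        buf = io.BytesIO()
        img.save(buf, "PNG")
        return buf.getvalue()  # PNG-encoded image bytes


@app.local_entrypoint()
def main():
    # Prints the size of the returned PNG in bytes.
    print(len(FLUX().generate.remote("a cat on mars")))
```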
When to use: bursty traffic, frequent model swaps, dev velocity matters, small ops team.
When not to use: steady very-high QPS (cost), strict latency SLA (cold start), data residency constraints.
❌ Anti-patterns
Ignoring cold starts: a 5-30 s first response on a user-facing path is bad UX. Use keep-warm or prefetch.
Big container image: a 50 GB image makes every pull slow. Use layer caching and a persistent volume.
Long-running training on serverless: hits hourly runtime caps and is expensive. Use spot instances instead.
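To see how much a scale-down window or keep-warm setting helps against the cold-start anti-pattern, a request trace can be replayed through a toy model. This is a simplified sketch: it assumes a single container, ignores request and boot durations, and the function and parameter names are illustrative rather than any platform's API.

```python
def count_cold_starts(
    arrivals: list[float], scaledown_window: float, min_replicas: int = 0
) -> int:
    """Count requests that arrive when no warm container exists.

    arrivals: sorted request timestamps in seconds.
    scaledown_window: seconds a container stays warm after its last request.
    min_replicas: if > 0, one container is always warm, so nothing is cold.
    """
    if min_replicas > 0:
        return 0
    cold = 0
    warm_until = float("-inf")  # no container exists before the first request
    for t in arrivals:
        if t > warm_until:
            cold += 1  # container had scaled to zero
        warm_until = t + scaledown_window  # each request resets the idle timer
    return cold


trace = [0, 30, 100, 400, 450, 1000]  # hypothetical arrival times (s)
for window in (120, 600):
    print(f"window={window}s: {count_cold_starts(trace, window)} cold starts")
print("keep-warm:", count_cold_starts(trace, 120, min_replicas=1), "cold starts")
```

A longer window or a warm replica trades idle GPU cost for fewer slow first responses, so the result is best read together with the cost model rather than on its own.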