---
id: wiki-2026-0508-serverless-computing-for-ai
title: Serverless Computing for AI
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Serverless GPU, Modal, Replicate, Cloudflare Workers AI, Lambda AI]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [infrastructure, serverless, deployment, mlops]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: Modal / Replicate / AWS Lambda / Cloudflare Workers AI
---

# Serverless Computing for AI

## One-line summary

> **"Consume GPU/CPU per request: scale to zero, accept cold starts, stop worrying about capacity."** As of 2026 the dominant platforms are Modal (Python-native, fast cold start), Replicate (model marketplace), Cloudflare Workers AI (edge inference), AWS Lambda (CPU only, small models), Beam, and RunPod Serverless. Inference makes up about 90% of usage; training is more economical on spot or dedicated instances.

## Core concepts

### Platform comparison

- **Modal**: Python decorators, ~2-15 s cold start, A100/H100/H200, per-second billing. Developer-friendly.
- **Replicate**: model-first, Cog containers. Community model marketplace.
- **Cloudflare Workers AI**: edge, ~ms cold start, but limited to a curated set of models.
- **AWS Lambda**: 10 GB memory, 15-minute limit, no GPU. Suited to CPU inference on small models.
- **Bedrock / Vertex / Azure OpenAI**: fully managed model APIs, the most serverless option of all.
- **RunPod Serverless / Beam**: lower price, longer cold start.

### Cold start

- Model size dominates. A 7B model in fp16 is about 14 GB, and pulling that from S3 onto the GPU is slow.
- Mitigation: keep-warm (min replicas = 1), persistent volume, model preload, FP8/INT4 quantization.
- 2026 tricks: snapshot-restore (CRIU-like), prefetch via NVMe-backed volume.

### Cost model

- Per-second GPU billing. H100 at ~$3-4/hr works out to roughly $0.001/sec.
- Whether cold-start time is billed depends on the platform.
- Bursty traffic: serverless wins. Steady high QPS: a dedicated instance wins.

### Applications

1. Image generation API (FLUX, SDXL on Replicate/Modal).
2. Whisper batch transcription.
3. Embedding service (BGE, E5).
4. Webhook → LLM → DB pipeline.
5. Background jobs (PDF extraction, OCR).
6. Edge AI: Cloudflare Workers AI for low-latency global serving.

## 💻 Patterns

### Modal: image gen endpoint

```python
import io

import modal

app = modal.App("flux-api")
image = modal.Image.debian_slim().pip_install("torch", "diffusers", "transformers")

@app.cls(gpu="H100", image=image, scaledown_window=120)
class FLUX:
    @modal.enter()
    def load(self):
        # Runs once per container start, so the weights load only on cold start.
        import torch
        from diffusers import FluxPipeline

        self.pipe = FluxPipeline.from_pretrained(
            "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
        ).to("cuda")

    @modal.method()
    def generate(self, prompt: str) -> bytes:
        img = self.pipe(prompt, num_inference_steps=28).images[0]
        buf = io.BytesIO()
        img.save(buf, "PNG")
        return buf.getvalue()

@app.local_entrypoint()
def main():
    # Prints the size in bytes of the generated PNG.
    print(len(FLUX().generate.remote("a cat on mars")))
```

### Replicate: run a public model

```python
import replicate

out = replicate.run(
    "black-forest-labs/flux-1.1-pro",
    input={"prompt": "neon city", "aspect_ratio": "16:9"},
)
print(out)  # reference to the generated image (URL)
```

### Cloudflare Workers AI

```javascript
export default {
  async fetch(req, env) {
    const ai = env.AI;
    const r = await ai.run("@cf/meta/llama-3.1-8b-instruct", { prompt: "Hello" });
    return Response.json(r);
  }
}
```

### AWS Lambda: small CPU LLM (llama.cpp)

```python
import json

from llama_cpp import Llama

# Load at module scope so warm invocations reuse the model
# instead of reloading it on every request.
llm = Llama(model_path="/opt/ml/qwen2-0.5b.gguf", n_ctx=2048)

def handler(event, ctx):
    out = llm(event["prompt"], max_tokens=128)
    return {"statusCode": 200, "body": json.dumps(out)}
```

### Modal scheduled batch

```python
# `app` is a modal.App as in the first pattern; fetch_new_docs, embed_model,
# and upsert are application-specific helpers not shown here.
@app.function(schedule=modal.Cron("0 2 * * *"), gpu="A10G")  # daily at 02:00
def nightly_embed():
    docs = fetch_new_docs()
    embs = embed_model(docs)
    upsert(embs)
```

### Keep-warm pattern

```python
@app.cls(gpu="A100", min_containers=1, scaledown_window=600)
class WarmModel:
    ...
```

### Streaming response (Modal + FastAPI)

```python
@app.function(gpu="A100")
@modal.fastapi_endpoint(method="POST")
def stream(req: dict):
    from fastapi.responses import StreamingResponse

    def gen():
        # `llm` stands in for a model client loaded elsewhere; it is not defined here.
        for tok in llm.stream(req["prompt"]):
            yield tok

    return StreamingResponse(gen(), media_type="text/plain")
```

## Decision criteria

| Situation | Approach |
|---|---|
| Python ML researcher, fast iteration | Modal |
| Public model, no infra | Replicate |
| Global low-latency edge | Cloudflare Workers AI |
| Using the OpenAI / Claude / Gemini API | Already serverless: call the API directly |
| Steady high QPS | Dedicated GPU (vLLM on K8s) |
| CPU-only small model | AWS Lambda + llama.cpp |
| Tightest price | RunPod Serverless / Beam |

**Default**: Modal for Python-first development; API providers (Claude/OpenAI) for foundation models.

## 🔗 Graph

- Parent: [[MLOps]]
- Variants: [[Modal]] · [[Replicate]] · [[Cloudflare-Workers-AI]]
- Applications: [[Edge-AI]]
- Adjacent: [[LLM_Optimization_and_Deployment_Strategies|vLLM]] · [[Kubernetes]]

## 🤖 LLM usage

**When**: traffic is bursty, models are swapped often, dev velocity matters, and the ops team is small.
**When not**: steady very-high QPS (cost), strict latency SLA (cold start), data residency constraints.

## ❌ Anti-patterns

- **Ignoring cold start**: a 5-30 s first response on a user-facing path is bad UX. Use keep-warm or prefetch.
- **Big container image**: a 50 GB image makes every pull slow. Use layer caching and a persistent volume.
- **Long-running training on serverless**: runs into time caps and is expensive. Use spot instances.
- **No timeout**: a runaway request makes the bill explode.
- **State on local disk**: local disk is ephemeral. Use S3 or a Volume.

## 🧪 Verification / Duplicates

- Verified (Modal docs, Replicate docs, Cloudflare AI docs, AWS Lambda docs 2025-2026).
- Trust level: A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: Modal/Replicate/CF/Lambda 2026 comparison |
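
## Cost break-even sketch

The cost model above (per-second GPU billing, with bursty traffic favoring serverless and steady high QPS favoring dedicated) can be made concrete with a little arithmetic. The sketch below is illustrative only: the $3.60/hr serverless rate is picked from inside the note's ~$3-4/hr H100 range, while the $2.50/hr dedicated rate and the traffic figure are hypothetical assumptions, not quoted prices. It also ignores cold-start billing and keep-warm replicas.

```python
SECONDS_PER_HOUR = 3600


def per_second_rate(hourly_usd: float) -> float:
    """Convert an hourly GPU price into a per-second price."""
    return hourly_usd / SECONDS_PER_HOUR


def serverless_monthly_cost(hourly_usd: float, busy_seconds_per_day: float, days: int = 30) -> float:
    """Serverless bills only for seconds the GPU is actually busy."""
    return per_second_rate(hourly_usd) * busy_seconds_per_day * days


def dedicated_monthly_cost(hourly_usd: float, days: int = 30) -> float:
    """A dedicated instance bills around the clock, busy or idle."""
    return hourly_usd * 24 * days


def break_even_utilization(serverless_hourly_usd: float, dedicated_hourly_usd: float) -> float:
    """Fraction of the day the GPU must be busy before dedicated becomes cheaper."""
    return dedicated_hourly_usd / serverless_hourly_usd


if __name__ == "__main__":
    serverless_rate = 3.60  # USD/hr, assumed value within the note's ~$3-4/hr range
    dedicated_rate = 2.50   # USD/hr, hypothetical reserved/dedicated price
    busy = 2 * SECONDS_PER_HOUR  # hypothetical: GPU busy 2 hours per day

    print(f"per-second rate: ${per_second_rate(serverless_rate):.4f}")
    print(f"serverless / month: ${serverless_monthly_cost(serverless_rate, busy):.2f}")
    print(f"dedicated / month: ${dedicated_monthly_cost(dedicated_rate):.2f}")
    print(f"break-even utilization: {break_even_utilization(serverless_rate, dedicated_rate):.0%}")
```

Under these assumed prices, serverless stays cheaper until the GPU is busy for most of the day, which matches the rule of thumb in the decision table; with different price ratios the break-even point moves accordingly.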