---
id: wiki-2026-0508-serverless-computing-for-ai
title: Serverless Computing for AI
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Serverless GPU, Modal, Replicate, Cloudflare Workers AI, Lambda AI]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [infrastructure, serverless, deployment, mlops]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: Modal / Replicate / AWS Lambda / Cloudflare Workers AI
---

# Serverless Computing for AI

## One-line summary
> **"Rent GPU/CPU per request: scale to zero, accept cold starts, and stop worrying about capacity."** As of 2026 the dominant platforms are Modal (Python-native, fast cold start), Replicate (model marketplace), Cloudflare Workers AI (edge inference), AWS Lambda (CPU only), Beam, and RunPod Serverless. Inference accounts for roughly 90% of usage; for training, spot or dedicated instances are more economical.

## Key points

### Platform comparison
- **Modal**: Python decorators, ~2-15s cold start, A100/H100/H200, per-second billing. Developer-friendly.
- **Replicate**: model-first, Cog containers. Community model marketplace.
- **Cloudflare Workers AI**: edge inference, millisecond-scale cold start, but limited to a curated model catalog.
- **AWS Lambda**: up to 10 GB memory, 15-minute execution limit, no GPU. Suited to CPU inference with small models.
- **Bedrock / Vertex / Azure OpenAI**: fully managed model APIs, the far end of the serverless spectrum.
- **RunPod Serverless / Beam**: lower price, longer cold start.

### Cold start
- Model size dominates. A 7B model in fp16 is ≈ 14 GB, and pulling that from S3 into GPU memory is slow.
- Mitigation: keep-warm (min replicas=1), persistent volume, model preload, FP8/INT4 quantization.
- 2026 techniques: snapshot-restore (CRIU-like), prefetch via an NVMe-backed volume.
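
The size arithmetic behind the first bullet can be made explicit. The sketch below is a back-of-the-envelope estimate, not a benchmark: the function names and the two throughput figures (0.5 GB/s for object storage, 5 GB/s for an NVMe-backed volume) are assumptions chosen for illustration, and a real cold start also includes container boot and CUDA initialization.

```python
# Rough cold-start estimate: weight size from parameter count, then transfer time.
# Bytes per parameter for common weight formats.
BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_size_gb(params_billion: float, dtype: str) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes), ignoring file overhead."""
    return params_billion * BYTES_PER_PARAM[dtype]


def pull_seconds(size_gb: float, throughput_gb_per_s: float) -> float:
    """Time to move the weights at a sustained throughput."""
    return size_gb / throughput_gb_per_s


if __name__ == "__main__":
    # Both throughputs are assumed values, not measurements of any platform.
    OBJECT_STORE_GBPS = 0.5
    NVME_VOLUME_GBPS = 5.0
    for dtype in ("fp16", "fp8", "int4"):
        size = weight_size_gb(7, dtype)
        print(
            f"7B {dtype}: {size:.1f} GB, "
            f"object store {pull_seconds(size, OBJECT_STORE_GBPS):.1f}s, "
            f"NVMe volume {pull_seconds(size, NVME_VOLUME_GBPS):.1f}s"
        )
```

Under these assumptions, quantization and a faster storage tier both shrink the transfer term, which is why they appear together in the mitigation list.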

### Cost model
- Per-second GPU billing. An H100 at ~$3-4/hr works out to about $0.001/sec.
- Whether cold-start time is billed depends on the platform.
- Bursty traffic: serverless wins. Steady high QPS: a dedicated instance wins.
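
The trade-off in the last bullet comes down to utilization. A minimal sketch, assuming a serverless H100 at $3.60/hr (inside the range quoted above) and a hypothetical dedicated rate of $2.00/hr; both prices and all function names are illustrative, not quotes from any provider.

```python
SECONDS_PER_HOUR = 3600


def per_second_rate(hourly_usd: float) -> float:
    """Convert an hourly GPU price to a per-second price."""
    return hourly_usd / SECONDS_PER_HOUR


def monthly_serverless_cost(busy_seconds_per_day: float, hourly_usd: float, days: int = 30) -> float:
    """Serverless bills only for the seconds the GPU is actually busy."""
    return busy_seconds_per_day * per_second_rate(hourly_usd) * days


def monthly_dedicated_cost(hourly_usd: float, days: int = 30) -> float:
    """A dedicated instance bills around the clock, busy or idle."""
    return hourly_usd * 24 * days


def break_even_utilization(serverless_hourly: float, dedicated_hourly: float) -> float:
    """Fraction of the day the GPU must be busy before dedicated becomes cheaper."""
    return dedicated_hourly / serverless_hourly


if __name__ == "__main__":
    serverless, dedicated = 3.60, 2.00  # assumed prices in USD/hr
    print(f"per-second rate: ${per_second_rate(serverless):.4f}")
    print(f"break-even utilization: {break_even_utilization(serverless, dedicated):.0%}")
    for busy_hours in (1, 8, 20):
        cost = monthly_serverless_cost(busy_hours * SECONDS_PER_HOUR, serverless)
        print(f"{busy_hours:>2} busy h/day: serverless ${cost:,.0f} vs dedicated ${monthly_dedicated_cost(dedicated):,.0f}")
```

Below the break-even utilization the idle hours of a dedicated instance cost more than the serverless premium; above it the relationship flips.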

### Applications
1. Image gen API (FLUX, SDXL on Replicate/Modal).
2. Whisper batch transcription.
3. Embedding service (BGE, E5).
4. Webhook → LLM → DB pipeline.
5. Background job (PDF extraction, OCR).
6. Edge AI: Cloudflare Workers AI for low-latency global serving.

## 💻 Patterns

### Modal — image gen endpoint
```python
import io

import modal

app = modal.App("flux-api")
image = modal.Image.debian_slim().pip_install("torch", "diffusers", "transformers")

# scaledown_window: seconds an idle container stays up before scaling to zero.
@app.cls(gpu="H100", image=image, scaledown_window=120)
class FLUX:
    @modal.enter()
    def load(self):
        # Runs once per container start, so the weights load only on cold start.
        from diffusers import FluxPipeline
        import torch
        self.pipe = FluxPipeline.from_pretrained(
            "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
        ).to("cuda")

    @modal.method()
    def generate(self, prompt: str) -> bytes:
        img = self.pipe(prompt, num_inference_steps=28).images[0]
        # Serialize the PIL image to PNG bytes so it can be returned to the caller.
        buf = io.BytesIO()
        img.save(buf, "PNG")
        return buf.getvalue()

@app.local_entrypoint()
def main():
    print(len(FLUX().generate.remote("a cat on mars")))
```

### Replicate — run a public model
```python
import replicate
out = replicate.run(
    "black-forest-labs/flux-1.1-pro",
    input={"prompt": "neon city", "aspect_ratio": "16:9"},
)
print(out)  # URL
```

### Cloudflare Workers AI
```javascript
export default {
  async fetch(req, env) {
    const ai = env.AI;
    const r = await ai.run("@cf/meta/llama-3.1-8b-instruct", {
      prompt: "Hello"
    });
    return Response.json(r);
  }
}
```

### AWS Lambda — small CPU LLM (llama.cpp)
```python
import json

from llama_cpp import Llama

# Load once at module scope so warm invocations reuse the model
# instead of reloading it on every request.
llm = Llama(model_path="/opt/ml/qwen2-0.5b.gguf", n_ctx=2048)

def handler(event, ctx):
    out = llm(event["prompt"], max_tokens=128)
    return {"statusCode": 200, "body": json.dumps(out)}
```

### Modal scheduled batch
```python
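# fetch_new_docs, embed_model, and upsert are application-specific helpers (not shown).
@app.function(schedule=modal.Cron("0 2 * * *"), gpu="A10G")  # daily at 02:00
def nightly_embed():
    docs = fetch_new_docs()
    embs = embed_model(docs)
    upsert(embs)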
```

### Keep-warm pattern
```python
@app.cls(gpu="A100", min_containers=1, scaledown_window=600)
class WarmModel: ...
```

### Streaming response (Modal + FastAPI)
```python
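# `llm` is assumed to be a model object loaded elsewhere that exposes a
# token-yielding stream() method; it is not defined in this snippet.
@app.function(gpu="A100")
@modal.fastapi_endpoint(method="POST")
def stream(req: dict):
    from fastapi.responses import StreamingResponse

    def gen():
        # Yield tokens as they are produced so the client sees output immediately.
        for tok in llm.stream(req["prompt"]):
            yield tok

    return StreamingResponse(gen(), media_type="text/plain")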
```

## Decision criteria
| Situation | Approach |
|---|---|
| Python ML researcher, fast iter | Modal |
| Public model, no infra | Replicate |
| Global low-latency edge | Cloudflare Workers AI |
| Use OpenAI / Claude / Gemini API | Already serverless: call the API directly |
| Steady high QPS | Dedicated GPU (vLLM on K8s) |
| CPU-only small model | AWS Lambda + llama.cpp |
| Tightest price | RunPod Serverless / Beam |

**Default**: Modal (Python-first dev), API providers (Claude/OpenAI) for foundation models.

## 🔗 Graph
- Parent: [[MLOps]]
- Variants: [[Modal]] · [[Replicate]] · [[Cloudflare-Workers-AI]]
- Applications: [[Edge-AI]]
- Adjacent: [[LLM_Optimization_and_Deployment_Strategies|vLLM]] · [[Kubernetes]]

## 🤖 LLM usage
**When**: traffic is bursty, models are swapped often, dev velocity matters, and the ops team is small.
**When not**: steady very-high QPS (cost), strict latency SLA (cold start), data residency constraints.

## ❌ Anti-patterns
- **Ignoring cold start**: a 5-30s first response on a user-facing path is bad UX. Use keep-warm or prefetch.
- **Big container image**: a 50 GB image makes every pull slow. Use layer caching and a persistent volume.
- **Long-running training on serverless**: execution time caps apply and it is expensive. Use spot instances.
- **No timeout**: a runaway request makes the bill explode.
- **State on local disk**: local disk is ephemeral. Use S3 or a volume.
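
The "No timeout" item can be quantified. The sketch below bounds the spend from one wave of stuck requests, assuming each container hangs on a single request until the timeout fires; the $3.60/hr price, the 300-second timeout, and the container limit of 10 are illustrative assumptions, and `worst_case_cost` is a hypothetical helper.

```python
def worst_case_cost(timeout_s: float, max_containers: int, hourly_usd: float) -> float:
    """Upper bound on spend if every container runs one request until it times out."""
    return timeout_s * max_containers * hourly_usd / 3600


if __name__ == "__main__":
    price = 3.60  # assumed USD/hr for one GPU container
    capped = worst_case_cost(timeout_s=300, max_containers=10, hourly_usd=price)
    # Without a timeout, a single hung container can bill for a full day (or longer).
    uncapped_one_day = worst_case_cost(timeout_s=24 * 3600, max_containers=1, hourly_usd=price)
    print(f"capped wave (10 containers, 300s timeout): ${capped:.2f}")
    print(f"one hung container for 24h, no timeout:    ${uncapped_one_day:.2f}")
```

A timeout paired with a container limit turns an open-ended bill into a number that can be computed in advance.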

## 🧪 Verification / duplicates
- Verified (Modal docs, Replicate docs, Cloudflare AI docs, AWS Lambda docs 2025-2026).
- Trust level A.

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — Modal/Replicate/CF/Lambda 2026 comparison |