2nd/10_Wiki/Topics/AI_and_ML/Serverless-Computing-for-AI.md
Antigravity Agent f8b21af4be Wiki cleanup: error-doc removal, dedup merge, link normalization
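Large-scale cleanup of 10_Wiki/Topics:
- Removed 227 error-capture and unfinished stub documents
- Merged 43 cross-folder duplicate clusters (63 files → redirect)
- Normalized link names: fixed broken links, pointed links directly at redirect targets, and mapped concepts (~2,400 items)
- Created 6 new category MOCs
- Removed 10,058 unresolved related-keyword links from Graph sections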

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 23:52:15 +09:00


id: wiki-2026-0508-serverless-computing-for-ai
title: Serverless Computing for AI
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: Serverless GPU, Modal, Replicate, Cloudflare Workers AI, Lambda AI
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: infrastructure, serverless, deployment, mlops
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language: python; framework: Modal / Replicate / AWS Lambda / Cloudflare Workers AI

Serverless Computing for AI

In one line

"GPU/CPU billed per request: scale to zero, accept cold starts, no capacity planning." As of 2026, the dominant platforms are Modal (Python-native, fast cold start), Replicate (model marketplace), Cloudflare Workers AI (edge inference), AWS Lambda (CPU only, small models), Beam, and RunPod Serverless. Inference accounts for roughly 90% of serverless AI workloads; for training, spot or dedicated instances are more economical.

Key points

Platform comparison

  • Modal: Python decorators, ~2-15 s cold start, A100/H100/H200, per-second billing. Developer-friendly.
  • Replicate: model-first, Cog containers. Community model marketplace.
  • Cloudflare Workers AI: edge, ~ms cold start, but limited to a curated model catalog.
  • AWS Lambda: 10 GB memory, 15 min execution limit, no GPU. Suited to CPU inference of small models.
  • Bedrock / Vertex / Azure OpenAI: fully managed model APIs, the far end of the serverless spectrum.
  • RunPod Serverless / Beam: lower price, longer cold start.

Cold start

  • Model size dominates. 7B parameters at fp16 ≈ 14 GB, and pulling that from S3 onto the GPU is slow.
  • Mitigation: keep-warm (min replicas = 1), persistent volume, model preload, FP8/INT4 quantization.
  • 2026 tricks: snapshot-restore (CRIU-like), prefetch via an NVMe-backed volume.
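
The size arithmetic above (7B parameters at fp16 ≈ 14 GB) can be turned into a rough weight-loading estimate. This is a minimal sketch, not a benchmark: the bandwidth figures in `BANDWIDTH_GB_PER_S` are hypothetical placeholders chosen for illustration, and the function names are not from any platform SDK. It ignores container pull, CUDA initialization, and host-to-GPU transfer.

```python
# Rough cold-start estimate: weight size divided by source bandwidth.
# All bandwidth numbers are illustrative assumptions, not measured values.

BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}

# Hypothetical sustained read throughput in GB/s for each weight source.
BANDWIDTH_GB_PER_S = {"object_store": 0.5, "nvme_volume": 3.0}


def model_size_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params_billion * bytes_per_param


def load_seconds(size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Time to read the weights once at a given sustained bandwidth."""
    return size_gb / bandwidth_gb_per_s


def estimate(params_billion: float = 7.0) -> dict:
    """Load time in seconds for every (precision, source) combination."""
    rows = {}
    for dtype, nbytes in BYTES_PER_PARAM.items():
        size = model_size_gb(params_billion, nbytes)
        for source, bandwidth in BANDWIDTH_GB_PER_S.items():
            rows[(dtype, source)] = load_seconds(size, bandwidth)
    return rows


if __name__ == "__main__":
    for (dtype, source), seconds in estimate().items():
        print(f"{dtype:>5} from {source:<12}: {seconds:6.1f} s")
```

Under these assumed bandwidths, quantization and a faster volume both shrink the load time by the same mechanism: fewer bytes, or more bytes per second.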

Cost model

  • Per-second GPU billing. H100 at ~$3-4/hr ≈ $0.001/sec.
  • Whether cold-start time is billed depends on the platform.
  • Bursty traffic: serverless wins. Steady high QPS: a dedicated instance wins.
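
The burst-versus-steady trade-off above comes down to a break-even utilization. A minimal sketch, assuming a serverless rate of $3.60/hr (inside the H100 range quoted above) and a hypothetical dedicated rate of $2.50/hr that is not taken from any price list. It ignores cold-start billing and keep-warm costs.

```python
# Break-even between per-second serverless billing and an always-on instance.
# The hourly rates used in the example are assumptions for illustration only.

SECONDS_PER_HOUR = 3600


def per_second_rate(hourly_rate: float) -> float:
    """Convert an hourly GPU price to a per-second price."""
    return hourly_rate / SECONDS_PER_HOUR


def serverless_hourly_cost(hourly_rate: float, utilization: float) -> float:
    """Cost of one wall-clock hour when only busy seconds are billed.

    utilization is the fraction of the hour the GPU is actually serving (0..1).
    """
    return per_second_rate(hourly_rate) * SECONDS_PER_HOUR * utilization


def break_even_utilization(serverless_rate: float, dedicated_rate: float) -> float:
    """Utilization above which the dedicated instance becomes cheaper."""
    return dedicated_rate / serverless_rate


def cheaper_option(utilization: float, serverless_rate: float, dedicated_rate: float) -> str:
    """Pick the cheaper option for a given steady utilization."""
    if serverless_hourly_cost(serverless_rate, utilization) < dedicated_rate:
        return "serverless"
    return "dedicated"


if __name__ == "__main__":
    serverless, dedicated = 3.60, 2.50  # $/hr, assumed
    print(f"per-second rate: ${per_second_rate(serverless):.4f}")
    print(f"break-even utilization: {break_even_utilization(serverless, dedicated):.0%}")
    for u in (0.1, 0.5, 0.9):
        print(f"utilization {u:.0%}: {cheaper_option(u, serverless, dedicated)}")
```

With these assumed prices, a service that is busy only a small fraction of each hour stays on serverless, while one that is busy most of the hour is cheaper on a dedicated GPU.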

Applications

  1. Image gen API (FLUX, SDXL on Replicate/Modal).
  2. Whisper batch transcription.
  3. Embedding service (BGE, E5).
  4. Webhook → LLM → DB pipeline.
  5. Background job (PDF extraction, OCR).
  6. Edge AI: Cloudflare Workers AI for low-latency global inference.

💻 Patterns

Modal — image gen endpoint

import io

import modal

app = modal.App("flux-api")
image = modal.Image.debian_slim().pip_install("torch", "diffusers", "transformers")

# scaledown_window=120 keeps an idle container alive for 120 s before scaling down.
@app.cls(gpu="H100", image=image, scaledown_window=120)
class FLUX:
    @modal.enter()
    def load(self):
        # Runs once per container start, so the weights are loaded once and reused.
        from diffusers import FluxPipeline
        import torch
        self.pipe = FluxPipeline.from_pretrained(
            "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
        ).to("cuda")

    @modal.method()
    def generate(self, prompt: str) -> bytes:
        img = self.pipe(prompt, num_inference_steps=28).images[0]
        # Serialize the PIL image to PNG bytes so it can be returned to the caller.
        buf = io.BytesIO()
        img.save(buf, "PNG")
        return buf.getvalue()

@app.local_entrypoint()
def main():
    # Prints the size in bytes of the generated PNG.
    print(len(FLUX().generate.remote("a cat on mars")))

Replicate — run a public model

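import replicate  # reads the API token from the REPLICATE_API_TOKEN environment variable

out = replicate.run(
    "black-forest-labs/flux-1.1-pro",
    input={"prompt": "neon city", "aspect_ratio": "16:9"},
)
print(out)  # prints the URL of the generated image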

Cloudflare Workers AI

export default {
  async fetch(req, env) {
    const ai = env.AI;
    const r = await ai.run("@cf/meta/llama-3.1-8b-instruct", {
      prompt: "Hello"
    });
    return Response.json(r);
  }
}

AWS Lambda — small CPU LLM (llama.cpp)

import json

from llama_cpp import Llama

# Load at module scope so warm invocations reuse the model instead of reloading it per request.
llm = Llama(model_path="/opt/ml/qwen2-0.5b.gguf", n_ctx=2048)

def handler(event, ctx):
    out = llm(event["prompt"], max_tokens=128)
    return {"statusCode": 200, "body": json.dumps(out)}

Modal scheduled batch

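# Assumes `modal` and `app` are defined as in the first example.
# fetch_new_docs, embed_model, and upsert are placeholder helpers, not library functions.
@app.function(schedule=modal.Cron("0 2 * * *"), gpu="A10G")  # runs daily at 02:00
def nightly_embed():
    docs = fetch_new_docs()
    embs = embed_model(docs)
    upsert(embs)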

Keep-warm pattern

@app.cls(gpu="A100", min_containers=1, scaledown_window=600)
class WarmModel: ...

Streaming response (Modal + FastAPI)

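# `llm` is a placeholder for a model client that exposes a token-streaming stream() method.
# The container image needs FastAPI installed for this endpoint to work.
@app.function(gpu="A100")
@modal.fastapi_endpoint(method="POST")
def stream(req: dict):
    from fastapi.responses import StreamingResponse
    def gen():
        # Yield tokens as they are produced so the client sees output immediately.
        for tok in llm.stream(req["prompt"]):
            yield tok
    return StreamingResponse(gen(), media_type="text/plain")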

Decision criteria

| Situation | Approach |
| --- | --- |
| Python ML researcher, fast iteration | Modal |
| Public model, no infra | Replicate |
| Global low-latency edge | Cloudflare Workers AI |
| Using the OpenAI / Claude / Gemini API | Already serverless; call the API directly |
| Steady high QPS | Dedicated GPU (vLLM on K8s) |
| CPU-only small model | AWS Lambda + llama.cpp |
| Tightest price | RunPod Serverless / Beam |

Default: Modal for Python-first development; API providers (Claude/OpenAI) for foundation models.

🔗 Graph

🤖 LLM usage

When to use: traffic is bursty, models are swapped often, dev velocity matters, and the ops team is small. When not to use: steady very-high QPS (cost), strict latency SLAs (cold start), data-residency constraints.

Anti-patterns

  • Ignoring cold start: a 5-30 s first response on a user-facing path is bad UX. Use keep-warm or prefetch.
  • Big container image: a 50 GB image makes every pull slow. Use layer caching and a persistent volume.
  • Long-running training on serverless: execution-time caps and high cost. Use spot instances.
  • No timeout: a runaway request makes the bill explode.
  • State on local disk: local disk is ephemeral. Use S3 or a Volume.

🧪 Verification / duplicates

  • Verified (Modal docs, Replicate docs, Cloudflare AI docs, AWS Lambda docs 2025-2026).
  • Trust level: A.

🕓 Changelog

| Date | Change |
| --- | --- |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: Modal/Replicate/CF/Lambda 2026 comparison |