[G1-Sync] Manual knowledge update

2026-05-10 22:08:15 +09:00
parent 21ac3ed255
commit 504fd5fb42
3011 changed files with 380280 additions and 206977 deletions
@@ -2,91 +2,179 @@
 id: wiki-2026-0508-serverless-computing-for-ai
 title: Serverless Computing for AI
 category: 10_Wiki/Topics
-status: needs_review
+status: verified
 canonical_id: self
-aliases: [SYS-SERVERLESS-AI-001]
+aliases: [Serverless GPU, Modal, Replicate, Cloudflare Workers AI, Lambda AI]
 duplicate_of: none
 source_trust_level: A
-confidence_score: 1.0
-tags: [ai, infrastructure, serverless, cloud-computing, faas, aws-lambda, Scalability, MLOps]
+confidence_score: 0.9
+verification_status: applied
+tags: [infrastructure, serverless, deployment, mlops]
 raw_sources: []
-last_reinforced: 2026-04-26
+last_reinforced: 2026-05-10
 github_commit: pending
-inferred_by: Claude Opus 4.7 (auto-normalize 2026-05-08)
 tech_stack:
-  language: unspecified
-  framework: unspecified
+  language: python
+  framework: Modal / Replicate / AWS Lambda / Cloudflare Workers AI
 ---

-# Serverless Computing for AI (AI를 위한 서버리스 컴퓨팅)
+# Serverless Computing for AI

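-## 📌 One-Line Insight (The Karpathy Summary)
-> "Hand the burden of server management to the cloud, focus only on model inference, and build the most economical and flexible intelligent infrastructure by paying only for what is invoked": a cloud computing model that runs AI models as units of code (functions) with no infrastructure setup or management, allocating resources automatically according to request volume.
+## One-Line Summary
+> **"Run GPU/CPU per request: scale to zero, accept cold starts, and stop worrying about capacity."** As of 2026, the dominant platforms are Modal (Python-native, fast cold start), Replicate (model marketplace), Cloudflare Workers AI (edge inference), AWS Lambda (CPU only, small models), Beam, and RunPod Serverless. Roughly 90% of serverless AI workloads are inference; for training, spot or dedicated instances are more economical.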

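-## 📖 Structured Knowledge (Synthesized Content)
- **Extracted pattern:** "Event-driven Inference and Pay-per-invocation": an efficiency-centered operating pattern that, instead of keeping servers running constantly, starts a container to run AI computation only when a specific event occurs (an API call, a data upload, and so on) and releases the resources immediately afterward.
- **Key features and advantages:**
-    - **No Server [[Management|Management]]:** No operational burden from patching, updates, or capacity planning.
-    - **Elastic Scalability:** Auto-scaling handles even thousands of concurrent requests.
-    - **Cost [[Efficiency|Efficiency]]:** Billed only for execution time and memory usage.
- **Limitations:**
-    - **Cold Start:** Initial latency when a function runs after a long idle period.
-    - **Execution Limits:** Caps on execution time and memory capacity.
- **Significance:** Lowers the barrier to entry, letting startups and individual developers reliably serve AI to users worldwide without large infrastructure investment.
+## Core Concepts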

-## ⚠️ Contradictions & Updates
- **Conflict with earlier data:** Serverless was initially seen as suitable only for lightweight web requests, but with GPU-backed serverless services and container-based serverless (Knative and others), heavy deep learning inference now also runs actively in serverless environments.
- **Policy change:** For agents' intermittent data preprocessing and batch analysis jobs, the Antigravity project prefers a serverless architecture to optimize cost.
+### Platform comparison
+- **Modal**: Python decorators, ~2-15s cold start, A100/H100/H200, per-second billing. Developer-friendly.
+- **Replicate**: model-first, Cog containers. Community model marketplace.
+- **Cloudflare Workers AI**: edge, millisecond-scale cold start, but a limited, curated model catalog.
+- **AWS Lambda**: up to 10 GB memory, 15-minute execution limit, no GPU. Suited to CPU inference of small models.
+- **Bedrock / Vertex / Azure OpenAI**: fully managed model APIs, the far end of serverless.
+- **RunPod Serverless / Beam**: lower price, longer cold start.

-## 🔗 Knowledge Links (Graph)
- [[Scalability-in-AI-Systems|Scalability-in-AI-Systems]], Cloud-Computing-Foundations, [[Service-oriented-Architecture|Service-oriented-Architecture]], [[Optimization-in-AI|Optimization-in-AI]]
- **Raw Source:** 10_Wiki/Topics/AI/Serverless-Computing-for-AI.md
+### Cold start
+- Model size dominates. A 7B model in fp16 is about 14 GB, and pulling that from S3 (or similar object storage) onto the GPU is slow.
+- Mitigation: keep-warm (min replicas=1), persistent volume, model preload, FP8/INT4 quantization.
+- 2026 tricks: snapshot-restore (CRIU-like) and prefetching via an NVMe-backed volume.
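
+A rough worked estimate of the load-time arithmetic behind the first bullet, assuming 2 bytes per fp16 parameter and a hypothetical sustained storage-to-GPU throughput $B$ (the 1 GB/s figure is an illustrative assumption, not a measured value):
+
+$$
+S = 7\times10^{9}\ \text{params} \times 2\ \tfrac{\text{bytes}}{\text{param}} = 14\ \text{GB},
+\qquad
+t_{\text{load}} \approx \frac{S}{B} = \frac{14\ \text{GB}}{1\ \text{GB/s}} = 14\ \text{s}
+$$
+
+Under the same assumption, INT4 quantization (about 0.5 bytes per parameter) shrinks $S$ to roughly 3.5 GB and $t_{\text{load}}$ to roughly 3.5 s, which is why quantization appears in the mitigation list.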

-## 🤖 LLM Usage Hints (How to Use This Knowledge)
+### Cost model
+- Per-second GPU billing. An H100 at ~$3-4/hr works out to about $0.001/sec.
+- Whether cold-start time is billed depends on the platform.
+- Bursty traffic: serverless wins. Steady high QPS: a dedicated instance wins.
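
+A back-of-the-envelope break-even for the last bullet, as a sketch: let $r_s$ be the serverless per-second rate, $r_d$ the dedicated per-second rate, $u$ the fraction of time the GPU is actually busy, and $T$ the billing period. All symbols and the example rates are illustrative assumptions, and cold-start billing is ignored.
+
+$$
+C_{\text{serverless}} = r_s \, u \, T, \qquad C_{\text{dedicated}} = r_d \, T, \qquad
+C_{\text{serverless}} < C_{\text{dedicated}} \iff u < \frac{r_d}{r_s}
+$$
+
+For example, if a hypothetical dedicated H100 costs \$2.50/hr and the serverless rate is \$3.60/hr (\$0.001/sec), serverless is cheaper only while utilization stays below $2.5 / 3.6 \approx 0.69$.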

-**When to use this knowledge:**
- *(TODO)*
+### Applications
+1. Image gen API (FLUX, SDXL on Replicate/Modal).
+2. Whisper batch transcription.
+3. Embedding service (BGE, E5).
+4. Webhook → LLM → DB pipeline.
+5. Background job (PDF extraction, OCR).
+6. Edge AI: Cloudflare Workers AI for low-latency global inference.

-**When not to use it:**
- *(TODO)*
+## 💻 Patterns

-## 🧪 Validation Status (Validation)
+### Modal — image gen endpoint
+```python
+import io
+
+import modal
+
+app = modal.App("flux-api")
+# Container image with the inference dependencies preinstalled.
+image = modal.Image.debian_slim().pip_install("torch", "diffusers", "transformers")

- **Information status:** needs_review
- **Source trust level:** A
- **Review reason:** *(P-Reinforce Phase 1 automatic normalization. Body needs verification.)*
+@app.cls(gpu="H100", image=image, scaledown_window=120)
+class FLUX:
+    @modal.enter()
+    def load(self):
+        # Runs once when a container starts, so the weights load only on cold start.
+        from diffusers import FluxPipeline
+        import torch
+        self.pipe = FluxPipeline.from_pretrained(
+            "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
+        ).to("cuda")

-## 🧬 Duplicate Check
+    @modal.method()
+    def generate(self, prompt: str) -> bytes:
+        img = self.pipe(prompt, num_inference_steps=28).images[0]
+        # Serialize the PIL image to PNG bytes so it can be returned to the caller.
+        buf = io.BytesIO()
+        img.save(buf, "PNG")
+        return buf.getvalue()

- **Existing similar documents:** *(TODO: see the indexer cluster report)*
- **Handling:** UPDATE (automatic normalization)
- **Reason:** Phase 1 normalization: fills in the old template and missing fields.
-
-## 🕓 Changelog
-
-| Date | Change | Handling | Trust |
-|------|-----------|-----------|--------|
-| 2026-05-08 | P-Reinforce Phase 1 normalization (frontmatter + header standardization) | UPDATE | A |
-
-## 💻 Code Patterns
-
-**Pattern 1:** *(TODO: structural skeleton reflecting this project's conventions)*
-
-```text
-# TODO
+@app.local_entrypoint()
+def main():
+    print(len(FLUX().generate.remote("a cat on mars")))
 ```

-## 🤔 Decision Criteria
+### Replicate — run a public model
+```python
+import replicate  # the client reads the REPLICATE_API_TOKEN environment variable
+out = replicate.run(
+    "black-forest-labs/flux-1.1-pro",
+    input={"prompt": "neon city", "aspect_ratio": "16:9"},
+)
+print(out)  # a FileOutput object in recent client versions, a URL string in older ones
+```

-**When to choose option A:**
- *(TODO)*
+### Cloudflare Workers AI
+```javascript
+export default {
+  async fetch(req, env) {
+    // env.AI is the Workers AI binding declared in the Wrangler configuration.
+    const ai = env.AI;
+    const r = await ai.run("@cf/meta/llama-3.1-8b-instruct", {
+      prompt: "Hello"
+    });
+    return Response.json(r);
+  }
+}
+```

-**When to choose option B:**
- *(TODO)*
+### AWS Lambda — small CPU LLM (llama.cpp)
+```python
+import json
+
+from llama_cpp import Llama
+
+# Load the model once per execution environment so warm invocations reuse it.
+llm = Llama(model_path="/opt/ml/qwen2-0.5b.gguf", n_ctx=2048)
+
+
+def handler(event, ctx):
+    out = llm(event["prompt"], max_tokens=128)
+    return {"statusCode": 200, "body": json.dumps(out)}
+```

-**Default:**
-> *(TODO)*
+### Modal scheduled batch
+```python
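+# Reuses `modal` and `app` from the image gen example; fetch_new_docs, embed_model, and upsert are placeholders.
+# The cron expression runs the job daily at 02:00.
+@app.function(schedule=modal.Cron("0 2 * * *"), gpu="A10G")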
+def nightly_embed():
+    docs = fetch_new_docs()
+    embs = embed_model(docs)
+    upsert(embs)
+```

-## ❌ Anti-Patterns
+### Keep-warm pattern
+```python
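+# min_containers=1 keeps one container warm; scaledown_window=600 keeps idle extras for 10 minutes.
+@app.cls(gpu="A100", min_containers=1, scaledown_window=600)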
+class WarmModel: ...
+```

- **[Anti-pattern]:** *(TODO: what not to do + why + what to do instead)*
+### Streaming response (Modal + FastAPI)
+```python
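+# `llm` is a placeholder for a model client with a token-streaming method.
+# The container image is assumed to include fastapi.
+@app.function(gpu="A100")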
+@modal.fastapi_endpoint(method="POST")
+def stream(req: dict):
+    from fastapi.responses import StreamingResponse
+    def gen():
+        for tok in llm.stream(req["prompt"]):
+            yield tok
+    return StreamingResponse(gen(), media_type="text/plain")
+```
+
+## Decision criteria
+| Situation | Approach |
+|---|---|
+| Python ML researcher, fast iter | Modal |
+| Public model, no infra | Replicate |
+| Global low-latency edge | Cloudflare Workers AI |
+| Use OpenAI / Claude / Gemini API | Already serverless: call the API directly |
+| Steady high QPS | Dedicated GPU (vLLM on K8s) |
+| CPU-only small model | AWS Lambda + llama.cpp |
+| Tightest price | RunPod Serverless / Beam |
+
+**Default**: Modal (Python-first dev), API providers (Claude/OpenAI) for foundation models.
+
+## 🔗 Graph
+- Parent: [[Cloud-Computing]] · [[MLOps]]
+- Variants: [[Modal]] · [[Replicate]] · [[Cloudflare-Workers-AI]] · [[AWS-Lambda]]
+- Applications: [[Inference-Serving]] · [[Batch-Processing]] · [[Edge-AI]]
+- Adjacent: [[vLLM]] · [[Kubernetes]] · [[Container-Images]]
+
+## 🤖 LLM usage
+**When to use**: traffic is bursty, models are swapped often, dev velocity matters, or the ops team is small.
+**When not to use**: steady very-high QPS (cost), strict latency SLA (cold start), data residency constraints.
+
+## ❌ Anti-patterns
+- **Ignoring cold start**: a 5-30s first response on a user-facing path is bad UX. Use keep-warm or prefetch.
+- **Big container image**: a 50GB image makes every pull slow. Use layer caching and a persistent volume.
+- **Long-running training on serverless**: it runs into execution-time caps and is expensive. Use spot instances.
+- **No timeout**: a runaway request can blow up the bill.
+- **State on local disk**: local disk is ephemeral. Use S3 or a Volume.
+
+## 🧪 Validation / Duplicates
+- Verified (Modal docs, Replicate docs, Cloudflare AI docs, AWS Lambda docs 2025-2026).
+- Trust level A.
+
+## 🕓 Changelog
+| Date | Change |
+|---|---|
+| 2026-05-08 | Phase 1 |
+| 2026-05-10 | Manual cleanup — Modal/Replicate/CF/Lambda 2026 comparison |