---
id: wiki-2026-0508-scalability-in-ai-systems
title: Scalability in AI Systems
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [AI Scalability, Distributed Training, LLM Scaling]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [scalability, distributed-training, inference, vllm, fsdp, llm]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: PyTorch / vLLM / DeepSpeed
---

# Scalability in AI Systems

## One-liner

> **"Satisfy four axes at once (model size × data × users × latency) along two branches: scaling training from a single GPU to multi-node (FSDP/ZeRO/TP/PP), and serving inference at 10K+ QPS (vLLM/SGLang/PagedAttention)."** From GPT-3 (175B) in 2020 to the trillion-parameter-class MoE models of 2026 (Llama 4 Behemoth, Claude Opus 4.7, GPT-5), scaling laws (Chinchilla, Hoffmann et al. 2022) have served as the industry's compass.

## Core concepts

### Training scalability axes

- **Data parallel (DP)**: replicate the model and shard the batch; gradients are synchronized with all-reduce.
- **Tensor parallel (TP)**: split a single layer across GPUs (Megatron-LM).
- **Pipeline parallel (PP)**: split layers across stages (GPipe, 1F1B).
- **FSDP / ZeRO**: shard optimizer state, gradients, and parameters (ZeRO-1/2/3 add them in that order).
- **Sequence/Context parallel**: shard along the sequence dimension (Ring Attention, DeepSpeed-Ulysses).
- **MoE expert parallel**: place experts on different GPUs and route each token to its selected experts.

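To make the ZeRO stages concrete, the sketch below estimates per-GPU memory for model states only. It assumes the mixed-precision Adam accounting used in the ZeRO paper: 2 bytes of parameters, 2 bytes of gradients, and 12 bytes of fp32 optimizer state per parameter. The function name `zero_model_state_gib` is illustrative, and activations, communication buffers, and fragmentation are deliberately ignored.

```python
def zero_model_state_gib(num_params: float, num_gpus: int, stage: int) -> float:
    """Per-GPU memory (GiB) for params, grads, and Adam state under ZeRO.

    Assumption: mixed-precision Adam, i.e. 2 B params + 2 B grads + 12 B
    optimizer state (fp32 master weights, momentum, variance) per parameter.
    Activations and temporary buffers are not counted.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    params, grads, optim = 2.0, 2.0, 12.0  # bytes per parameter
    if stage >= 1:
        optim /= num_gpus   # ZeRO-1: shard optimizer state
    if stage >= 2:
        grads /= num_gpus   # ZeRO-2: also shard gradients
    if stage >= 3:
        params /= num_gpus  # ZeRO-3 / FSDP: also shard parameters
    return num_params * (params + grads + optim) / 2**30


if __name__ == "__main__":
    for stage in range(4):
        gib = zero_model_state_gib(70e9, num_gpus=64, stage=stage)
        print(f"ZeRO-{stage}: {gib:,.1f} GiB per GPU")
```

Under these assumptions, 70e9 parameters on 64 GPUs come to roughly 16 GiB of model state per GPU at stage 3, which is why full sharding is the usual starting point at that scale.
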
### Inference scalability

- **Continuous batching**: vLLM / TGI schedule at the token (iteration) level, so a finished sequence frees its slot immediately and there is no head-of-line blocking.
- **PagedAttention**: the KV cache is paged like virtual memory, which cuts fragmentation and allows high concurrency.
- **Prefix caching**: the KV cache for a shared system prompt is computed once and reused (vLLM, SGLang).
- **Speculative decoding**: a small draft model proposes tokens and the large model verifies them, typically a 2-3x speedup.
- **Quantization**: FP8, INT4 (AWQ, GPTQ), MX formats (MXFP4); FP8 is native on Hopper (H100/H200), and Blackwell (B200) adds native FP4 formats.
- **Disaggregated serving**: separate prefill nodes from decode nodes (Splitwise, DistServe).

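The "paged like virtual memory" idea can be illustrated with a toy block-table allocator. This is a minimal sketch of the bookkeeping only, not vLLM's implementation: the class `PagedKVAllocator`, its methods, and the default block size of 16 tokens are all hypothetical names and values chosen for illustration.

```python
class PagedKVAllocator:
    """Toy block-table allocator: maps each sequence to fixed-size KV blocks."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}  # seq_id -> physical block ids
        self.seq_lens: dict[str, int] = {}            # seq_id -> tokens stored

    def append_token(self, seq_id: str) -> bool:
        """Reserve room for one more token; return False if no block is free."""
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:  # existing blocks are full (or none yet)
            if not self.free_blocks:
                return False  # caller should queue or preempt the sequence
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return True

    def release(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

Because memory is claimed one block at a time, a sequence wastes at most `block_size - 1` token slots, and the `False` return is the natural hook for admission control.
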
### Applications

1. Pretraining 70B+ models on a 256+ GPU cluster.
2. LoRA/QLoRA fine-tuning on a single H100 (24GB shard).
3. Chatbot serving at 100K+ QPS with a vLLM cluster.
4. RAG over 1M+ documents with a sharded vector DB.
5. Multi-tenant inference with tenant-aware caching.

## 💻 Patterns

### FSDP2 with PyTorch (2026 idiom)

```python
import torch
from torch.distributed.fsdp import fully_shard, MixedPrecisionPolicy
from transformers import LlamaForCausalLM

# Assumes torch.distributed is already initialized (e.g. launched with torchrun).
model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-3.1-70B")
mp_policy = MixedPrecisionPolicy(
    param_dtype=torch.bfloat16,   # compute and all-gather in bf16
    reduce_dtype=torch.float32,   # reduce gradients in fp32
)
# Shard each decoder layer first, then the root module for the remaining params.
for layer in model.model.layers:
    fully_shard(layer, mp_policy=mp_policy)
fully_shard(model, mp_policy=mp_policy)

# Gradient checkpointing for memory
model.gradient_checkpointing_enable()
```

### DeepSpeed ZeRO-3 config

```json
{
  "train_batch_size": 1024,
  "gradient_accumulation_steps": 4,
  "bf16": {"enabled": true},
  "zero_optimization": {
    "stage": 3,
    "offload_param": {"device": "cpu"},
    "offload_optimizer": {"device": "nvme", "nvme_path": "/mnt/nvme"},
    "overlap_comm": true,
    "contiguous_gradients": true
  }
}
```

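The config above leaves the per-GPU micro-batch implicit. DeepSpeed requires `train_batch_size` to equal micro-batch per GPU × gradient accumulation steps × number of data-parallel GPUs, so the missing value can be derived. The helper below is a small sketch of that arithmetic; the function name `micro_batch_per_gpu` is illustrative, and `world_size` is assumed to be the number of GPUs participating in data parallelism.

```python
def micro_batch_per_gpu(train_batch_size: int, grad_accum_steps: int, world_size: int) -> int:
    """Solve train_batch_size = micro_batch * grad_accum_steps * world_size."""
    denom = grad_accum_steps * world_size
    if train_batch_size % denom != 0:
        raise ValueError(
            f"train_batch_size={train_batch_size} is not divisible by "
            f"grad_accum_steps * world_size = {denom}"
        )
    return train_batch_size // denom


if __name__ == "__main__":
    # With the config above on 64 GPUs, each GPU processes 4 samples per micro-step.
    print(micro_batch_per_gpu(1024, grad_accum_steps=4, world_size=64))
```

Running this check before launch catches GPU counts that cannot satisfy the configured batch size, rather than discovering the mismatch at engine initialization.
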
### vLLM serving (continuous batching + prefix cache)

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,        # shard weights across 4 GPUs
    enable_prefix_caching=True,    # reuse KV for shared prompt prefixes
    max_model_len=128000,
    quantization="fp8",
    gpu_memory_utilization=0.92,
)
prompts = ["Explain PagedAttention in one sentence."]  # placeholder batch
out = llm.generate(prompts, SamplingParams(max_tokens=512, temperature=0.7))
```

### SGLang RadixAttention (shared prefix tree)

```python
import sglang as sgl

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

@sgl.function
def multi_turn(s, system, turns):
    s += sgl.system(system)  # cached across all calls
    for t in turns:
        s += sgl.user(t["q"]) + sgl.assistant(sgl.gen(max_tokens=256))

# High throughput when many users share the same system prompt
```

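### Speculative decoding (vLLM)

```python
from vllm import LLM

# Argument names follow older vLLM releases; newer ones take a
# `speculative_config` dict instead of these keyword arguments.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    speculative_model="meta-llama/Llama-3.2-1B-Instruct",  # small draft model
    num_speculative_tokens=5,                              # draft tokens per step
    use_v2_block_manager=True,
)
```
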
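How much speculative decoding helps depends on how often the target model accepts the draft's tokens. A common back-of-envelope model, following the analysis in the speculative decoding literature, treats each draft token as accepted independently with probability `acceptance_rate`; the expected number of tokens produced per target-model pass is then a geometric sum. The function below is a sketch under that independence assumption, with an illustrative name, and it ignores the cost of running the draft model.

```python
def expected_tokens_per_step(acceptance_rate: float, num_draft_tokens: int) -> float:
    """Expected tokens emitted per target-model verification pass.

    Assumes each draft token is accepted independently with the same
    probability; the pass always yields at least one token.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if num_draft_tokens < 0:
        raise ValueError("num_draft_tokens must be non-negative")
    if acceptance_rate == 1.0:
        return float(num_draft_tokens + 1)
    # Geometric series: 1 + a + a^2 + ... + a^k
    return (1.0 - acceptance_rate ** (num_draft_tokens + 1)) / (1.0 - acceptance_rate)
```

At an acceptance rate of 0.8 with 5 draft tokens this gives about 3.7 tokens per pass; after subtracting draft-model overhead, that is in line with the 2-3x range commonly quoted.
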
### Megatron-LM tensor + pipeline parallel

```bash
# 8 nodes x 8 GPUs = 64 ranks = TP 8 x PP 8, so the data-parallel size is 1.
# Rendezvous, data, and tokenizer arguments are omitted.
torchrun --nproc_per_node=8 --nnodes=8 \
    pretrain_gpt.py \
    --tensor-model-parallel-size 8 \
    --pipeline-model-parallel-size 8 \
    --num-layers 80 --hidden-size 8192 \
    --num-attention-heads 64 --seq-length 8192 \
    --micro-batch-size 1 --global-batch-size 1024 \
    --bf16 --use-flash-attn
```

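The launch flags above have to be mutually consistent: the world size must factor into TP × PP × DP, and the global batch must divide evenly into micro-batches per data-parallel replica. The sketch below checks that arithmetic before a job is submitted; `parallel_layout` is an illustrative helper, not part of Megatron-LM.

```python
def parallel_layout(world_size: int, tp: int, pp: int,
                    micro_batch: int, global_batch: int) -> dict:
    """Derive data-parallel size and micro-batches per step from launch flags."""
    model_parallel = tp * pp
    if world_size % model_parallel != 0:
        raise ValueError(f"world_size={world_size} is not divisible by TP*PP={model_parallel}")
    dp = world_size // model_parallel
    samples_per_micro_step = micro_batch * dp
    if global_batch % samples_per_micro_step != 0:
        raise ValueError(
            f"global_batch={global_batch} is not divisible by "
            f"micro_batch*DP={samples_per_micro_step}"
        )
    return {"dp": dp, "num_microbatches": global_batch // samples_per_micro_step}


if __name__ == "__main__":
    # Flags from the torchrun command above: 8 nodes x 8 GPUs, TP=8, PP=8.
    print(parallel_layout(world_size=64, tp=8, pp=8, micro_batch=1, global_batch=1024))
```

For the command above this yields a data-parallel size of 1 and 1024 micro-batches per step, so all scaling in that layout comes from model parallelism.
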
### Distributed inference on K8s (vLLM + Ray Serve)

```python
from ray import serve
from vllm import AsyncEngineArgs, AsyncLLMEngine
# Assumes a vLLM release that exposes a module-level FastAPI `app` here;
# newer releases build the app through a factory function instead.
from vllm.entrypoints.openai.api_server import app as vllm_app

@serve.deployment(num_replicas=4, ray_actor_options={"num_gpus": 2})
@serve.ingress(vllm_app)
class LLMService:
    def __init__(self, model):
        # Each replica owns one engine sharded across its 2 GPUs.
        self.engine = AsyncLLMEngine.from_engine_args(
            AsyncEngineArgs(model=model, tensor_parallel_size=2)
        )
```

### KV cache sizing rule of thumb

```python
def kv_cache_bytes(layers, hidden_dim, num_heads, seq_len, batch, dtype_bytes=2, kv_heads=None):
    head_dim = hidden_dim // num_heads
    kv_dim = (kv_heads or num_heads) * head_dim  # GQA: kv_heads < num_heads
    return 2 * batch * layers * seq_len * kv_dim * dtype_bytes  # 2 = keys + values

# Llama-3-70B (80 layers, hidden 8192, 64 heads, GQA with 8 KV heads) at 128K context,
# batch=1, bf16: 320 KiB per token, about 40 GiB of KV cache
```

### Disaggregated prefill/decode (Splitwise pattern)

```python
# Prefill cluster: H100, optimized for compute
# Decode cluster: H100/L40S, bound by memory bandwidth rather than compute
# KV transfer over NVLink/RDMA between clusters
# prefill_client / decode_client are hypothetical RPC stubs, not a real library API.
async def generate(prompt, prefill_client, decode_client):
    prefill_resp = await prefill_client.prefill(prompt)  # returns a reference to KV blocks
    async for token in decode_client.decode(kv_ref=prefill_resp.kv_ref, max_tokens=512):
        yield token
```

## Decision criteria

| Situation | Approach |
|---|---|
| <13B model training | DDP / FSDP1 on a single node with 8 GPUs |
| 70B training | FSDP2 / ZeRO-3 multi-node + activation checkpointing |
| 400B+ MoE | TP + PP + EP + ZeRO-1 (Megatron-Core / NeMo) |
| Chatbot serving | vLLM + prefix cache + speculative decoding |
| Long context (1M+) | Ring attention / context parallel |
| High concurrency | SGLang RadixAttention + disaggregated serving |

**Default**: FSDP2 for training under 100B; vLLM with FP8 + prefix cache + speculative decoding for serving.

## 🔗 Graph

- Parent: [[Distributed-Systems]]
- Variant: [[Pipeline-Parallelism]]
- Application: [[Fine-tuning]]
- Adjacent: [[LLM_Optimization_and_Deployment_Strategies|vLLM]] · [[Flash-Attention]] · [[LLM_Optimization_and_Deployment_Strategies|Quantization]]

## 🤖 LLM usage

**When**: capacity-planning estimates, config-file generation (DeepSpeed, Accelerate), debugging OOMs via log triage, back-of-envelope scaling-law calculations.
**When not**: actual kernel scheduling; the runtime (PyTorch/vLLM) handles that deterministically.

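A scaling-law back-of-envelope of the kind mentioned above can be sketched in a few lines. It relies on two widely used approximations that are assumptions here, not exact laws: training compute of about 6 FLOPs per parameter per token, and the Chinchilla rule of thumb of roughly 20 training tokens per parameter. The function names, the default `mfu` of 0.4, and any peak-FLOPs figure passed in are illustrative inputs to be replaced with measured values.

```python
def chinchilla_estimate(num_params: float, tokens_per_param: float = 20.0):
    """Return (training tokens, training FLOPs) for a compute-optimal run.

    Assumptions: ~20 tokens per parameter and ~6 FLOPs per parameter per token.
    """
    tokens = tokens_per_param * num_params
    flops = 6.0 * num_params * tokens
    return tokens, flops


def training_days(flops: float, num_gpus: int, peak_flops_per_gpu: float,
                  mfu: float = 0.4) -> float:
    """Wall-clock days at a given model FLOPs utilization (mfu is an assumed input)."""
    sustained = num_gpus * peak_flops_per_gpu * mfu
    return flops / sustained / 86_400  # seconds per day


if __name__ == "__main__":
    tokens, flops = chinchilla_estimate(70e9)
    print(f"tokens: {tokens:.2e}, FLOPs: {flops:.2e}")
```

For a 70B-parameter model this gives about 1.4 trillion tokens and about 5.9e23 FLOPs; feeding that into `training_days` with a cluster size and a measured utilization turns it into a schedule estimate.
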
## ❌ Anti-patterns

- **All-reduce on tiny gradients**: communication latency dominates; bucket gradients into larger messages.
- **Naive batching for serving**: causes head-of-line blocking; use continuous batching.
- **Unbounded KV cache**: OOM at peak load; use PagedAttention plus admission control.
- **No prefix cache for shared system prompt**: can leave 5-10x throughput on the table.
- **PP without enough microbatches**: the pipeline bubble dominates.
- **MoE without expert balance loss**: dead experts and wasted capacity.

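The pipeline-bubble anti-pattern can be quantified. For GPipe-style and non-interleaved 1F1B schedules, with `p` stages and `m` micro-batches of equal cost, the idle share of a training step is (p - 1) / (m + p - 1). The helper below is a minimal sketch of that formula; the function name is illustrative, and interleaved schedules, which shrink the bubble further, are not modeled.

```python
def pipeline_bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idle fraction of a step for GPipe / non-interleaved 1F1B schedules.

    Assumes every micro-batch takes the same time on every stage.
    """
    if num_stages < 1 or num_microbatches < 1:
        raise ValueError("num_stages and num_microbatches must be >= 1")
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


if __name__ == "__main__":
    for m in (8, 64, 1024):
        print(f"8 stages, {m} micro-batches: {pipeline_bubble_fraction(8, m):.1%} idle")
```

With 8 stages and only 8 micro-batches, 7/15 of the step (about 47%) is idle; at 1024 micro-batches the bubble drops below 1%.
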
## 🧪 Verification / duplicates

- Verified against Megatron-LM/Megatron-Core, the vLLM paper (SOSP 2023), Chinchilla (2022), and NVIDIA H100/B200 performance guides (2026).
- Trust level: A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: FSDP/ZeRO/TP/PP, vLLM/SGLang, disaggregated serving |