---
id: wiki-2026-0508-scalability-in-ai-systems
title: Scalability in AI Systems
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [AI Scalability, Distributed Training, LLM Scaling]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [scalability, distributed-training, inference, vllm, fsdp, llm]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: PyTorch / vLLM / DeepSpeed
---
# Scalability in AI Systems
## One-line summary
> **"Satisfy four axes at once (model size × data × users × latency). The work splits into two branches: single-GPU → multi-node training (FSDP/ZeRO/TP/PP) and 10K+ QPS inference (vLLM/SGLang/PagedAttention)."** From GPT-3 (175B) in 2020 to the 2026 frontier models described as trillion-parameter-scale MoE (Llama 4 Behemoth, Claude Opus 4.7, GPT-5), scaling laws (Chinchilla, Hoffmann et al. 2022) have served as the industry's compass.
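
The scaling-law compass can be turned into a quick estimate. The sketch below assumes the common approximation of about 6 FLOPs per parameter per training token and the roughly 20 tokens per parameter ratio usually read off the Chinchilla results; both constants are rules of thumb, and the function name is illustrative.

```python
def chinchilla_estimate(params: float, tokens_per_param: float = 20.0) -> dict:
    """Back-of-envelope compute-optimal training budget.

    Assumptions (rules of thumb, not exact fits):
      - compute-optimal tokens ~= tokens_per_param * params
      - training FLOPs ~= 6 * params * tokens
    """
    tokens = tokens_per_param * params
    train_flops = 6.0 * params * tokens
    return {"params": params, "tokens": tokens, "train_flops": train_flops}


est = chinchilla_estimate(70e9)
print(f"{est['tokens']:.2e} tokens, {est['train_flops']:.2e} training FLOPs")
```

For a 70B model this gives about 1.4T tokens and roughly 5.9e23 FLOPs, which is only a starting point for capacity planning.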
## Core concepts
### Training scalability axes
- **Data parallel (DP)**: replicate model, shard batch — gradient all-reduce.
- **Tensor parallel (TP)**: split single layer across GPUs (Megatron-LM).
- **Pipeline parallel (PP)**: split layers across stages (GPipe, 1F1B).
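- **FSDP / ZeRO**: shard model state across data-parallel ranks. ZeRO-1 shards optimizer state, ZeRO-2 adds gradients, ZeRO-3 adds parameters (FSDP corresponds to ZeRO-3).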
- **Sequence/Context parallel**: shard along sequence (Ring Attention, DeepSpeed-Ulysses).
- **MoE expert parallel**: route tokens to expert subsets across GPUs.
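
To see why sharding model state matters, here is a minimal sketch of per-GPU memory under each ZeRO stage. It assumes the mixed-precision Adam accounting from the ZeRO paper (2 bytes of bf16 parameters, 2 bytes of bf16 gradients, 12 bytes of fp32 optimizer state per parameter) and ignores activations, buffers, and fragmentation. The function name is illustrative.

```python
def zero_state_gib(params: float, world_size: int, stage: int) -> float:
    """Per-GPU model-state memory in GiB for mixed-precision Adam.

    Assumed per-parameter cost: 2 B params + 2 B grads + 12 B optimizer state.
    Stage 0 is plain data parallel (everything replicated).
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    p_bytes, g_bytes, o_bytes = 2.0 * params, 2.0 * params, 12.0 * params
    if stage >= 1:          # ZeRO-1: shard optimizer state
        o_bytes /= world_size
    if stage >= 2:          # ZeRO-2: also shard gradients
        g_bytes /= world_size
    if stage >= 3:          # ZeRO-3 / FSDP: also shard parameters
        p_bytes /= world_size
    return (p_bytes + g_bytes + o_bytes) / 2**30


for stage in range(4):
    print(f"stage {stage}: {zero_state_gib(70e9, 64, stage):.1f} GiB per GPU")
```

With 70B parameters on 64 GPUs, only stage 3 brings the model state under the capacity of a single 80 GB device, which is why the larger training rows in this note lean on FSDP2 or ZeRO-3.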
### Inference scalability
- **Continuous batching**: vLLM / TGI — token-level scheduling, no head-of-line block.
- **PagedAttention**: KV cache paged like virtual memory → high concurrency.
- **Prefix caching**: shared system prompt cache (vLLM, SGLang).
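- **Speculative decoding**: a small draft model proposes tokens and the large model verifies them in one pass; typically 2-3x faster decoding when most drafts are accepted.
- **Quantization**: FP8, INT4 (AWQ, GPTQ), MX formats (MXFP4). FP8 is native on Hopper (H100/H200); FP4/MXFP4 is native on Blackwell (B200).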
- **Disaggregated serving**: prefill nodes vs decode nodes (Splitwise, DistServe).
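
The "paged like virtual memory" idea behind PagedAttention can be illustrated with a toy block table. This is a simplified sketch, not vLLM's implementation: the class name, the block size of 16 tokens, and the return-`False`-on-exhaustion policy are all assumptions made for illustration.

```python
class PagedKVAllocator:
    """Toy block table: each sequence maps to a list of fixed-size KV blocks."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(num_blocks))        # physical block ids
        self.tables: dict[str, list[int]] = {}     # seq id -> block ids
        self.lengths: dict[str, int] = {}          # seq id -> tokens stored

    def append_token(self, seq_id: str) -> bool:
        """Reserve room for one more token; False means the caller must queue or preempt."""
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:               # first token or last block is full
            if not self.free:
                return False
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1
        return True

    def release(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free list."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


alloc = PagedKVAllocator(num_blocks=4, block_size=16)
for _ in range(17):
    alloc.append_token("a")                        # 17 tokens occupy 2 blocks
print(alloc.tables["a"], len(alloc.free))
```

Because memory is reserved one block at a time instead of one maximum-length slab per request, waste is bounded by a single partially filled block per sequence, and a `False` return is a natural hook for admission control.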
### Applications
1. Pretraining 70B+ on 256+ GPU cluster.
2. LoRA/QLoRA fine-tune on single H100 (24GB shard).
3. 100K+ concurrent chatbot QPS via vLLM cluster.
4. RAG at 1M+ docs with sharded vector DB.
5. Multi-tenant inference with tenant-aware caching.
## 💻 Patterns
### FSDP2 with PyTorch (2026 idiom)
```python
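import torch
from torch.distributed.fsdp import fully_shard, MixedPrecisionPolicy
from transformers import LlamaForCausalLM

# Launch with torchrun and initialize the process group before sharding.
model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-3.1-70B")

# bf16 parameters for compute, fp32 for gradient reduction
mp_policy = MixedPrecisionPolicy(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.float32,
)

# Shard each decoder layer first, then the root module
for layer in model.model.layers:
    fully_shard(layer, mp_policy=mp_policy)
fully_shard(model, mp_policy=mp_policy)

# Gradient checkpointing trades recompute for activation memory
model.gradient_checkpointing_enable()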
```
### DeepSpeed ZeRO-3 config
```json
{
  "train_batch_size": 1024,
  "gradient_accumulation_steps": 4,
  "bf16": {"enabled": true},
  "zero_optimization": {
    "stage": 3,
    "offload_param": {"device": "cpu"},
    "offload_optimizer": {"device": "nvme", "nvme_path": "/mnt/nvme"},
    "overlap_comm": true,
    "contiguous_gradients": true
  }
}
```
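The config above fixes `train_batch_size` and `gradient_accumulation_steps` but not the per-GPU micro-batch. DeepSpeed requires `train_batch_size = micro_batch_per_gpu × gradient_accumulation_steps × data-parallel world size`, so the remaining value can be derived. A minimal sketch of that check follows; the helper name and the 64-GPU figure are assumptions for illustration.

```python
def micro_batch_per_gpu(train_batch_size: int, grad_accum_steps: int, dp_world_size: int) -> int:
    """Derive the per-GPU micro-batch implied by a DeepSpeed batch config.

    Raises ValueError when the three values cannot be made consistent.
    """
    denom = grad_accum_steps * dp_world_size
    if train_batch_size % denom != 0:
        raise ValueError(
            f"train_batch_size={train_batch_size} is not divisible by "
            f"grad_accum_steps * dp_world_size = {denom}"
        )
    return train_batch_size // denom


# Assumed cluster: 64 data-parallel ranks
print(micro_batch_per_gpu(1024, 4, 64))
```

Running this before a launch catches a mismatched world size early instead of at DeepSpeed initialization.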
### vLLM serving (continuous batching + prefix cache)
```python
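from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,        # shard weights across 4 GPUs
    enable_prefix_caching=True,    # reuse KV blocks for shared prompt prefixes
    max_model_len=128000,
    quantization="fp8",
    gpu_memory_utilization=0.92,
)

# Placeholder prompts; the engine applies continuous batching internally
prompts = ["Explain PagedAttention in one sentence."]
outputs = llm.generate(prompts, SamplingParams(max_tokens=512, temperature=0.7))
for out in outputs:
    print(out.outputs[0].text)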
```
### SGLang RadixAttention (shared prefix tree)
```python
import sglang as sgl

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

@sgl.function
def multi_turn(s, system, turns):
    s += sgl.system(system)  # shared prefix, cached in the radix tree across calls
    for t in turns:
        s += sgl.user(t["q"])
        s += sgl.assistant(sgl.gen(max_tokens=256))

# Throughput improves most when many users share the same system prompt
```
### Speculative decoding (vLLM)
```python
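from vllm import LLM

# Recent vLLM releases take a speculative_config dict; older releases used
# speculative_model= and num_speculative_tokens= keyword arguments instead.
# Check the documentation of the installed version.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    speculative_config={
        "model": "meta-llama/Llama-3.2-1B-Instruct",  # small draft model
        "num_speculative_tokens": 5,
    },
)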
```
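How much speculative decoding helps depends on how often drafts are accepted. The sketch below uses the standard analysis from the speculative decoding literature, assuming each draft token is accepted independently with the same probability. `draft_cost_ratio` (draft step time divided by target step time) is an assumed input that has to be measured for a given model pair.

```python
def expected_tokens_per_step(accept_rate: float, num_draft: int) -> float:
    """Expected tokens produced per target-model verification pass.

    Assumes i.i.d. acceptance with probability accept_rate for each draft token.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if accept_rate == 1.0:
        return float(num_draft + 1)
    return (1.0 - accept_rate ** (num_draft + 1)) / (1.0 - accept_rate)


def estimated_speedup(accept_rate: float, num_draft: int, draft_cost_ratio: float) -> float:
    """Rough wall-clock speedup once the cost of running the draft model is included."""
    return expected_tokens_per_step(accept_rate, num_draft) / (num_draft * draft_cost_ratio + 1.0)


print(round(expected_tokens_per_step(0.8, 5), 2))
print(round(estimated_speedup(0.8, 5, 0.05), 2))
```

With an 80% acceptance rate and 5 draft tokens this lands near 3x, in line with the 2-3x range quoted earlier. At low acceptance rates the estimate drops below 1x, which is the case where speculation should be turned off.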
### Megatron-LM tensor + pipeline parallel
```bash
# 8 nodes x 8 GPUs = 64 ranks = TP 8 x PP 8 (data-parallel size 1).
# Rendezvous, data, and tokenizer arguments are omitted here.
torchrun --nproc_per_node=8 --nnodes=8 \
  pretrain_gpt.py \
  --tensor-model-parallel-size 8 \
  --pipeline-model-parallel-size 8 \
  --num-layers 80 --hidden-size 8192 \
  --num-attention-heads 64 --seq-length 8192 \
  --micro-batch-size 1 --global-batch-size 1024 \
  --bf16 --use-flash-attn
```
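The launch flags above have to be mutually consistent: the world size must factor into TP × PP × DP, and the global batch must divide evenly into micro-batches per data-parallel replica. A small sanity check, with an illustrative function name, might look like this:

```python
def parallel_layout(nnodes: int, gpus_per_node: int, tp: int, pp: int,
                    micro_batch: int, global_batch: int) -> dict:
    """Derive data-parallel size and micro-batches per step from launch flags."""
    world = nnodes * gpus_per_node
    if world % (tp * pp) != 0:
        raise ValueError("world size must be divisible by TP * PP")
    dp = world // (tp * pp)
    if global_batch % (micro_batch * dp) != 0:
        raise ValueError("global batch must be divisible by micro_batch * DP")
    return {
        "world_size": world,
        "dp": dp,
        "microbatches_per_step": global_batch // (micro_batch * dp),
    }


layout = parallel_layout(8, 8, tp=8, pp=8, micro_batch=1, global_batch=1024)
print(layout)
```

For the command above this yields a data-parallel size of 1 and 1024 micro-batches per step. Keeping the tensor-parallel size at or below the GPUs per node keeps that traffic on the intra-node interconnect.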
### Distributed inference K8s (vLLM + Ray Serve)
```python
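import uuid
from fastapi import FastAPI
from ray import serve
from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

app = FastAPI()  # minimal app; this sketch does not reuse vLLM's OpenAI server

@serve.deployment(num_replicas=4, ray_actor_options={"num_gpus": 2})
@serve.ingress(app)
class LLMService:
    def __init__(self, model):
        self.engine = AsyncLLMEngine.from_engine_args(
            AsyncEngineArgs(model=model, tensor_parallel_size=2)
        )

    @app.post("/generate")
    async def generate(self, prompt: str):
        params, text = SamplingParams(max_tokens=256), ""
        async for out in self.engine.generate(prompt, params, str(uuid.uuid4())):
            text = out.outputs[0].text  # keep the latest cumulative output
        return {"text": text}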
```
### KV cache sizing rule of thumb
```python
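def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    # Factor 2 covers keys and values; with GQA, kv_heads < attention heads
    return 2 * batch * layers * seq_len * kv_heads * head_dim * dtype_bytes

# Llama-3-70B (80 layers, 8 KV heads, head_dim 128) at 128K context, batch=1, bf16:
# kv_cache_bytes(80, 8, 128, 131072, 1) = 42,949,672,960 bytes, about 40 GiB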
```
### Disaggregated prefill/decode (Splitwise pattern)
```python
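# Prefill pool: H100, compute-bound
# Decode pool: H100/L40S, memory-bandwidth-bound
# KV blocks move over NVLink within a node or RDMA across nodes
# prefill_client / decode_client are hypothetical RPC stubs, not a real library API
async def generate(prompt, prefill_client, decode_client):
    prefill_resp = await prefill_client.prefill(prompt)  # returns a reference to KV blocks
    async for token in decode_client.decode(kv_ref=prefill_resp.kv_ref, max_tokens=512):
        yield token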
```
## Decision criteria
| Situation | Approach |
|---|---|
| <13B model train | DDP / FSDP1 single node 8xGPU |
| 70B train | FSDP2 / ZeRO-3 multi-node + activation checkpoint |
| 400B+ MoE | TP + PP + EP + ZeRO-1 (Megatron-Core / NeMo) |
| Chatbot serving | vLLM + prefix cache + speculative decoding |
| Long context (1M+) | Ring attention / context parallel |
| High concurrency | SGLang RadixAttention + disaggregated |
**Default**: FSDP2 for training under 100B; vLLM with FP8 + prefix cache + speculative decoding for serving.
## 🔗 Graph
- Parent: [[Distributed-Systems]]
- Variant: [[Pipeline-Parallelism]]
- Application: [[Fine-tuning]]
- Adjacent: [[LLM_Optimization_and_Deployment_Strategies|vLLM]] · [[Flash Attention]] · [[LLM_Optimization_and_Deployment_Strategies|Quantization]]
## 🤖 Using LLMs
**When**: capacity planning estimates, config-file generation (deepspeed, accelerate), debugging OOM via log triage, scaling-law back-of-envelope.
**When not**: actually scheduling kernels. The runtime (PyTorch/vLLM) handles that scheduling deterministically.
## ❌ Anti-patterns
- **All-reduce on tiny gradients**: communication dominates; bucket gradients.
- **Naive batching for serving**: head-of-line blocking — use continuous batching.
- **Unbounded KV cache**: OOM at peak; use PagedAttention + admission control.
- **No prefix cache for shared system prompt**: up to 5-10x throughput left on the table when requests share a long prefix.
- **PP without enough microbatches**: bubble dominates pipeline.
- **MoE without expert balance loss**: dead experts, capacity waste.
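
The pipeline-bubble anti-pattern can be quantified. Assuming equal per-stage compute time and a non-interleaved schedule (GPipe or 1F1B), the idle fraction of a step is `(p - 1) / (m + p - 1)` for `p` stages and `m` micro-batches. The helper below is a sketch under those assumptions.

```python
def bubble_fraction(pp_stages: int, microbatches: int) -> float:
    """Idle fraction of a pipeline step, assuming equal stage times
    and a non-interleaved schedule."""
    if pp_stages < 1 or microbatches < 1:
        raise ValueError("pp_stages and microbatches must be >= 1")
    return (pp_stages - 1) / (microbatches + pp_stages - 1)


for m in (8, 64, 1024):
    print(f"p=8, m={m}: {bubble_fraction(8, m):.1%} idle")
```

With 8 stages, 8 micro-batches leave nearly half the step idle, while 1024 micro-batches push the bubble below 1%.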
## 🧪 Verification / duplicates
- Verified (Megatron-LM/Megatron-Core, vLLM paper SOSP 2023, Chinchilla 2022, NVIDIA H100/B200 perf guides 2026).
- Trust level: A.
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — FSDP/ZeRO/TP/PP, vLLM/SGLang, disaggregated serving |