---
id: wiki-2026-0508-scalability-in-ai-systems
title: Scalability in AI Systems
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [AI Scalability, Distributed Training, LLM Scaling]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [scalability, distributed-training, inference, vllm, fsdp, llm]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: PyTorch / vLLM / DeepSpeed
---

# Scalability in AI Systems

## One-line summary

> **"Satisfy four axes at once: model size × data × users × latency. The work splits into two branches: scaling training from a single GPU to multi-node (FSDP/ZeRO/TP/PP), and serving inference at 10K+ QPS (vLLM/SGLang/PagedAttention)."**

From GPT-3 (175B) in 2020 to the trillion-parameter-class MoE models of 2026 (Llama 4 Behemoth, Claude Opus 4.7, GPT-5), scaling laws (Chinchilla, Hoffmann et al. 2022) have been the industry's compass.

## Core concepts

### Training scalability axes

- **Data parallel (DP)**: replicate the model, shard the batch, and all-reduce the gradients.
- **Tensor parallel (TP)**: split a single layer across GPUs (Megatron-LM).
- **Pipeline parallel (PP)**: split layers across stages (GPipe, 1F1B).
- **FSDP / ZeRO**: shard model state across data-parallel ranks. ZeRO-1 shards optimizer state, ZeRO-2 adds gradients, and ZeRO-3 (like FSDP) adds parameters.
- **Sequence/Context parallel**: shard along the sequence dimension (Ring Attention, DeepSpeed-Ulysses).
- **MoE expert parallel**: place experts on different GPUs and route each token to its expert subset.

### Inference scalability

- **Continuous batching**: vLLM / TGI schedule at the token (iteration) level, so short requests are not blocked behind long ones.
- **PagedAttention**: the KV cache is stored in fixed-size blocks, like virtual-memory pages, which reduces fragmentation and raises concurrency.
- **Prefix caching**: reuse the KV cache of a shared prefix such as a system prompt (vLLM, SGLang).
- **Speculative decoding**: a small draft model proposes tokens and the large model verifies them, for a typical 2-3x decode speedup.
- **Quantization**: FP8, INT4 (AWQ, GPTQ), MX formats (MXFP4). FP8 is native on Hopper (H100/H200); MXFP4 is native on Blackwell (B200).
- **Disaggregated serving**: separate prefill nodes from decode nodes (Splitwise, DistServe).

### Applications

1. Pretraining 70B+ models on a 256+ GPU cluster.
2. LoRA/QLoRA fine-tuning on a single H100 (24GB shard).
3. Chatbot serving at 100K+ QPS on a vLLM cluster.
4. RAG over 1M+ documents with a sharded vector DB.
5. Multi-tenant inference with tenant-aware caching.

## 💻 Patterns

### FSDP2 with PyTorch (2026 idiom)

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import fully_shard, MixedPrecisionPolicy
from transformers import LlamaForCausalLM

# Assumes a torchrun launch, which sets the rank/world-size environment variables.
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-3.1-70B")

mp_policy = MixedPrecisionPolicy(
    param_dtype=torch.bfloat16,   # compute in bf16
    reduce_dtype=torch.float32,   # reduce gradients in fp32
)

# Shard each transformer block first, then the root module.
for layer in model.model.layers:
    fully_shard(layer, mp_policy=mp_policy)
fully_shard(model, mp_policy=mp_policy)

# Gradient checkpointing trades recompute for activation memory.
model.gradient_checkpointing_enable()
```

### DeepSpeed ZeRO-3 config

```json
{
  "train_batch_size": 1024,
  "gradient_accumulation_steps": 4,
  "bf16": {"enabled": true},
  "zero_optimization": {
    "stage": 3,
    "offload_param": {"device": "cpu"},
    "offload_optimizer": {"device": "nvme", "nvme_path": "/mnt/nvme"},
    "overlap_comm": true,
    "contiguous_gradients": true
  }
}
```

### vLLM serving (continuous batching + prefix cache)

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,
    enable_prefix_caching=True,
    max_model_len=128000,
    quantization="fp8",
    gpu_memory_utilization=0.92,
)

prompts = ["Explain PagedAttention in one sentence."]  # placeholder input
out = llm.generate(prompts, SamplingParams(max_tokens=512, temperature=0.7))
```

### SGLang RadixAttention (shared prefix tree)

```python
import sglang as sgl

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

@sgl.function
def multi_turn(s, system, turns):
    s += sgl.system(system)  # cached across all calls
    for t in turns:
        s += sgl.user(t["q"]) + sgl.assistant(sgl.gen(max_tokens=256))

# Throughput improves most when many users share the same system prompt.
```

### Speculative decoding (vLLM)

```python
from vllm import LLM

# Argument names depend on the vLLM version: newer releases group these
# options under a single `speculative_config` dict and no longer accept
# `use_v2_block_manager`.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    speculative_model="meta-llama/Llama-3.2-1B-Instruct",
    num_speculative_tokens=5,
    use_v2_block_manager=True,
)
```

### Megatron-LM tensor + pipeline parallel

```bash
# 8 nodes x 8 GPUs = 64 GPUs = TP 8 x PP 8 (data-parallel size 1).
# Rendezvous, data, and tokenizer arguments are omitted.
torchrun --nproc_per_node=8 --nnodes=8 \
  pretrain_gpt.py \
  --tensor-model-parallel-size 8 \
  --pipeline-model-parallel-size 8 \
  --num-layers 80 --hidden-size 8192 \
  --num-attention-heads 64 --seq-length 8192 \
  --micro-batch-size 1 --global-batch-size 1024 \
  --bf16 --use-flash-attn
```

### Distributed inference on K8s (vLLM + Ray Serve)

```python
from ray import serve
from vllm import AsyncEngineArgs, AsyncLLMEngine
# Whether api_server exposes a module-level `app` depends on the vLLM version.
from vllm.entrypoints.openai.api_server import app as vllm_app

@serve.deployment(num_replicas=4, ray_actor_options={"num_gpus": 2})
@serve.ingress(vllm_app)
class LLMService:
    def __init__(self, model):
        # Each replica holds 2 GPUs and runs one engine with TP=2.
        self.engine = AsyncLLMEngine.from_engine_args(
            AsyncEngineArgs(model=model, tensor_parallel_size=2)
        )
```

### KV cache sizing rule of thumb

```python
def kv_cache_bytes(layers, hidden_dim, num_heads, seq_len, batch,
                   dtype_bytes=2, kv_heads=None):
    head_dim = hidden_dim // num_heads
    kv_heads = kv_heads or num_heads  # GQA: kv_heads < num_heads
    # Factor 2 covers keys and values.
    return 2 * batch * layers * seq_len * kv_heads * head_dim * dtype_bytes

# Llama-3-70B (80 layers, hidden 8192, 64 heads, 8 KV heads) at 128K context,
# batch=1, bf16: ~40 GiB of KV cache (about half that with an FP8 KV cache).
```

### Disaggregated prefill/decode (Splitwise pattern)

```python
# Prefill cluster: H100, optimized for compute
# Decode cluster: H100/L40S, optimized for memory bandwidth
# KV transfer over NVLink (within a node) or RDMA (across nodes)
# prefill_client and decode_client are hypothetical RPC clients, not a real API.
async def generate(prompt, prefill_client, decode_client, max_tokens=512):
    prefill_resp = await prefill_client.prefill(prompt)  # returns a KV blocks ref
    return decode_client.decode(kv_ref=prefill_resp.kv_ref, max_tokens=max_tokens)
```

## Decision criteria

| Situation | Approach |
|---|---|
| Training a <13B model | DDP / FSDP1, single node with 8 GPUs |
| Training a 70B model | FSDP2 / ZeRO-3, multi-node, with activation checkpointing |
| 400B+ MoE | TP + PP + EP + ZeRO-1 (Megatron-Core / NeMo) |
| Chatbot serving | vLLM + prefix cache + speculative decoding |
| Long context (1M+) | Ring attention / context parallel |
| High concurrency | SGLang RadixAttention + disaggregated serving |

**Default**: FSDP2 for training under 100B; vLLM with FP8 + prefix cache + speculative decoding for serving.

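
The training rows of the decision table come down largely to per-GPU memory for model state. A minimal sketch of that estimate follows, assuming mixed-precision Adam as accounted for in the ZeRO paper: 2 bytes per parameter for bf16 weights, 2 for bf16 gradients, and 12 for fp32 master weights, momentum, and variance. The function name `training_state_gib` is illustrative rather than part of any library, and activations, buffers, and fragmentation are ignored, so real usage will be higher.

```python
GIB = 2**30


def training_state_gib(params: float, num_gpus: int, zero_stage: int) -> float:
    """Per-GPU GiB for weights, gradients, and Adam state (activations excluded).

    Assumes 2 bytes/param for bf16 weights, 2 for bf16 gradients, and
    12 for fp32 master weights + momentum + variance.
    """
    weights = 2.0 * params
    grads = 2.0 * params
    optim = 12.0 * params
    if zero_stage >= 1:  # ZeRO-1 shards optimizer state
        optim /= num_gpus
    if zero_stage >= 2:  # ZeRO-2 also shards gradients
        grads /= num_gpus
    if zero_stage >= 3:  # ZeRO-3 / FSDP also shards parameters
        weights /= num_gpus
    return (weights + grads + optim) / GIB


# Compare sharding stages for a 70B model on 64 GPUs.
for stage in range(4):
    gib = training_state_gib(70e9, num_gpus=64, zero_stage=stage)
    print(f"70B, 64 GPUs, ZeRO-{stage}: {gib:,.1f} GiB per GPU")
```

Under these assumptions, a 70B model needs over 1,000 GiB of model state per GPU when unsharded, and roughly 16 GiB per GPU at ZeRO-3 on 64 GPUs. That gap is why the 70B row calls for FSDP2 / ZeRO-3 across multiple nodes.
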
## 🔗 Graph

- Parent: [[Distributed-Systems]]
- Variants: [[Pipeline-Parallelism]]
- Applications: [[Fine-tuning]]
- Adjacent: [[LLM_Optimization_and_Deployment_Strategies|vLLM]] · [[Flash Attention]] · [[LLM_Optimization_and_Deployment_Strategies|Quantization]]

## 🤖 LLM usage

**When**: capacity planning estimates, config-file generation (deepspeed, accelerate), debugging OOM via log triage, scaling-law back-of-envelope calculations.

**When not**: actually scheduling kernels; the orchestrator (PyTorch/vLLM) handles deterministic scheduling.

## ❌ Anti-patterns

- **All-reduce on tiny gradients**: communication dominates; bucket the gradients.
- **Naive batching for serving**: causes head-of-line blocking; use continuous batching.
- **Unbounded KV cache**: OOM at peak load; use PagedAttention plus admission control.
- **No prefix cache for a shared system prompt**: can leave 5-10x throughput on the table.
- **PP without enough microbatches**: the pipeline bubble dominates. With p stages and m microbatches the idle fraction is (p - 1) / (m + p - 1), so m should be much larger than p.
- **MoE without an expert balance loss**: dead experts and wasted capacity.

## 🧪 Verification / Duplicates

- Verified (Megatron-LM/Megatron-Core, vLLM paper SOSP 2023, Chinchilla 2022, NVIDIA H100/B200 perf guides 2026).
- Trust level A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: FSDP/ZeRO/TP/PP, vLLM/SGLang, disaggregated serving |