"매 model size × data × users × latency 의 4축 동시 만족 — 매 single GPU → multi-node training (FSDP/ZeRO/TP/PP) 과 매 10K+ QPS inference (vLLM/SGLang/PagedAttention) 의 양 갈래". 매 2020 GPT-3 (175B) 부터 매 2026 trillion-param MoE (Llama 4 Behemoth, Claude Opus 4.7, GPT-5) 까지 매 scaling laws (Chinchilla, Hoffmann 2022) 가 산업 의 compass.
Key concepts
Training scalability axes
Data parallel (DP): replicate the model on every GPU, shard the batch, and all-reduce the gradients.
Tensor parallel (TP): split single layer across GPUs (Megatron-LM).
Pipeline parallel (PP): split layers across stages (GPipe, 1F1B).
Disaggregated serving (an inference-side axis): separate prefill nodes from decode nodes (Splitwise, DistServe).
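The DP, TP, and PP axes listed above can be illustrated with a toy, framework-free sketch. It assumes the batch and the weight columns divide evenly across ranks, simulates the collectives with plain Python lists, and uses function names invented for this example (none come from Megatron-LM or GPipe).

```python
def grad_mse(w, xs, ys):
    """d/dw of mean((w*x - y)^2) over one batch of a scalar linear model."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def data_parallel_grad(w, xs, ys, world_size):
    """DP: each rank sees one batch shard; an all-reduce averages the gradients."""
    shard = len(xs) // world_size  # assumes the batch divides evenly
    local = [
        grad_mse(w, xs[r * shard:(r + 1) * shard], ys[r * shard:(r + 1) * shard])
        for r in range(world_size)
    ]
    return sum(local) / world_size  # value every rank holds after all-reduce(mean)


def matvec(x, weight):
    """y = x @ weight, with weight stored as a list of rows."""
    return [sum(x[i] * weight[i][j] for i in range(len(x)))
            for j in range(len(weight[0]))]


def tensor_parallel_matvec(x, weight, world_size):
    """TP (column split): rank r owns a slice of output columns; all-gather concatenates."""
    cols = len(weight[0]) // world_size  # assumes the columns divide evenly
    out = []
    for r in range(world_size):
        local_weight = [row[r * cols:(r + 1) * cols] for row in weight]
        out.extend(matvec(x, local_weight))  # concatenation stands in for all-gather
    return out


def pipeline_bubble_fraction(stages, microbatches):
    """PP: fraction of a step each stage idles under a GPipe or 1F1B schedule."""
    return (stages - 1) / (microbatches + stages - 1)
```

The DP and TP functions return the same result as their single-device counterparts, which is the point: parallelism changes where the work runs, not what is computed. The bubble formula shows why PP needs many microbatches; with 4 stages and 12 microbatches, a fifth of each step is idle.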
Applications
Pretraining 70B+ on 256+ GPU cluster.
LoRA/QLoRA fine-tuning on a single H100 (80 GB), or on a 24 GB GPU with QLoRA for smaller models.
Chatbot serving for 100K+ concurrent users on a vLLM cluster.
RAG at 1M+ docs with sharded vector DB.
Multi-tenant inference with tenant-aware caching.
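To see why the pretraining item above calls for a large cluster, a back-of-the-envelope memory estimate helps. The sketch assumes the common accounting of 16 bytes per parameter for mixed-precision Adam (2 for bf16 weights, 2 for bf16 gradients, 12 for the fp32 master copy plus the two Adam moments) and ignores activations, buffers, and fragmentation, so real usage is higher. The function names are illustrative.

```python
# bf16 weights + bf16 grads + fp32 master weights, Adam m, Adam v
BYTES_PER_PARAM = 2 + 2 + 12


def train_state_gb(params: float) -> float:
    """Total model, gradient, and optimizer state in GB (activations excluded)."""
    return params * BYTES_PER_PARAM / 1e9


def per_gpu_gb(params: float, num_gpus: int) -> float:
    """Per-GPU share when all three kinds of state are fully sharded (ZeRO-3 / FSDP)."""
    return train_state_gb(params) / num_gpus


if __name__ == "__main__":
    for params, gpus in [(13e9, 8), (70e9, 8), (70e9, 256)]:
        print(f"{params / 1e9:.0f}B on {gpus} GPUs: {per_gpu_gb(params, gpus):.1f} GB/GPU")
```

Under these assumptions a 70B model carries about 1,120 GB of training state: 140 GB per GPU on a single 8-GPU node, which exceeds an 80 GB H100, but only about 4.4 GB per GPU when sharded across 256 GPUs.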
💻 Patterns
FSDP2 with PyTorch (2026 idiom)
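    # Launch with torchrun so every rank has an initialized process group and device.
    import torch
    # Public import path in recent PyTorch releases; older versions expose
    # fully_shard under torch.distributed._composable.fsdp instead.
    from torch.distributed.fsdp import fully_shard, MixedPrecisionPolicy
    from transformers import LlamaForCausalLM

    model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-3.1-70B")

    # Compute in bf16, but reduce gradients in fp32 for numerical stability.
    mp_policy = MixedPrecisionPolicy(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.float32,
    )

    # Shard each transformer block first, then the root module.
    for layer in model.model.layers:
        fully_shard(layer, mp_policy=mp_policy)
    fully_shard(model, mp_policy=mp_policy)

    # Gradient checkpointing trades recomputation for activation memory.
    model.gradient_checkpointing_enable()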
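What full sharding does to parameters and gradients can be sketched without any GPU. The toy below simulates the two collectives that FSDP-style sharding relies on: an all-gather that rebuilds full parameters before compute, and a reduce-scatter that leaves each rank with only its slice of the summed gradient. It is a conceptual model with invented function names, not PyTorch's implementation; it assumes sizes divide evenly (real implementations pad) and sums rather than averages the gradients.

```python
def shard(flat, world_size):
    """Split a flat parameter list into equal per-rank shards."""
    n = len(flat) // world_size  # assumes even division
    return [flat[r * n:(r + 1) * n] for r in range(world_size)]


def all_gather(shards):
    """Every rank reconstructs the full parameter from all shards."""
    return [value for s in shards for value in s]


def reduce_scatter(per_rank_grads, world_size):
    """Sum full gradients across ranks; rank r keeps only its shard of the sum."""
    summed = [sum(values) for values in zip(*per_rank_grads)]
    return shard(summed, world_size)


if __name__ == "__main__":
    world_size = 4
    params = [float(i) for i in range(8)]
    shards = shard(params, world_size)           # steady state: 1/4 of params per rank
    full = all_gather(shards)                    # materialized only around compute
    grads = [[1.0] * len(full)] * world_size     # each rank's full local gradient
    grad_shards = reduce_scatter(grads, world_size)
    print(shards[0], grad_shards[0])             # rank 0 updates only its own shard
```

Because each rank stores and updates only its shard between steps, parameter, gradient, and optimizer memory all scale down with the number of ranks, at the cost of the extra all-gather traffic.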
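SGLang shared-prefix caching (RadixAttention)
    import sglang as sgl

    # Assumes an SGLang server is already listening on this port.
    sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

    @sgl.function
    def multi_turn(s, system, turns):
        s += sgl.system(system)  # shared prefix: its KV cache is reused across calls
        for t in turns:
            s += sgl.user(t["q"]) + sgl.assistant(sgl.gen(max_tokens=256))

    # Throughput rises sharply when many users share the same system prompt.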
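The reason a shared system prompt helps can be shown with a toy prefix trie. This is a simplified stand-in for the idea behind prefix caching, not SGLang's RadixAttention implementation: it only counts how many tokens of each request would still need prefill, and the class and method names are invented for illustration.

```python
class PrefixCache:
    """Toy token-prefix trie standing in for a KV cache that reuses shared prefixes."""

    def __init__(self):
        self.root = {}

    def insert_and_count_new(self, tokens):
        """Insert a token sequence; return how many tokens were not already cached."""
        node, new_tokens = self.root, 0
        for tok in tokens:
            if tok not in node:
                node[tok] = {}      # cache miss: this token would need prefill
                new_tokens += 1
            node = node[tok]
        return new_tokens


if __name__ == "__main__":
    cache = PrefixCache()
    system = list(range(100))                    # 100-token shared system prompt
    first = cache.insert_and_count_new(system + [1000, 1001])
    second = cache.insert_and_count_new(system + [2000])
    print(first, second)                         # the second request reuses the prefix
```

The first request pays for the whole prompt, while later requests with the same system prompt only pay for their own suffix, so prefill cost per request shrinks as more users share the prefix.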
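Disaggregated prefill/decode (pseudocode)
    # Prefill cluster: H100, optimized for compute
    # Decode cluster: H100/L40S, where decoding is bound by memory bandwidth
    # KV transfer over NVLink/RDMA between clusters
    # prefill_client and decode_client are hypothetical RPC clients, not a real API.
    async def generate(prompt):
        prefill_resp = await prefill_client.prefill(prompt)  # returns a reference to KV blocks
        decode_stream = decode_client.decode(
            kv_ref=prefill_resp.kv_ref,
            max_tokens=512,
        )
        async for token in decode_stream:
            yield token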
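How much data the prefill-to-decode handoff moves can be estimated from the model shape. The sketch below uses the standard KV cache size formula (keys plus values, per layer, per KV head) and plugs in values assumed from the published Llama-3.1-70B configuration (80 layers, 8 KV heads with grouped-query attention, head dimension 128, bf16). The 50 GB/s link speed is a hypothetical figure chosen only for illustration.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    """KV cache bytes for one token: keys and values for every layer and KV head."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def kv_transfer_gib(prompt_tokens, **model_shape):
    """Size in GiB of the KV cache that prefill must hand to decode."""
    return prompt_tokens * kv_bytes_per_token(**model_shape) / 2**30


def transfer_ms(num_bytes, link_gb_per_s):
    """Idealized transfer time in milliseconds, ignoring latency and protocol overhead."""
    return num_bytes / (link_gb_per_s * 1e9) * 1e3


if __name__ == "__main__":
    shape = dict(layers=80, kv_heads=8, head_dim=128)  # assumed Llama-3.1-70B values
    size_gib = kv_transfer_gib(4096, **shape)
    ms = transfer_ms(size_gib * 2**30, link_gb_per_s=50)  # hypothetical 50 GB/s link
    print(f"{size_gib:.2f} GiB, about {ms:.0f} ms")
```

Under these assumptions a 4,096-token prompt produces about 1.25 GiB of KV cache, so the interconnect between the prefill and decode pools is a first-order design constraint rather than a detail.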
Decision criteria
| Situation | Approach |
| --- | --- |
| Training a <13B model | DDP / FSDP1 on a single 8-GPU node |
| Training a 70B model | FSDP2 / ZeRO-3 multi-node + activation checkpointing |
| 400B+ MoE | TP + PP + EP + ZeRO-1 (Megatron-Core / NeMo) |
| Chatbot serving | vLLM + prefix cache + speculative decoding |
| Long context (1M+) | Ring attention / context parallel |
| High concurrency | SGLang RadixAttention + disaggregated serving |
Default: FSDP2 for training models under 100B parameters; vLLM with FP8, prefix caching, and speculative decoding for serving.