2nd/10_Wiki/Topics/AI_and_ML/Scalability-in-AI-Systems.md
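koriweb d8a80f6272 chore(wiki): canonical normalization of dangling links (768 files / 1,200 links)
Replaced [[wikilinks]] that differed only in name (spelling variants) with the target document's canonical title,
reconnecting 1,200 broken links. Only normalized title/filename matches were applied; alias matching was
excluded because of over-merge risk (ambiguity guard). Originals are backed up in _link_reconcile_backup/.
Tool: Datacollect/scripts/link_reconcile_apply.mjs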

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 12:24:15 +09:00


id: wiki-2026-0508-scalability-in-ai-systems
title: Scalability in AI Systems
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: AI Scalability, Distributed Training, LLM Scaling
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: scalability, distributed-training, inference, vllm, fsdp, llm
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language Python; framework PyTorch / vLLM / DeepSpeed

Scalability in AI Systems

In one line

"Satisfy four axes at once (model size × data × users × latency), along two branches: training that scales from a single GPU to multiple nodes (FSDP/ZeRO/TP/PP), and inference at 10K+ QPS (vLLM/SGLang/PagedAttention)." From GPT-3 (175B) in 2020 to the 2026 frontier models described as trillion-parameter MoE (Llama 4 Behemoth, Claude Opus 4.7, GPT-5), scaling laws (Chinchilla; Hoffmann et al., 2022) have served as the industry's compass.

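As a back-of-envelope illustration of how the Chinchilla result is commonly applied: training compute is often approximated as C ≈ 6·N·D FLOPs, with a compute-optimal token budget near 20 tokens per parameter. Both constants are widely quoted rules of thumb, not values stated on this page, and the function names below are illustrative.

```python
def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Approximate compute-optimal token count (assumed ~20 tokens per parameter)."""
    return n_params * tokens_per_param


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute using the common C ~= 6 * N * D estimate."""
    return 6.0 * n_params * n_tokens


if __name__ == "__main__":
    n = 70e9  # a 70B-parameter dense model
    d = chinchilla_optimal_tokens(n)
    print(f"tokens: {d:.2e}, training FLOPs: {training_flops(n, d):.2e}")
```

The estimate ignores MoE sparsity, where only the active parameters per token should enter the 6·N·D term.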
Key points

Training scalability axes

  • Data parallel (DP): replicate model, shard batch — gradient all-reduce.
  • Tensor parallel (TP): split single layer across GPUs (Megatron-LM).
  • Pipeline parallel (PP): split layers across stages (GPipe, 1F1B).
  • FSDP / ZeRO: shard params + grads + optimizer state (ZeRO-1/2/3).
  • Sequence/Context parallel: shard along sequence (Ring Attention, DeepSpeed-Ulysses).
  • MoE expert parallel: route tokens to expert subsets across GPUs.
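To make the FSDP / ZeRO bullet concrete, here is a minimal sketch of per-GPU model-state memory under each ZeRO stage. It assumes mixed-precision Adam with the accounting used in the ZeRO paper (2 bytes per parameter for 16-bit weights, 2 for 16-bit gradients, 12 for fp32 optimizer state) and ignores activations, buffers, and fragmentation. The function name is illustrative.

```python
def zero_model_state_bytes(n_params: float, n_gpus: int, stage: int) -> float:
    """Per-GPU bytes for params + grads + optimizer state under ZeRO stage 0-3."""
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    params = 2.0 * n_params   # 16-bit weights
    grads = 2.0 * n_params    # 16-bit gradients
    optim = 12.0 * n_params   # fp32 master weights + Adam momentum + variance
    if stage >= 1:
        optim /= n_gpus       # ZeRO-1 shards optimizer state
    if stage >= 2:
        grads /= n_gpus       # ZeRO-2 also shards gradients
    if stage >= 3:
        params /= n_gpus      # ZeRO-3 also shards parameters
    return params + grads + optim


if __name__ == "__main__":
    for s in range(4):
        gb = zero_model_state_bytes(7.5e9, 64, s) / 1e9
        print(f"ZeRO-{s}: {gb:.1f} GB per GPU")
```

FSDP with full sharding corresponds roughly to the stage-3 row.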

Inference scalability

  • Continuous batching (vLLM, TGI): scheduling at token-step granularity, so no head-of-line blocking.
  • PagedAttention: KV cache paged like virtual memory → high concurrency.
  • Prefix caching: shared system prompt cache (vLLM, SGLang).
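  • Speculative decoding: a small draft model proposes tokens and the large model verifies them in one pass; typically cited at 2-3x speedup.
  • Quantization: FP8, INT4 (AWQ, GPTQ), MX formats (MXFP4). FP8 is native on Hopper (H100/H200); FP4/MXFP4 is native on Blackwell (B200).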
  • Disaggregated serving: prefill nodes vs decode nodes (Splitwise, DistServe).
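The PagedAttention idea in the list above can be illustrated with a toy block allocator: KV memory is carved into fixed-size blocks, and each sequence holds a block table instead of one contiguous region. This is a simplified sketch under assumed names (`PagedKVAllocator`, a block size of 16 tokens), not vLLM's actual implementation.

```python
class PagedKVAllocator:
    """Toy paged KV-cache bookkeeping: block tables over a shared free list."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}
        self.seq_lens: dict[str, int] = {}

    def append_token(self, seq_id: str) -> bool:
        """Reserve room for one more token; return False if no block is free."""
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:  # last block is full, or none allocated yet
            if not self.free_blocks:
                return False  # caller should queue or preempt (admission control)
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return True

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free list."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

Because blocks are allocated on demand, a sequence wastes at most one partially filled block, which is what allows many more concurrent sequences than reserving the maximum length up front.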

Applications

  1. Pretraining a 70B+ model on a cluster of 256+ GPUs.
  2. LoRA/QLoRA fine-tuning on a single GPU (an 80 GB H100; QLoRA also fits smaller models on 24 GB cards).
  3. Chatbot serving at 100K+ QPS on a vLLM cluster.
  4. RAG at 1M+ docs with sharded vector DB.
  5. Multi-tenant inference with tenant-aware caching.

💻 Patterns

FSDP2 with PyTorch (2026 idiom)

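import torch
from torch.distributed.fsdp import fully_shard, MixedPrecisionPolicy  # public in PyTorch 2.6+
from transformers import LlamaForCausalLM

# Assumes launch via torchrun with torch.distributed already initialized.
model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-3.1-70B")
mp_policy = MixedPrecisionPolicy(
    param_dtype=torch.bfloat16,   # parameters are all-gathered in bf16 for compute
    reduce_dtype=torch.float32,   # gradients are reduce-scattered in fp32
)
# Shard each decoder layer first, then the root module for the remaining parameters
for layer in model.model.layers:
    fully_shard(layer, mp_policy=mp_policy)
fully_shard(model, mp_policy=mp_policy)

# Gradient checkpointing trades recomputation for activation memory
model.gradient_checkpointing_enable()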

DeepSpeed ZeRO-3 config

{
  "train_batch_size": 1024,
  "gradient_accumulation_steps": 4,
  "bf16": {"enabled": true},
  "zero_optimization": {
    "stage": 3,
    "offload_param": {"device": "cpu"},
    "offload_optimizer": {"device": "nvme", "nvme_path": "/mnt/nvme"},
    "overlap_comm": true,
    "contiguous_gradients": true
  }
}
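DeepSpeed expects `train_batch_size` to equal `train_micro_batch_size_per_gpu × gradient_accumulation_steps × world_size` (data-parallel ranks). The config above leaves the micro-batch implicit, so a small sketch like the following can derive it; the 64-GPU world size is an assumption chosen for illustration, and the helper name is hypothetical.

```python
def micro_batch_per_gpu(train_batch_size: int, grad_accum_steps: int, world_size: int) -> int:
    """Derive the per-GPU micro-batch implied by a DeepSpeed batch configuration."""
    denom = grad_accum_steps * world_size
    if train_batch_size % denom != 0:
        raise ValueError(
            f"train_batch_size={train_batch_size} is not divisible by "
            f"grad_accum_steps * world_size = {denom}"
        )
    return train_batch_size // denom


if __name__ == "__main__":
    # Values from the config above, with an assumed 64 data-parallel GPUs
    print(micro_batch_per_gpu(train_batch_size=1024, grad_accum_steps=4, world_size=64))
```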

vLLM serving (continuous batching + prefix cache)

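from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,        # shard the model across 4 GPUs
    enable_prefix_caching=True,    # reuse KV blocks for shared prompt prefixes
    max_model_len=128000,
    quantization="fp8",
    gpu_memory_utilization=0.92,
)
prompts = ["Explain PagedAttention in one sentence."]  # placeholder prompt list
out = llm.generate(prompts, SamplingParams(max_tokens=512, temperature=0.7))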

SGLang RadixAttention (shared prefix tree)

import sglang as sgl

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

@sgl.function
def multi_turn(s, system, turns):
    s += sgl.system(system)  # cached across all calls
    for t in turns:
        s += sgl.user(t["q"]) + sgl.assistant(sgl.gen(max_tokens=256))

# Massive throughput when many users share system prompt

Speculative decoding (vLLM)

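from vllm import LLM

# Argument names depend on the vLLM version: the form below matches older releases,
# while newer releases take a speculative_config dict instead. Check your version.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    speculative_model="meta-llama/Llama-3.2-1B-Instruct",  # small draft model
    num_speculative_tokens=5,                               # draft tokens per step
    use_v2_block_manager=True,  # needed by early releases; later deprecated
)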
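How much a draft model helps depends on how often its tokens are accepted. A minimal sketch of the standard estimate from the speculative decoding literature (Leviathan et al., 2023), assuming an independent per-token acceptance rate `alpha` and a draft step that costs `draft_cost_ratio` of a target step; both inputs are values you would need to measure, and the function names are illustrative.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model pass with k draft tokens."""
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost_ratio: float) -> float:
    """Estimated speedup over plain decoding, charging k draft steps per target pass."""
    return expected_tokens_per_step(alpha, k) / (k * draft_cost_ratio + 1.0)


if __name__ == "__main__":
    # Assumed numbers: 80% acceptance, 5 draft tokens, draft step = 5% of a target step
    print(f"{estimated_speedup(alpha=0.8, k=5, draft_cost_ratio=0.05):.2f}x")
```

With a low acceptance rate the ratio drops toward or below 1, which is why the draft model should share the target model's tokenizer and training distribution.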

Megatron-LM tensor + pipeline parallel

torchrun --nproc_per_node=8 --nnodes=8 \
  pretrain_gpt.py \
  --tensor-model-parallel-size 8 \
  --pipeline-model-parallel-size 8 \
  --num-layers 80 --hidden-size 8192 \
  --num-attention-heads 64 --seq-length 8192 \
  --micro-batch-size 1 --global-batch-size 1024 \
  --bf16 --use-flash-attn
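The launch above implies a specific layout: 8 × 8 = 64 GPUs, all consumed by TP × PP, leaving a data-parallel size of 1. A small sanity check like the following (a sketch with illustrative names) derives the data-parallel size, the number of microbatches per step, and the pipeline bubble fraction (p − 1) / (m + p − 1) for a non-interleaved schedule.

```python
def parallel_layout(world_size: int, tp: int, pp: int,
                    global_batch: int, micro_batch: int) -> dict:
    """Derive DP size, microbatches per step, and pipeline bubble fraction."""
    if world_size % (tp * pp) != 0:
        raise ValueError("world_size must be divisible by tp * pp")
    dp = world_size // (tp * pp)
    if global_batch % (micro_batch * dp) != 0:
        raise ValueError("global_batch must be divisible by micro_batch * dp")
    microbatches = global_batch // (micro_batch * dp)
    # Fraction of each step the pipeline spends idle (non-interleaved 1F1B / GPipe)
    bubble = (pp - 1) / (microbatches + pp - 1)
    return {"dp": dp, "microbatches": microbatches, "bubble_fraction": bubble}


if __name__ == "__main__":
    # 8 nodes x 8 GPUs, TP=8 (kept inside one node), PP=8, as in the command above
    print(parallel_layout(world_size=64, tp=8, pp=8, global_batch=1024, micro_batch=1))
```

With 1024 microbatches the bubble is well under 1%; shrinking the global batch or raising the data-parallel size reduces the microbatch count and makes the bubble grow.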

Distributed inference K8s (vLLM + Ray Serve)

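from ray import serve
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
# Whether api_server exposes a module-level FastAPI `app` depends on the vLLM version.
from vllm.entrypoints.openai.api_server import app as vllm_app

@serve.deployment(num_replicas=4, ray_actor_options={"num_gpus": 2})
@serve.ingress(vllm_app)
class LLMService:
    def __init__(self, model: str):
        # One engine per replica; each replica spans 2 GPUs via tensor parallelism
        self.engine = AsyncLLMEngine.from_engine_args(
            AsyncEngineArgs(model=model, tensor_parallel_size=2)
        )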

KV cache sizing rule of thumb

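def kv_cache_bytes(layers, hidden_dim, num_heads, seq_len, batch, dtype_bytes=2, kv_heads=None):
    head_dim = hidden_dim // num_heads
    kv_dim = (kv_heads or num_heads) * head_dim  # GQA: kv_heads < num_heads
    return 2 * batch * layers * seq_len * kv_dim * dtype_bytes  # leading 2 = K and V
# Llama-3-70B (80 layers, hidden 8192, 64 heads, 8 KV heads) at 128K context, batch=1, bf16: ~40 GiB KV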
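The same rule of thumb can be inverted into a capacity estimate: given a KV memory budget, how many sequences of a given length fit at once. The sketch below assumes Llama-3-70B dimensions (80 layers, 8 KV heads, head dimension 128) and a hypothetical 160 GiB of KV budget across the serving GPUs; the function names are illustrative.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers (K and V)."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def max_concurrent_sequences(kv_budget_bytes: int, tokens_per_seq: int,
                             layers: int, kv_heads: int, head_dim: int,
                             dtype_bytes: int = 2) -> int:
    """How many sequences of tokens_per_seq fit in the KV budget."""
    per_seq = tokens_per_seq * kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes)
    return kv_budget_bytes // per_seq


if __name__ == "__main__":
    budget = 160 * 2**30  # assumed KV budget after weights and activations
    n = max_concurrent_sequences(budget, tokens_per_seq=8192,
                                 layers=80, kv_heads=8, head_dim=128)
    print(f"{n} concurrent 8K-token sequences")
```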

Disaggregated prefill/decode (Splitwise pattern)

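# Prefill cluster: H100, optimized for compute
# Decode cluster: H100/L40S, optimized for memory bandwidth
# KV transfer over NVLink (within a node) or RDMA (across nodes)
# prefill_client / decode_client are hypothetical RPC clients, not a real library API
async def generate(prompt, prefill_client, decode_client):
    prefill_resp = await prefill_client.prefill(prompt)  # returns a reference to KV blocks
    return decode_client.decode(kv_ref=prefill_resp.kv_ref, max_tokens=512)  # token stream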

Decision criteria

| Situation | Approach |
|---|---|
| Training a <13B model | DDP / FSDP1 on a single 8-GPU node |
| Training a 70B model | FSDP2 / ZeRO-3 multi-node + activation checkpointing |
| 400B+ MoE | TP + PP + EP + ZeRO-1 (Megatron-Core / NeMo) |
| Chatbot serving | vLLM + prefix cache + speculative decoding |
| Long context (1M+) | Ring attention / context parallel |
| High concurrency | SGLang RadixAttention + disaggregated serving |

Default: FSDP2 for training under 100B; vLLM with FP8 + prefix cache + speculative decoding for serving.

🔗 Graph

🤖 LLM usage

When to use: capacity-planning estimates, config-file generation (DeepSpeed, Accelerate), debugging OOMs via log triage, back-of-envelope scaling-law math. When not to use: actually scheduling kernels; the runtime (PyTorch/vLLM) handles scheduling deterministically.

Anti-patterns

  • All-reduce on tiny gradients: communication dominates; bucket gradients.
  • Naive batching for serving: head-of-line blocking — use continuous batching.
  • Unbounded KV cache: OOM at peak; use PagedAttention + admission control.
  • No prefix cache for shared system prompt: 5-10x throughput left on table.
  • PP without enough microbatches: bubble dominates pipeline.
  • MoE without expert balance loss: dead experts, capacity waste.
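The head-of-line blocking anti-pattern can be seen in a small simulation that counts decode steps for the same request mix under static and continuous batching. It is a toy model: every request needs a fixed number of decode steps (at least one), each step costs the same, and prefill is ignored. Function names are illustrative.

```python
from collections import deque


def static_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes before the next starts."""
    return sum(
        max(lengths[i:i + batch_size]) for i in range(0, len(lengths), batch_size)
    )


def continuous_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Finished requests leave after every step and waiting ones take their slot."""
    waiting = deque(lengths)
    running: list[int] = []
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())  # admit into free slots
        steps += 1
        running = [r - 1 for r in running if r > 1]  # drop requests that just finished
    return steps


if __name__ == "__main__":
    mix = [100, 5, 5, 5, 100, 5, 5, 5]  # two long requests among short ones
    print("static:", static_batching_steps(mix, batch_size=4))
    print("continuous:", continuous_batching_steps(mix, batch_size=4))
```

In the static case the short requests in the second batch wait behind the first long request; in the continuous case they start as soon as a slot frees up.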

🧪 Verification / duplicates

  • Verified (Megatron-LM/Megatron-Core, vLLM paper SOSP 2023, Chinchilla 2022, NVIDIA H100/B200 perf guides 2026).
  • Trust level: A.

🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: FSDP/ZeRO/TP/PP, vLLM/SGLang, disaggregated serving |