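"Parallelism in AI is not plain distributed computing; it is a combinatorial problem over 4 axes (Data, Tensor, Pipeline, Expert) × 2 modes (training/inference)." As of 2026, the reference points are Llama 3 405B training on 16,384 GPUs and GPT-5 inference, which reportedly combines expert sharding with speculative decoding. A mismatched parallel strategy makes ROI collapse.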
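Core concepts
The four axes
Data Parallel (DP): splits the batch across replicas and keeps the weights synchronized.
Tensor Parallel (TP): splits the weight matrices of a single layer.
Pipeline Parallel (PP): splits the layer stack into stages.
Expert Parallel (EP): splits the experts of an MoE model across ranks.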
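To make the axes concrete, here is a minimal sketch of how a flat rank could be mapped onto DP, PP and TP coordinates. The ordering (TP varies fastest, so each TP group occupies consecutive ranks) is an assumption of this sketch; real frameworks pick their own ordering, and EP is left out for brevity. The function names are illustrative.

```python
from itertools import product


def rank_to_coords(rank: int, dp: int, pp: int, tp: int) -> dict:
    """Map a flat rank to (dp, pp, tp) coordinates, with TP varying fastest."""
    if not 0 <= rank < dp * pp * tp:
        raise ValueError("rank out of range for this mesh")
    return {
        "dp": rank // (tp * pp),
        "pp": (rank // tp) % pp,
        "tp": rank % tp,
    }


def tp_groups(dp: int, pp: int, tp: int) -> list:
    """List the ranks of every TP group; each group is a run of consecutive ranks."""
    return [
        [(d * pp + p) * tp + t for t in range(tp)]
        for d, p in product(range(dp), range(pp))
    ]


# Example: 16 GPUs arranged as DP=2, PP=2, TP=4.
coords = rank_to_coords(13, dp=2, pp=2, tp=4)
groups = tp_groups(2, 2, 4)
print(coords, groups[0], groups[-1])
```

Keeping TP groups on consecutive ranks is what usually lets them sit inside one node, where the fastest interconnect is.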
Auxiliary axes
Sequence Parallel (SP): splits the token dimension for long contexts (Ring Attention, Ulysses).
Context Parallel (CP): partitions the attention KV so each rank holds only part of it.
ZeRO sharding (DP variant): shards optimizer state, gradients, and parameters across DP ranks.
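The effect of ZeRO sharding on per-GPU memory can be estimated with simple arithmetic. The sketch below assumes mixed-precision Adam, accounted as 2 bytes (half-precision parameters) + 2 bytes (gradients) + 12 bytes (FP32 master weights, momentum, variance) per parameter, and ignores activations and buffers. The function name and the example sizes are illustrative.

```python
def zero_bytes_per_gpu(n_params: float, dp: int, stage: int) -> float:
    """Model-state bytes per GPU under ZeRO stage 0-3 (activations not included)."""
    param, grad, opt = 2.0, 2.0, 12.0   # assumed bytes per parameter
    if stage >= 1:
        opt /= dp      # stage 1 shards optimizer state
    if stage >= 2:
        grad /= dp     # stage 2 also shards gradients
    if stage >= 3:
        param /= dp    # stage 3 also shards parameters
    return n_params * (param + grad + opt)


# Example: a 7.5B-parameter model on 64 data-parallel ranks.
for stage in range(4):
    gb = zero_bytes_per_gpu(7.5e9, 64, stage) / 1e9
    print(f"ZeRO-{stage}: {gb:.1f} GB per GPU")
```

Higher stages trade memory for extra communication, so the lowest stage that fits is usually the starting point.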
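Training vs inference
| Aspect | Training | Inference |
|---|---|---|
| Primary axes | DP + ZeRO | TP + EP |
| Batch | Large micro-batches | Small (often a single user request) |
| Memory | Activation-dominated | KV-cache-dominated |
| Communication | all-reduce (gradients) | all-reduce (TP), all-to-all (MoE) |
| Tools | DeepSpeed, Megatron, FSDP | vLLM, TensorRT-LLM, SGLang |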
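Because inference memory is dominated by the KV cache, a quick size estimate helps when choosing the TP degree. The sketch below assumes a Llama-style model with grouped-query attention; the configuration values (80 layers, 8 KV heads, head dimension 128, 2-byte elements) are assumptions chosen to resemble a 70B-class model, and the function name is illustrative.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int,
                   bytes_per_elem: int = 2, tp: int = 1) -> int:
    """KV cache bytes per GPU, assuming KV heads are split evenly across TP ranks."""
    if kv_heads % tp != 0:
        raise ValueError("this sketch assumes kv_heads is divisible by tp")
    per_token = 2 * layers * (kv_heads // tp) * head_dim * bytes_per_elem  # K and V
    return per_token * seq_len * batch


# Assumed 70B-class configuration, one 8192-token request.
total = kv_cache_bytes(80, 8, 128, seq_len=8192, batch=1)
per_gpu = kv_cache_bytes(80, 8, 128, seq_len=8192, batch=1, tp=4)
print(f"total: {total / 2**30:.2f} GiB, per GPU at TP=4: {per_gpu / 2**30:.3f} GiB")
```

The cache grows linearly with both sequence length and batch size, which is why it rather than the activations sets the serving batch limit.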
Hardware constraints
NVLink (intra-node, 900 GB/s): well suited to TP.
InfiniBand (400 Gbps NDR, about 50 GB/s per port) or NVLink Switch (inter-node): used for PP/DP.
HBM3e (192 GB on B200): fits larger models per GPU and reduces how much sharding is needed.
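The gap between the two interconnects can be put into numbers with a bandwidth-only ring all-reduce estimate. This is a rough sketch: it uses the nominal figures from the list above, ignores latency and protocol overhead, and the 64 MiB message size is a hypothetical value.

```python
def ring_allreduce_seconds(message_bytes: float, ranks: int,
                           bandwidth_bytes_per_s: float) -> float:
    """Bandwidth-only estimate: each rank sends 2*(N-1)/N of the message."""
    return 2 * (ranks - 1) / ranks * message_bytes / bandwidth_bytes_per_s


NVLINK = 900e9        # 900 GB/s, nominal intra-node figure
IB_NDR = 400e9 / 8    # 400 Gbps converted to bytes: 50 GB/s per port

msg = 64 * 2**20      # hypothetical 64 MiB tensor
t_nvlink = ring_allreduce_seconds(msg, 8, NVLINK)
t_ib = ring_allreduce_seconds(msg, 8, IB_NDR)
print(f"NVLink: {t_nvlink * 1e3:.3f} ms, InfiniBand: {t_ib * 1e3:.3f} ms")
print(f"ratio: {t_ib / t_nvlink:.0f}x")
```

Nodes with several NICs narrow this per-port gap in practice, but TP issues all-reduces inside every layer, so even a smaller gap matters there far more than for the less frequent DP gradient sync.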
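PyTorch DTensor tensor parallel
    import torch
    import torch.nn as nn
    from torch.distributed.tensor import DTensor, Shard, Replicate
    from torch.distributed.device_mesh import init_device_mesh

    DP, TP = 2, 4            # example sizes; DP * TP must equal the world size
    out, in_ = 4096, 4096    # example global layer dimensions
    mesh = init_device_mesh("cuda", (DP, TP), mesh_dim_names=("dp", "tp"))

    # Column-parallel linear: each TP rank holds out // TP rows of the (out, in_) weight.
    w_local = torch.empty(out // TP, in_, device="cuda")
    w_dt = nn.Parameter(DTensor.from_local(w_local, mesh["tp"], [Shard(0)]))
    # Forward: with x as a Replicate() DTensor, x @ w_dt.T is sharded on its last dim.
    # No all-reduce is needed here; it follows the paired row-parallel layer.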
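The difference between column-parallel and row-parallel layers, and where the all-reduce actually falls, can be checked without any GPU. The following pure-Python sketch simulates both splits on small integer matrices. Note that it uses the math convention Y = X W with W of shape (in, out), unlike the (out, in) layout of `nn.Linear`; all function names are illustrative.

```python
def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def column_parallel(x, w, tp):
    """Split W by output columns: each rank yields a slice of Y, then concatenate."""
    step = len(w[0]) // tp
    parts = [matmul(x, [row[r * step:(r + 1) * step] for row in w]) for r in range(tp)]
    return [sum((p[i] for p in parts), []) for i in range(len(x))]


def row_parallel(x, w, tp):
    """Split W by input rows: each rank yields a partial sum of Y, then all-reduce."""
    step = len(w) // tp
    parts = [
        matmul([xr[r * step:(r + 1) * step] for xr in x], w[r * step:(r + 1) * step])
        for r in range(tp)
    ]
    # The element-wise sum over ranks is what the all-reduce computes.
    return [[sum(p[i][j] for p in parts) for j in range(len(w[0]))] for i in range(len(x))]


x = [[1, 2, 3, 4], [5, 6, 7, 8]]
w = [[1, 0, 2, 1], [0, 1, 1, 3], [2, 1, 0, 1], [1, 2, 3, 0]]
reference = matmul(x, w)
print(column_parallel(x, w, 2) == reference, row_parallel(x, w, 2) == reference)
```

Pairing a column-parallel layer with a row-parallel one lets the sharded output of the first feed the second directly, leaving a single all-reduce per pair.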
MoE expert parallel (DeepSpeed-MoE / Megatron)
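    import torch.nn as nn

    # all_to_all_dispatch, local_experts and all_to_all_combine stand in for the
    # framework's EP primitives; this snippet does not define them.
    class MoELayer(nn.Module):
        def __init__(self, experts, d_model, ep_group, top_k=2):
            super().__init__()
            self.experts = nn.ModuleList(experts)   # sharded across the EP group
            self.gate = nn.Linear(d_model, len(experts))
            self.ep_group = ep_group
            self.k = top_k

        def forward(self, x):
            logits = self.gate(x)
            topk = logits.topk(self.k, dim=-1)
            # all-to-all: send each token to the rank that owns its selected experts
            dispatched = all_to_all_dispatch(x, topk.indices, self.ep_group)
            out = local_experts(dispatched)
            # all-to-all back, combining expert outputs with their gate values
            return all_to_all_combine(out, topk.indices, topk.values, self.ep_group)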
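What the dispatch step has to move can be illustrated with a small routing plan in plain Python. This is a sketch under stated assumptions: experts are placed contiguously (expert `e` lives on rank `e // experts_per_rank`), and the function names and gate logits are made up for illustration.

```python
def topk_indices(logits, k):
    """Indices of the k largest logits (ties broken by lower index)."""
    return sorted(range(len(logits)), key=lambda e: -logits[e])[:k]


def dispatch_plan(gate_logits, k, num_experts, ep_size):
    """Return send[rank] = list of (token_id, expert_id) pairs routed to that rank."""
    per_rank = num_experts // ep_size
    send = [[] for _ in range(ep_size)]
    for tok, logits in enumerate(gate_logits):
        for e in topk_indices(logits, k):
            send[e // per_rank].append((tok, e))
    return send


# Three tokens, four experts, two EP ranks, top-2 routing.
gate_logits = [
    [0.1, 2.0, 0.3, 1.5],
    [3.0, 0.2, 0.1, 0.4],
    [0.0, 0.1, 2.2, 2.1],
]
plan = dispatch_plan(gate_logits, k=2, num_experts=4, ep_size=2)
for rank, items in enumerate(plan):
    print(f"rank {rank} receives {items}")
```

In this example rank 1 receives twice as many token copies as rank 0, which is the kind of imbalance that makes the all-to-all and the expert kernels sensitive to routing.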
vLLM tensor-parallel inference
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="meta-llama/Llama-3.3-70B-Instruct",
        tensor_parallel_size=4,          # shard every layer across 4 GPUs
        pipeline_parallel_size=1,
        gpu_memory_utilization=0.92,
        enable_chunked_prefill=True,
        enable_prefix_caching=True,
    )
    out = llm.generate(["Hello"], SamplingParams(temperature=0.7, max_tokens=128))
Ring Attention (sequence parallel for 1M context)
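    import torch
    import torch.distributed as dist

    # flash_attn_partial, log_sum_exp_combine and ring_send_recv are placeholder
    # helpers that this snippet does not define.
    def ring_attention(q, k, v, sp_group):
        # each rank holds 1/N of the sequence
        out = torch.zeros_like(q)
        lse = torch.full(...)  # running log-sum-exp; arguments elided in the original
        k_blk, v_blk = k, v
        for step in range(dist.get_world_size(sp_group)):
            partial, l = flash_attn_partial(q, k_blk, v_blk)
            out, lse = log_sum_exp_combine(out, lse, partial, l)
            # pass the current KV block to the next rank and receive the previous one
            k_blk, v_blk = ring_send_recv(k_blk, v_blk, sp_group)
        return out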
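The log-sum-exp combine at the heart of the loop above can be verified on a single process. The sketch below simulates the ring in pure Python for non-causal attention and compares the result with attention over the full sequence. The function names, the toy data, and the choice of which block arrives at each step are assumptions of this sketch, not part of any library.

```python
import math


def attend_block(q, k_blk, v_blk):
    """One query against one KV block: returns (normalized output, log-sum-exp)."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in k_blk]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    out = [sum(wi * v[j] for wi, v in zip(w, v_blk)) / z for j in range(len(v_blk[0]))]
    return out, m + math.log(z)


def combine(out_a, lse_a, out_b, lse_b):
    """Merge two partial results, weighting each by its share of the softmax mass."""
    m = max(lse_a, lse_b)
    lse = m + math.log(math.exp(lse_a - m) + math.exp(lse_b - m))
    wa, wb = math.exp(lse_a - lse), math.exp(lse_b - lse)
    return [wa * a + wb * b for a, b in zip(out_a, out_b)], lse


def ring_attention_sim(q_shards, k_shards, v_shards):
    """Each rank keeps its queries and sees one KV block per ring step."""
    n = len(q_shards)
    results = []
    for rank in range(n):
        rank_out = []
        for q in q_shards[rank]:
            out, lse = None, None
            for step in range(n):
                src = (rank - step) % n  # block that reached this rank after `step` hops
                part, l = attend_block(q, k_shards[src], v_shards[src])
                out, lse = (part, l) if out is None else combine(out, lse, part, l)
            rank_out.append(out)
        results.append(rank_out)
    return results


def shard(rows, ranks):
    """Split a list of token rows into equal contiguous shards."""
    size = len(rows) // ranks
    return [rows[r * size:(r + 1) * size] for r in range(ranks)]


n_tok, d, ranks = 8, 4, 4
q = [[math.sin(i + j) for j in range(d)] for i in range(n_tok)]
k = [[math.cos(i * (j + 1)) for j in range(d)] for i in range(n_tok)]
v = [[float(i + j) for j in range(d)] for i in range(n_tok)]

ring = [o for rank_out in ring_attention_sim(shard(q, ranks), shard(k, ranks), shard(v, ranks))
        for o in rank_out]
full = [attend_block(qi, k, v)[0] for qi in q]
max_err = max(abs(a - b) for ro, fo in zip(ring, full) for a, b in zip(ro, fo))
print(f"max abs difference vs full attention: {max_err:.2e}")
```

Since the combine is exact up to floating-point rounding, the order in which blocks arrive does not change the result, which is what allows communication to overlap with compute.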
When to use: the model or the batch exceeds single-GPU memory, or the throughput/latency SLO requires multiple GPUs.
When not to use: the model fits on a single GPU and batch latency already meets the target. In that case multi-GPU overhead is a net loss.
❌ Anti-patterns
TP across nodes: InfiniBand is 5-10x slower than NVLink, so the per-layer all-reduces stall compute.
PP without DP: the number of micro-batches is bounded by the batch, which caps throughput.
MoE EP without all-to-all optimization: plain NCCL all-to-all performs poorly, and the experts need a GroupedGEMM kernel.
Sequence parallel without flash-attn: attention recomputation blows up.
Mixed precision without loss scaling (FP16): gradients underflow and the loss becomes NaN. Prefer BF16.
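The "PP without DP" item can be quantified with the usual pipeline-bubble estimate. The sketch below assumes a GPipe/1F1B-style schedule in which the idle fraction is (p - 1) / (m + p - 1) for p stages and m micro-batches; the function name and example sizes are illustrative.

```python
def bubble_fraction(stages: int, micro_batches: int) -> float:
    """Idle fraction of a simple pipeline schedule: (p - 1) / (m + p - 1)."""
    if stages < 1 or micro_batches < 1:
        raise ValueError("stages and micro_batches must be positive")
    return (stages - 1) / (micro_batches + stages - 1)


# Example: 8 pipeline stages with a growing number of micro-batches.
for m in (8, 16, 64):
    print(f"p=8, m={m}: {bubble_fraction(8, m):.1%} idle")
```

When the batch only allows a few micro-batches per pipeline, the bubble stays large, so raising the stage count alone does not scale throughput.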