---
id: wiki-2026-0508-parallel-computing-in-ai
title: Parallel Computing in AI
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [AI parallelism, distributed AI training, AI parallel computing]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [parallelism, ai, distributed, gpu, training]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: pytorch
---
# Parallel Computing in AI
## One-line summary
> **"Parallelism in AI is not just generic distributed computing; it is a combination problem across 4 axes (Data, Tensor, Pipeline, Expert) × 2 modes (training/inference)."** Llama 3 405B was trained on up to 16,384 GPUs; large MoE inference (GPT-5 style) typically combines expert sharding with speculative decoding. A mismatched parallel strategy collapses ROI.
## Core concepts
### The 4 axes
1. **Data Parallel (DP)**: split the batch across replicas and keep the weights in sync.
2. **Tensor Parallel (TP)**: split the weight matrices of a single layer.
3. **Pipeline Parallel (PP)**: split the layer stack into stages.
4. **Expert Parallel (EP)**: split the experts of an MoE model across devices.
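
Pipeline parallelism leaves stages idle while the pipeline fills and drains. As a rough sketch, the commonly used bubble estimate for a GPipe/1F1B-style schedule is (p - 1) / (m + p - 1) for p stages and m micro-batches. The function name below is illustrative, and the estimate assumes equal stage times and ignores communication.

```python
def pipeline_bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idle fraction of a GPipe/1F1B-style schedule: (p - 1) / (m + p - 1)."""
    if num_stages < 1 or num_microbatches < 1:
        raise ValueError("num_stages and num_microbatches must be >= 1")
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


# With 16 stages, the bubble only becomes small once there are many micro-batches.
for m in (4, 16, 64, 256):
    frac = pipeline_bubble_fraction(num_stages=16, num_microbatches=m)
    print(f"pp=16, micro-batches={m:3d}: bubble = {frac:.1%}")
```

This is why PP is usually paired with enough micro-batches per pipeline to amortize the fill and drain phases.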
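### Auxiliary axes
- **Sequence Parallel (SP)**: split the token dimension for long contexts (Ring Attention, Ulysses).
- **Context Parallel (CP)**: each rank holds a slice of the sequence and works on partial KV blocks during attention.
- **ZeRO sharding (DP variant)**: shard optimizer state, gradients, and parameters across the DP group.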
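
To make the ZeRO bullet concrete, here is a minimal sketch of per-GPU model-state memory under each ZeRO stage. It assumes the accounting used in the ZeRO paper for mixed-precision Adam (2 bytes per parameter for weights, 2 for gradients, 12 for fp32 optimizer state) and ignores activations and buffers; the function name is illustrative.

```python
def zero_model_state_gb(num_params: float, dp_size: int, stage: int) -> float:
    """Per-GPU model-state memory in GB (1e9 bytes) for mixed-precision Adam.

    stage 0 = plain DP, 1 = shard optimizer state, 2 = also shard gradients,
    3 = also shard parameters. Activations are not counted.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2 or 3")
    params, grads, optim = 2.0, 2.0, 12.0  # bytes per parameter
    if stage >= 1:
        optim /= dp_size
    if stage >= 2:
        grads /= dp_size
    if stage >= 3:
        params /= dp_size
    return num_params * (params + grads + optim) / 1e9


# Example: a 7.5B-parameter model on 64 data-parallel GPUs.
for stage in range(4):
    print(f"ZeRO-{stage}: {zero_model_state_gb(7.5e9, 64, stage):.1f} GB per GPU")
```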
### Training vs. inference
| Aspect | Training | Inference |
|---|---|---|
| Primary axes | DP + ZeRO | TP + EP |
| Batch | Large micro-batches | Small (often a single user) |
| Memory | Dominated by activations | Dominated by KV cache |
| Communication | all-reduce (gradients) | all-reduce (TP), all-to-all (MoE) |
| Tools | DeepSpeed, Megatron, FSDP | vLLM, TensorRT-LLM, SGLang |
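
The "KV cache" memory entry can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope calculator, not a measurement; the model shape (80 layers, 8 KV heads, head dimension 128) is an assumption chosen to resemble a Llama-3-70B-class model with grouped-query attention, and the function name is illustrative.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_elem=2, tp_size=1):
    """Per-GPU KV cache size in bytes.

    The leading factor 2 covers keys and values. Under tensor parallelism the
    KV heads are divided across ranks (at least one head stays on each rank).
    """
    heads_per_gpu = max(1, num_kv_heads // tp_size)
    return 2 * num_layers * heads_per_gpu * head_dim * seq_len * batch_size * bytes_per_elem


GIB = 2 ** 30
# Assumed 70B-class shape, bf16 cache, one 8192-token sequence.
full = kv_cache_bytes(80, 8, 128, seq_len=8192, batch_size=1)
per_gpu = kv_cache_bytes(80, 8, 128, seq_len=8192, batch_size=1, tp_size=4)
print(f"unsharded: {full / GIB:.2f} GiB, per GPU with TP=4: {per_gpu / GIB:.3f} GiB")
```

The cache grows linearly with both sequence length and batch size, which is why serving stacks budget GPU memory around it.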
### Hardware constraints
- **NVLink** (intra-node, 900 GB/s): well suited to TP.
- **InfiniBand / NVLink Switch** (inter-node, 400 Gb/s NDR): used for PP/DP.
- **HBM3e** (192 GB on B200): fits larger models and reduces how much sharding is needed.
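
To see why the interconnect decides which axis goes where, here is a rough sketch of ring all-reduce time using the headline bandwidths from the list above. It assumes the standard ring cost of 2(N - 1)/N times the message size per GPU, treats the quoted figures as achievable one-way bandwidth (which overstates real throughput), and ignores latency; the function name is illustrative.

```python
def ring_allreduce_seconds(message_bytes, world_size, bandwidth_bytes_per_s):
    """Bandwidth-only estimate of ring all-reduce time for one message."""
    if world_size < 2:
        return 0.0  # nothing to reduce across a single rank
    traffic = 2 * (world_size - 1) / world_size * message_bytes
    return traffic / bandwidth_bytes_per_s


NVLINK = 900e9        # 900 GB/s, intra-node
IB_NDR = 400e9 / 8    # 400 Gb/s = 50 GB/s per port, inter-node

msg = 2 ** 30  # 1 GiB of gradients or activations
t_nvlink = ring_allreduce_seconds(msg, 8, NVLINK)
t_ib = ring_allreduce_seconds(msg, 8, IB_NDR)
print(f"NVLink: {t_nvlink * 1e3:.2f} ms, InfiniBand: {t_ib * 1e3:.2f} ms, "
      f"ratio: {t_ib / t_nvlink:.0f}x")
```

TP issues an all-reduce inside every layer, so it is the axis most sensitive to this gap; DP and PP communicate less often and tolerate the slower link better.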
## 💻 Patterns
### FSDP2 (PyTorch 2.6+, 2026)
```python
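import torch
import torch.distributed as dist
from torch.distributed.fsdp import fully_shard, MixedPrecisionPolicy

# Launch with torchrun. build_transformer() and loader are placeholders defined elsewhere.
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = build_transformer().cuda()
mp = MixedPrecisionPolicy(param_dtype=torch.bfloat16, reduce_dtype=torch.float32)
for layer in model.layers:  # shard each transformer block first
    fully_shard(layer, mp_policy=mp)
fully_shard(model, mp_policy=mp)  # the root call picks up the remaining parameters

opt = torch.optim.AdamW(model.parameters(), lr=3e-4, fused=True)  # create after sharding
for batch in loader:
    loss = model(batch).loss
    loss.backward()
    opt.step()
    opt.zero_grad()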
```
### TP with DTensor / device_mesh
```python
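import torch
import torch.nn as nn
from torch.distributed.tensor import DTensor, Shard, Replicate
from torch.distributed.device_mesh import init_device_mesh

# DP, TP, out, and in_ are placeholders; DP * TP must equal the world size.
mesh = init_device_mesh("cuda", (DP, TP), mesh_dim_names=("dp", "tp"))
# Column-parallel linear: the (out, in_) weight is sharded along dim 0 (output features).
# from_local treats its input as this rank's shard, so the local shape is (out // TP, in_).
w_local = torch.empty(out // TP, in_, device="cuda")
w_dt = nn.Parameter(DTensor.from_local(w_local, mesh["tp"], [Shard(0)]))
# Forward: with x replicated over the TP group, x @ w_dt.T is sharded on its last dim (no all-reduce).
# The all-reduce comes after the paired row-parallel layer, whose output is a partial sum.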
```
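
The column-parallel / row-parallel pairing can be checked without any GPUs. The sketch below simulates the TP ranks in a single process with plain Python lists: the first weight is split by columns, the second by rows, and a final sum stands in for the all-reduce. All names are illustrative, and weights use the math convention `(d_in, d_out)` rather than `nn.Linear`'s `(out, in)` layout.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    return [[max(0, v) for v in row] for row in m]


def split_cols(m, parts):
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]


def split_rows(m, parts):
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]


def reference_mlp(x, w1, w2):
    return matmul(relu(matmul(x, w1)), w2)


def tp_mlp(x, w1, w2, tp):
    """w1 column-sharded, w2 row-sharded; one all-reduce (sum) at the end."""
    partials = []
    for w1_shard, w2_shard in zip(split_cols(w1, tp), split_rows(w2, tp)):
        h_shard = relu(matmul(x, w1_shard))          # column-parallel: no communication
        partials.append(matmul(h_shard, w2_shard))   # row-parallel: partial sums
    # Simulated all-reduce: element-wise sum of the per-rank partial outputs.
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


x = [[1, 2, 3, 4], [0, -1, 2, 1]]
w1 = [[1, 0, -1, 2], [0, 1, 1, -1], [2, -1, 0, 1], [1, 1, -2, 0]]
w2 = [[1, 2], [0, -1], [3, 1], [-2, 0]]
print(tp_mlp(x, w1, w2, tp=2) == reference_mlp(x, w1, w2))
```

Because the activation is element-wise, each rank can apply it to its own column shard, which is what lets the pair get away with a single all-reduce.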
### MoE expert parallel (DeepSpeed-MoE / Megatron)
```python
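import torch
import torch.nn as nn

# all_to_all_dispatch, local_experts, and all_to_all_combine are placeholder helpers
# (not defined here) standing in for the framework's EP communication and expert kernels.
class MoELayer(nn.Module):
    def __init__(self, d_model, experts, ep_group, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(experts)  # sharded across the EP group
        self.gate = nn.Linear(d_model, len(experts))
        self.ep_group = ep_group
        self.k = top_k

    def forward(self, x):
        logits = self.gate(x)
        topk = logits.topk(self.k, dim=-1)
        weights = torch.softmax(topk.values, dim=-1)  # normalize over the selected experts
        # all-to-all: send each token to the ranks that own its experts
        dispatched = all_to_all_dispatch(x, topk.indices, self.ep_group)
        out = local_experts(dispatched)
        # all-to-all back, then a weighted sum of the k expert outputs per token
        return all_to_all_combine(out, topk.indices, weights, self.ep_group)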
```
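
The routing step that precedes the all-to-all can be sketched in plain Python. The example below picks the top-k experts per token, normalizes the gate weights over those k experts (one common convention, assumed here), and groups the resulting assignments by the EP rank that owns each expert. Function names and the contiguous expert-to-rank layout are illustrative assumptions.

```python
import math


def top_k_route(logits, k):
    """Return [(expert_id, weight), ...] for one token: softmax over the top-k logits."""
    top = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]
    peak = max(logits[e] for e in top)
    exps = [math.exp(logits[e] - peak) for e in top]
    total = sum(exps)
    return [(e, w / total) for e, w in zip(top, exps)]


def dispatch_plan(token_logits, k, num_experts, ep_size):
    """Group (token, expert, weight) triples by the EP rank that owns each expert."""
    experts_per_rank = num_experts // ep_size  # experts laid out contiguously per rank
    plan = {rank: [] for rank in range(ep_size)}
    for tok, logits in enumerate(token_logits):
        for expert, weight in top_k_route(logits, k):
            plan[expert // experts_per_rank].append((tok, expert, weight))
    return plan


# Two tokens, four experts, two EP ranks (experts 0-1 on rank 0, experts 2-3 on rank 1).
token_logits = [[2.0, 1.0, 0.0, -1.0], [-1.0, 0.0, 3.0, 2.5]]
plan = dispatch_plan(token_logits, k=2, num_experts=4, ep_size=2)
for rank, items in plan.items():
    print(rank, [(tok, expert, round(weight, 3)) for tok, expert, weight in items])
```

The per-rank list lengths are what the all-to-all has to exchange, so skewed routing shows up directly as load imbalance between EP ranks.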
### vLLM tensor-parallel inference
```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    tensor_parallel_size=4,  # shard each layer across 4 GPUs
    pipeline_parallel_size=1,
    gpu_memory_utilization=0.92,
    enable_chunked_prefill=True,
    enable_prefix_caching=True,
)
out = llm.generate(["Hello"], SamplingParams(temperature=0.7, max_tokens=128))
print(out[0].outputs[0].text)
```
### Ring Attention (sequence parallel for 1M context)
```python
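import torch
import torch.distributed as dist

# flash_attn_partial, log_sum_exp_combine, and ring_send_recv are placeholder helpers (not defined here).
def ring_attention(q, k, v, sp_group):
    # Each rank holds 1/N of the sequence.
    out = torch.zeros_like(q)
    lse = torch.full(...)  # running log-sum-exp; arguments elided in the source
    k_blk, v_blk = k, v
    for step in range(dist.get_world_size(sp_group)):
        # Attend the local queries to the KV block currently held, then merge via log-sum-exp.
        partial, l = flash_attn_partial(q, k_blk, v_blk)
        out, lse = log_sum_exp_combine(out, lse, partial, l)
        # Pass the KV block to the next rank and receive the previous rank's block.
        k_blk, v_blk = ring_send_recv(k_blk, v_blk, sp_group)
    return out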
```
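
The log-sum-exp merge is the part of Ring Attention that is easiest to get wrong, so here is a minimal single-process sketch of it in plain Python. It simulates what one rank computes for a single query vector as KV blocks rotate around the ring, without causal masking, and can be compared against attention over the full sequence. All function names are illustrative and unrelated to any library API.

```python
import math


def attend_block(q, k_blk, v_blk):
    """Attention of one query over one KV block: (normalized output, log-sum-exp)."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    denom = sum(weights)
    dim = len(v_blk[0])
    out = [sum(w * v[d] for w, v in zip(weights, v_blk)) / denom for d in range(dim)]
    return out, peak + math.log(denom)


def combine(out_a, lse_a, out_b, lse_b):
    """Merge two partial attention results using their log-sum-exp values."""
    lse = max(lse_a, lse_b) + math.log1p(math.exp(-abs(lse_a - lse_b)))
    wa, wb = math.exp(lse_a - lse), math.exp(lse_b - lse)
    return [wa * a + wb * b for a, b in zip(out_a, out_b)], lse


def ring_attention_sim(q, kv_blocks, rank):
    """Result on `rank` after every KV block has passed through it once."""
    n = len(kv_blocks)
    out, lse = None, None
    for step in range(n):
        k_blk, v_blk = kv_blocks[(rank - step) % n]  # block held at this step
        part, part_lse = attend_block(q, k_blk, v_blk)
        out, lse = (part, part_lse) if out is None else combine(out, lse, part, part_lse)
    return out


q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5], [0.3, -0.7], [2.0, 0.1]]
values = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5], [-2.0, 1.0], [0.5, 0.5], [1.5, -0.5]]
kv_blocks = [(keys[i:i + 2], values[i:i + 2]) for i in range(0, 6, 2)]  # 3 simulated ranks
full, _ = attend_block(q, keys, values)
print(all(abs(a - b) < 1e-9 for a, b in zip(ring_attention_sim(q, kv_blocks, 0), full)))
```

The order in which blocks arrive differs per rank, but the merge is associative, so every rank ends with the same result as full attention.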
### 3D parallel mesh (PyTorch DeviceMesh, Megatron-style layout)
```python
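from torch.distributed.device_mesh import init_device_mesh

# DP is split into a replica dim and a shard dim, so the mesh has four dims.
# DP_replica, DP_shard, PP, and TP are placeholders; their product must equal the world size.
mesh = init_device_mesh(
    "cuda", (DP_replica, DP_shard, PP, TP),
    mesh_dim_names=("dp_r", "dp_s", "pp", "tp"),
)
# Llama 3 405B: dp_r=16, dp_s=8, pp=16, tp=8 → 16 × 8 × 16 × 8 = 16384 GPUs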
```
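
To see how global ranks land on such a mesh, here is a minimal sketch that maps a rank to its coordinates. It assumes the row-major layout that `init_device_mesh` uses by default (the last mesh dimension varies fastest), so consecutive ranks share a TP group; the helper name and the hard-coded shape are illustrative.

```python
from math import prod

# Mesh shape from the example above: 16 x 8 x 16 x 8.
MESH_SHAPE = {"dp_r": 16, "dp_s": 8, "pp": 16, "tp": 8}


def rank_to_coords(rank, shape=MESH_SHAPE):
    """Map a global rank to mesh coordinates, last dimension varying fastest."""
    if not 0 <= rank < prod(shape.values()):
        raise ValueError("rank out of range")
    coords = {}
    for name in reversed(list(shape)):
        rank, coords[name] = divmod(rank, shape[name])
    return {name: coords[name] for name in shape}  # restore declaration order


print("world size:", prod(MESH_SHAPE.values()))
for r in (0, 7, 8, 128, 16383):
    print(r, rank_to_coords(r))
```

With 8 GPUs per node, this layout keeps each TP group inside one node, which is the placement the NVLink constraint calls for.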
## Decision criteria
| Situation | Approach |
|---|---|
| ≤ 8B model, 1 node | DP + ZeRO-2 |
| 70B, multi-node | FSDP2 (DP shard) + TP intra-node |
| 405B+ | TP × PP × DP (3D) |
| MoE (Mixtral, GPT-5 style) | + EP (all-to-all) |
| 1M+ context | + SP (Ring/Ulysses) |
| Inference 1 user | TP only, no DP |
| Inference batch | TP + continuous batching (vLLM) |
| MoE inference | EP + speculative decoding |
**Defaults**: training = FSDP2 + TP; inference = vLLM with TP.
## 🔗 Graph
- Parent: [[Distributed Computing]] · [[High Performance Computing]]
- Variants: [[Data Parallelism]] · [[Tensor Parallelism]] · [[Pipeline Parallelism]] · [[Expert Parallelism]]
- Applications: [[LLM Training]] · [[vLLM]] · [[DeepSpeed]] · [[Megatron-LM]]
- Adjacent: [[ZeRO Optimizer]] · [[FSDP]] · [[Ring Attention]] · [[Mixture of Experts]]
## 🤖 LLM usage
**When to use**: the model or batch exceeds single-GPU memory, or the throughput/latency SLO requires multiple GPUs.
**When not to use**: the model fits on a single GPU and batch latency is acceptable; multi-GPU overhead is then a net loss.
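## ❌ Anti-patterns
- **TP across nodes**: InfiniBand is roughly an order of magnitude slower than NVLink per GPU, so the per-layer TP all-reduces stall.
- **PP without DP**: the batch limits how many micro-batches are available, which caps throughput.
- **MoE EP without all-to-all optimization**: plain NCCL all-to-all performs poorly; GroupedGEMM kernels are needed for the expert compute.
- **Sequence parallel without flash-attn**: attention recomputation cost explodes.
- **Mixed precision without loss scaling** (FP16): small gradients underflow and the loss can go NaN. Prefer BF16.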
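
The FP16 underflow in the last bullet can be reproduced with the standard library alone. The sketch below rounds a value to IEEE half precision via `struct`'s `"e"` format and shows that a small gradient flushes to zero unless it is scaled up first; the gradient value and the static scale of 1024 are illustrative assumptions, and real trainers usually adjust the scale dynamically.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


grad = 1e-8      # true gradient, below FP16's smallest subnormal (about 6e-8)
scale = 1024.0   # hypothetical static loss scale

unscaled = to_fp16(grad)                 # stored directly in FP16
scaled = to_fp16(grad * scale) / scale   # scaled before the FP16 cast, unscaled afterwards

print("without loss scaling:", unscaled)
print("with loss scaling:   ", scaled)
```

BF16 keeps FP32's exponent range, so values of this magnitude stay representable without scaling, which is the reason behind the BF16 recommendation.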
## 🧪 Verification / duplicates
- Verified (Megatron-LM paper, FSDP2 docs 2026, vLLM docs, DeepSpeed-MoE paper).
- Trust level: A.
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — 4-axis parallelism + training/inference split |