---
id: wiki-2026-0508-pipeline-parallelism
title: Pipeline Parallelism
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [PP, GPipe, 1F1B]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [parallelism, distributed-training, deep-learning, llm]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: pytorch
---
# Pipeline Parallelism
## In one line
> **"Slice the model layer-wise and let micro-batches flow through a pipeline of GPUs."** Started with GPipe (2018) and evolved through PipeDream / 1F1B / Interleaved 1F1B. In 2026-era LLM training (>100B params) it is one axis of the TP+PP+DP combination.
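## Core
### Why PP
- The model exceeds a single GPU's memory (HBM3, 80-192 GB) → splitting by layer is required.
- Tensor Parallelism needs NVLink-class high bandwidth → it hits a limit across nodes.
- Pipeline Parallelism sends only activations (and their gradients in the backward pass) between stages → inter-node links are OK.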
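
To make the memory argument concrete, here is a rough back-of-the-envelope sketch. It assumes mixed-precision Adam at about 16 bytes per parameter (fp16 weights and gradients plus fp32 master weights and two optimizer moments) and ignores activations entirely. The helper names are hypothetical.

```python
import math

# Assumption: 2 (fp16 weight) + 2 (fp16 grad) + 12 (fp32 master copy + Adam m and v)
BYTES_PER_PARAM = 16


def model_state_gb(num_params: float) -> float:
    """Memory for weights, gradients and optimizer state, in GB (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM / 1e9


def min_gpus_for_states(num_params: float, gpu_mem_gb: float) -> int:
    """Lower bound on GPUs needed just to hold model state (activations ignored)."""
    return math.ceil(model_state_gb(num_params) / gpu_mem_gb)


print(model_state_gb(100e9))             # 1600.0
print(min_gpus_for_states(100e9, 80))    # 20
print(min_gpus_for_states(100e9, 192))   # 9
```

Under these assumptions a 100B-parameter model needs well over a single node's worth of 80 GB GPUs before any activation is stored, which is why the model has to be split at all.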
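### Stages / micro-batches
- Stage = a group of consecutive layers, occupying one GPU.
- The mini-batch is split into K micro-batches → different stages process different micro-batches at the same time.
- Bubble = idle time while the pipeline fills and drains. Bubble overhead relative to ideal compute ≈ (stages - 1) / K; as a share of the whole step it is (stages - 1) / (K + stages - 1).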
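
The bubble arithmetic can be sketched as two small functions. This assumes every stage takes the same time per micro-batch and that communication is free, so it is an idealised estimate rather than a measurement.

```python
def bubble_overhead(stages: int, microbatches: int) -> float:
    """Idle time relative to ideal compute time: (S - 1) / K."""
    return (stages - 1) / microbatches


def bubble_fraction(stages: int, microbatches: int) -> float:
    """Idle time as a share of the whole step: (S - 1) / (K + S - 1)."""
    return (stages - 1) / (microbatches + stages - 1)


# With 4 stages, more micro-batches shrink the bubble quickly.
for k in (1, 4, 16, 64):
    print(k, round(bubble_overhead(4, k), 3), round(bubble_fraction(4, k), 3))
```

With K=1 the step is 75% idle on 4 stages; at K=64 the idle share drops below 5%.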
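### Schedule families
1. **GPipe**: all forwards → all backwards. Simple, but activations for all K micro-batches stay live, which caps K and leaves a large bubble in practice.
2. **1F1B (PipeDream)**: alternates 1 forward and 1 backward. Cuts activation memory, since at most `stages` micro-batches are in flight.
3. **Interleaved 1F1B (Megatron)**: several model chunks per stage → smaller bubble.
4. **Zero Bubble PP (2024)**: splits backward into B (input gradient) and W (weight gradient) → near-zero bubble.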
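
The trade-offs between the first three schedules can be compared numerically. This sketch assumes the idealised formulas from the Megatron-LM interleaving analysis (bubble overhead (S - 1) / (v · K) with v chunks per stage) and the usual in-flight counts for GPipe and non-interleaved 1F1B. The function names are hypothetical.

```python
def bubble_overhead_interleaved(stages: int, microbatches: int, v_chunks: int = 1) -> float:
    """Bubble relative to ideal compute; v_chunks=1 covers GPipe and plain 1F1B."""
    return (stages - 1) / (v_chunks * microbatches)


def peak_in_flight(schedule: str, stage: int, stages: int, microbatches: int) -> int:
    """Micro-batches whose activations a stage must hold at its peak."""
    if schedule == "gpipe":
        return microbatches                        # every forward runs before any backward
    if schedule == "1f1b":
        return min(stages - stage, microbatches)   # earlier stages hold more
    raise ValueError(f"unknown schedule: {schedule}")


S, K = 4, 8
print(bubble_overhead_interleaved(S, K))        # 0.375 (GPipe and 1F1B)
print(bubble_overhead_interleaved(S, K, 2))     # 0.1875 (interleaved, 2 chunks per stage)
print(peak_in_flight("gpipe", 0, S, K))         # 8
print(peak_in_flight("1f1b", 0, S, K))          # 4
```

For a fixed K, 1F1B matches GPipe's bubble but lowers memory, while interleaving lowers the bubble at the cost of more communication between stages.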
## 💻 Patterns
### PyTorch native PipelineStage (torch.distributed.pipelining)
```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.pipelining import pipeline, ScheduleGPipe, SplitPoint


class Block(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.l = nn.Linear(d, d)

    def forward(self, x):
        return torch.relu(self.l(x))


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.b1 = Block(1024); self.b2 = Block(1024)
        self.b3 = Block(1024); self.b4 = Block(1024)

    def forward(self, x):
        return self.b4(self.b3(self.b2(self.b1(x))))


# Assumes a 2-process launch (e.g. torchrun --nproc-per-node=2), one stage per rank.
dist.init_process_group(backend="nccl")
rank = dist.get_rank()

model = Net()
example = torch.randn(8, 1024)  # shape of ONE micro-batch, used to trace the model
pipe = pipeline(
    model, mb_args=(example,),
    split_spec={"b3": SplitPoint.BEGINNING},  # stage0: b1-b2, stage1: b3-b4
)
stage = pipe.build_stage(stage_index=rank, device=torch.device(f"cuda:{rank}"))
sched = ScheduleGPipe(stage, n_microbatches=4, loss_fn=nn.MSELoss())
```
### Computing a 1F1B schedule
```python
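def schedule_1f1b(num_stages: int, num_microbatches: int):
    """Emit the forward/backward order for each stage as ("F"|"B", micro-batch) tuples."""
    ops = [[] for _ in range(num_stages)]
    for s in range(num_stages):
        # Warmup: earlier stages run more forwards before their first backward.
        # The last stage has none, so it runs F0, B0, F1, B1, ...
        n_warm = min(num_stages - 1 - s, num_microbatches)
        for mb in range(n_warm):
            ops[s].append(("F", mb))
        # Steady state: one forward, then one backward (1F1B).
        for mb in range(num_microbatches - n_warm):
            ops[s].append(("F", n_warm + mb))
            ops[s].append(("B", mb))
        # Cooldown: drain the remaining backwards.
        for mb in range(num_microbatches - n_warm, num_microbatches):
            ops[s].append(("B", mb))
    return ops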
```
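
A schedule in this `("F"|"B", micro-batch)` format can be checked with a tiny timing simulator. This is only a sketch: it assumes zero communication cost, identical stages, and a backward pass that costs twice a forward. The `simulate` helper and the hand-written 2-stage schedule are illustrative, not part of any library.

```python
def simulate(ops, f_cost=1.0, b_cost=2.0):
    """Return the makespan of per-stage op lists of ("F"|"B", micro-batch) tuples."""
    num_stages = len(ops)
    done = {}                    # (kind, stage, mb) -> finish time
    cursor = [0] * num_stages    # index of the next op on each stage
    clock = [0.0] * num_stages   # time at which each stage becomes free
    progressed = True
    while progressed:
        progressed = False
        for s in range(num_stages):
            while cursor[s] < len(ops[s]):
                kind, mb = ops[s][cursor[s]]
                if kind == "F":
                    # Forward needs the previous stage's output (none for stage 0).
                    dep = ("F", s - 1, mb) if s > 0 else None
                elif s < num_stages - 1:
                    # Backward needs the next stage's input gradient.
                    dep = ("B", s + 1, mb)
                else:
                    # Last stage: backward only needs its own forward.
                    dep = ("F", s, mb)
                if dep is not None and dep not in done:
                    break  # blocked until another stage makes progress
                start = max(clock[s], done.get(dep, 0.0))
                clock[s] = start + (f_cost if kind == "F" else b_cost)
                done[(kind, s, mb)] = clock[s]
                cursor[s] += 1
                progressed = True
    if any(cursor[s] < len(ops[s]) for s in range(num_stages)):
        raise ValueError("schedule deadlocks: some op waits on work that never runs")
    return max(clock)


# Hand-written 1F1B order for 2 stages and 3 micro-batches.
ops = [
    [("F", 0), ("F", 1), ("B", 0), ("F", 2), ("B", 1), ("B", 2)],  # stage 0
    [("F", 0), ("B", 0), ("F", 1), ("B", 1), ("F", 2), ("B", 2)],  # stage 1
]
makespan = simulate(ops)
ideal = 3 * (1.0 + 2.0)                  # K * (forward + backward) per stage
bubble = (makespan - ideal) / ideal
print(makespan, round(bubble, 3))        # 12.0 0.333
```

The simulated overhead of 1/3 matches (stages - 1) / K for 2 stages and 3 micro-batches, which is a quick sanity check for any generated schedule.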
### Megatron-LM virtual pipeline
```python
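from megatron.core.transformer.transformer_config import TransformerConfig

# 32 layers / (PP=4 × v_chunks=2) = 8 chunks of 4 layers, assigned round-robin:
# stage0 holds {0-3, 16-19}, stage1 {4-7, 20-23}, stage2 {8-11, 24-27}, stage3 {12-15, 28-31}
config = TransformerConfig(
    num_layers=32, hidden_size=8192,
    num_attention_heads=64,  # illustrative value; the config needs a head count
    pipeline_model_parallel_size=4,
    virtual_pipeline_model_parallel_size=2,  # interleaved chunks
)
# Assumption: the micro-batch count (64 in this example) is derived by Megatron-LM from
# global_batch_size / (micro_batch_size × DP) rather than passed to this config.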
```
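
The round-robin chunk assignment behind interleaving can be sketched in plain Python. The helper `interleaved_layer_map` is hypothetical (not a Megatron API) and assumes the layer count divides evenly into `pp_size * v_chunks` chunks.

```python
def interleaved_layer_map(num_layers: int, pp_size: int, v_chunks: int):
    """Map each stage to its (first_layer, last_layer) ranges, one per virtual chunk."""
    total_chunks = pp_size * v_chunks
    if num_layers % total_chunks:
        raise ValueError("num_layers must be divisible by pp_size * v_chunks")
    per_chunk = num_layers // total_chunks
    layout = {}
    for stage in range(pp_size):
        ranges = []
        for v in range(v_chunks):
            # Chunk index v * pp_size + stage: chunks are dealt out to stages in turn.
            first = (v * pp_size + stage) * per_chunk
            ranges.append((first, first + per_chunk - 1))
        layout[stage] = ranges
    return layout


print(interleaved_layer_map(32, 2, 2))
# {0: [(0, 7), (16, 23)], 1: [(8, 15), (24, 31)]}
print(interleaved_layer_map(32, 4, 2)[0])
# [(0, 3), (16, 19)]
```

Because each stage owns non-adjacent chunks, a micro-batch visits every GPU `v_chunks` times per forward pass, which is what shortens the fill and drain phases.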
### Activation recompute (relieves activation memory so K can grow)
```python
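import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class CheckpointedBlock(nn.Module):
    def __init__(self, norm: nn.Module, attn: nn.Module):
        super().__init__()
        self.norm, self.attn = norm, attn  # any modules mapping x to a same-shape tensor

    def forward(self, x):
        # Intermediate activations are dropped here and recomputed during backward.
        return checkpoint(self._fwd, x, use_reentrant=False)

    def _fwd(self, x):
        return self.attn(self.norm(x)) + x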
```
### DeepSpeed PipelineModule
```python
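import deepspeed
from deepspeed.pipe import PipelineModule, LayerSpec

# Block, ds_config and data_iter are assumed to be defined elsewhere.
deepspeed.init_distributed()  # PipelineModule needs the process group to exist
specs = [LayerSpec(Block, 1024) for _ in range(8)]  # layers are built lazily per stage
model = PipelineModule(layers=specs, num_stages=4, partition_method="uniform")
engine, _, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
loss = engine.train_batch(data_iter=data_iter)  # runs all micro-batches of one batch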
```
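
A matching `ds_config` could look like the sketch below. The values are illustrative. It assumes 8 GPUs split into 4 pipeline stages (so a data-parallel size of 2) and relies on the DeepSpeed convention that `train_batch()` consumes `gradient_accumulation_steps` micro-batches per call, so that setting plays the role of K.

```python
dp_world_size = 2  # assumption: 8 GPUs / 4 pipeline stages

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 16,   # micro-batches per train_batch() call (K)
    "train_batch_size": 4 * 16 * dp_world_size,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "fp16": {"enabled": True},
}


def batch_sizes_consistent(cfg: dict, dp: int) -> bool:
    """Check train_batch_size == micro_batch * grad_accum * data-parallel size."""
    expected = (
        cfg["train_micro_batch_size_per_gpu"]
        * cfg["gradient_accumulation_steps"]
        * dp
    )
    return cfg["train_batch_size"] == expected


print(ds_config["train_batch_size"], batch_sizes_consistent(ds_config, dp_world_size))
# 128 True
```

With 4 stages and K=16 the idealised bubble overhead is 3/16, so raising `gradient_accumulation_steps` is the main lever for pipeline efficiency in this setup.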
### 3D parallelism (TP × PP × DP)
```python
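from torch.distributed.device_mesh import init_device_mesh

# Conventional Megatron / NeMo layout: world_size = TP × PP × DP
# Llama 3 405B training: TP=8, PP=16, DP=128 → 16384 GPUs
DP, PP, TP = 128, 16, 8
# TP is the innermost (fastest-varying) dim, so each TP group maps to consecutive ranks.
mesh = init_device_mesh("cuda", (DP, PP, TP), mesh_dim_names=("dp", "pp", "tp"))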
```
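
How a global rank maps onto such a mesh can be sketched without any GPUs. This assumes the row-major `(DP, PP, TP)` shape used in the snippet above, with TP varying fastest. The `rank_to_coords` helper is hypothetical.

```python
def rank_to_coords(rank: int, dp: int, pp: int, tp: int):
    """Return (dp_idx, pp_idx, tp_idx) for a row-major (dp, pp, tp) mesh."""
    if not 0 <= rank < dp * pp * tp:
        raise ValueError("rank out of range")
    tp_idx = rank % tp                # fastest-varying: neighbours share a TP group
    pp_idx = (rank // tp) % pp        # next: pipeline stage
    dp_idx = rank // (tp * pp)        # slowest: data-parallel replica
    return dp_idx, pp_idx, tp_idx


DP, PP, TP = 128, 16, 8
print(rank_to_coords(7, DP, PP, TP))       # (0, 0, 7)
print(rank_to_coords(8, DP, PP, TP))       # (0, 1, 0)
print(rank_to_coords(128, DP, PP, TP))     # (1, 0, 0)
print(rank_to_coords(16383, DP, PP, TP))   # (127, 15, 7)
```

Ranks 0 to 7 form one TP group, so with 8 GPUs per node the bandwidth-hungry TP traffic stays on NVLink while PP and DP traffic crosses nodes.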
## Decision criteria
| Situation | Approach |
|---|---|
| Single node, ≤8 GPUs | TP only (NVLink) |
| Multi-node, model > node memory | TP intra-node + PP inter-node |
| 100B+ params | TP × PP × DP (3D) |
| Inference latency matters | TP > PP (PP's bubble costs latency) |
| Throughput-oriented training | PP + DP, large micro-batch count |
**Default**: LLM training uses 1F1B + activation recompute + 3D parallelism.
## 🔗 Graph
- Parent: [[Distributed Training]] · [[Model Parallelism]]
- Variants: [[Tensor Parallelism]] · [[Sequence Parallelism]] · [[Zero Bubble Pipeline]]
- Applications: [[Megatron-LM]] · [[DeepSpeed]] · [[LLM Training]]
- Adjacent: [[Data Parallelism]] · [[ZeRO Optimizer]] · [[FSDP]]
## 🤖 LLM usage
**When**: model weights exceed a single GPU's memory and training spans multiple nodes; cross-node bandwidth is too low for TP.
**When not**: the model fits within a single node; the batch is very small (the bubble ratio explodes); inference is latency-critical.
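## ❌ Anti-patterns
- **Ignoring the bubble**: with K=1 micro-batch, each GPU sits idle for (stages-1)/stages of the step.
- **Uneven partition**: unbalanced FLOPs across stages → the slowest stage sets throughput.
- **PP only, no DP**: raising K alone is capped by the batch size → DP must be used alongside.
- **Ignoring embedding placement**: with tied weights, input and output embeddings on different stages need an explicit gradient sync; placing them on the same stage keeps tied-weight sync simple.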
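
The uneven-partition problem can be tackled by splitting layers on estimated cost instead of layer count. The sketch below is a plain dynamic program over contiguous splits. The per-layer costs are made-up numbers and `balanced_partition` is a hypothetical helper, not a library function.

```python
from itertools import accumulate


def balanced_partition(costs, num_stages):
    """Split `costs` into contiguous stages minimising the slowest stage.

    Returns a list of (start, end) layer index ranges, end exclusive.
    """
    n = len(costs)
    if not 1 <= num_stages <= n:
        raise ValueError("need 1 <= num_stages <= number of layers")
    prefix = [0, *accumulate(costs)]
    inf = float("inf")
    # best[k][i]: smallest achievable bottleneck for the first i layers in k stages
    best = [[inf] * (n + 1) for _ in range(num_stages + 1)]
    cut = [[0] * (n + 1) for _ in range(num_stages + 1)]
    best[0][0] = 0
    for k in range(1, num_stages + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):
                cand = max(best[k - 1][j], prefix[i] - prefix[j])
                if cand < best[k][i]:
                    best[k][i], cut[k][i] = cand, j
    # Walk the recorded cut points backwards to recover the stage boundaries.
    bounds, i = [], n
    for k in range(num_stages, 0, -1):
        j = cut[k][i]
        bounds.append((j, i))
        i = j
    return bounds[::-1]


# Illustrative costs: heavy embedding and output layers at both ends.
costs = [3, 1, 1, 1, 1, 1, 1, 3]
stages = balanced_partition(costs, 4)
print(stages)                                    # [(0, 1), (1, 4), (4, 7), (7, 8)]
print(max(sum(costs[a:b]) for a, b in stages))   # 3
```

A uniform split of two layers per stage would put cost 4 on the first and last stages, so the cost-aware split improves the bottleneck from 4 to 3 in this toy case.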
## 🧪 Verification / duplicates
- Verified (Megatron-LM paper, GPipe, PipeDream, PyTorch pipelining docs 2026).
- Trust level: A.
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — PP schedules + 3D parallel patterns |