---
id: wiki-2026-0508-pipeline-parallelism
title: Pipeline Parallelism
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [PP, GPipe, 1F1B]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [parallelism, distributed-training, deep-learning, llm]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: pytorch
---
# Pipeline Parallelism
## In one line
> **"Split the model layer-wise and let micro-batches flow through a pipeline of GPUs."** Started with GPipe (2018) and evolved through PipeDream / 1F1B / Interleaved 1F1B. In 2026 LLM training (>100B params), it is one axis of the TP+PP+DP combination.
## Core concepts
### Why PP
- The model exceeds a single GPU's memory (HBM3, 80 to 192 GB), so it must be split across layers.
- Tensor Parallelism needs high-bandwidth NVLink inside a node, so it hits a limit across nodes.
- Pipeline Parallelism passes only activations between stages, so it works across nodes.
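A rough sketch of the first point: a lower bound on how many devices are needed just to hold the weights. The helper name `min_devices_for_weights`, the 2 bytes per parameter (bf16), and the 0.9 usable-memory fraction are illustrative assumptions. Optimizer state and activations are ignored, so real requirements are higher.

```python
import math


def min_devices_for_weights(n_params: float, gpu_mem_gb: float,
                            bytes_per_param: int = 2,
                            usable_fraction: float = 0.9) -> int:
    """Lower bound on devices needed to hold the weights alone (hypothetical helper)."""
    weight_gb = n_params * bytes_per_param / 1e9
    return max(1, math.ceil(weight_gb / (gpu_mem_gb * usable_fraction)))


# 100B params in bf16 is about 200 GB of weights, more than one 80 GB GPU can hold.
print(min_devices_for_weights(100e9, 80))
print(min_devices_for_weights(7e9, 80))
```

Whether those devices are used as TP ranks or PP stages is a separate choice; the estimate only shows when some form of model splitting becomes unavoidable.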
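### Stages / micro-batches
- Stage = a group of consecutive layers, occupying one GPU.
- The mini-batch is split into K micro-batches, which different stages process at the same time.
- Bubble = idle time. Bubble overhead relative to ideal compute time ≈ (stages - 1) / K; as a share of total step time it is (stages - 1) / (K + stages - 1).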
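The bubble ratio can be read two ways, and a small sketch makes the difference concrete. It assumes a non-interleaved schedule where every stage takes the same time per micro-batch; the function names are illustrative.

```python
def bubble_overhead(stages: int, microbatches: int) -> float:
    """Idle time relative to ideal compute time: (p - 1) / K."""
    return (stages - 1) / microbatches


def bubble_fraction(stages: int, microbatches: int) -> float:
    """Idle share of the total step time: (p - 1) / (K + p - 1)."""
    return (stages - 1) / (microbatches + stages - 1)


# With 4 stages, raising K shrinks the bubble quickly.
for k in (1, 4, 16, 64):
    print(k, round(bubble_overhead(4, k), 3), round(bubble_fraction(4, k), 3))
```

For large K the two numbers converge, which is why (stages - 1) / K is the usual rule of thumb.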
### Schedule families
1. **GPipe**: all forwards, then all backwards. Simple, but a large bubble.
2. **1F1B (PipeDream)**: alternates one forward and one backward. Reduces activation memory.
3. **Interleaved 1F1B (Megatron)**: several model chunks per stage → smaller bubble.
4. **Zero Bubble PP (2024)**: splits backward into B (input gradient) and W (weight gradient) → near-zero bubble.
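The activation-memory difference between the first two schedules can be sketched by counting how many micro-batches each stage must keep activations for at its peak. This assumes the standard non-interleaved forms of both schedules; `peak_in_flight` is an illustrative name.

```python
def peak_in_flight(schedule: str, stages: int, microbatches: int) -> list[int]:
    """Peak number of micro-batches whose activations each stage must hold."""
    if schedule == "gpipe":
        # Every forward finishes before any backward starts.
        return [microbatches] * stages
    if schedule == "1f1b":
        # Stage s runs (stages - 1 - s) warmup forwards, then one more
        # forward before its first backward frees an activation.
        return [min(stages - s, microbatches) for s in range(stages)]
    raise ValueError(f"unknown schedule: {schedule}")


print(peak_in_flight("gpipe", 4, 16))
print(peak_in_flight("1f1b", 4, 16))
```

Under GPipe the peak grows with K, while under 1F1B it is bounded by the number of stages, so K can be raised to shrink the bubble without raising activation memory.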
## 💻 Patterns
### PyTorch native PipelineStage (torch.distributed.pipelining)
```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.pipelining import pipeline, ScheduleGPipe, SplitPoint


class Block(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.l = nn.Linear(d, d)

    def forward(self, x):
        return torch.relu(self.l(x))


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.b1 = Block(1024); self.b2 = Block(1024)
        self.b3 = Block(1024); self.b4 = Block(1024)

    def forward(self, x):
        return self.b4(self.b3(self.b2(self.b1(x))))


# Assumes a torchrun launch with 2 ranks, one per stage.
dist.init_process_group()
rank = dist.get_rank()

model = Net()
example = torch.randn(8, 1024)  # example input with the shape of one micro-batch
pipe = pipeline(
    model, mb_args=(example,),
    split_spec={"b3": SplitPoint.BEGINNING},  # stage0: b1-b2, stage1: b3-b4
)
stage = pipe.build_stage(stage_index=rank, device=torch.device(f"cuda:{rank}"))
sched = ScheduleGPipe(stage, n_microbatches=4, loss_fn=nn.MSELoss())
```
### Computing a 1F1B schedule
```python
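def schedule_1f1b(num_stages: int, num_microbatches: int):
    """Emit the per-stage order of forward ("F") and backward ("B") ops."""
    ops = [[] for _ in range(num_stages)]
    for s in range(num_stages):
        # Warmup: earlier stages run more forwards before their first backward.
        # The last stage has no warmup and starts directly with F0, B0.
        n_warm = min(num_stages - 1 - s, num_microbatches)
        for mb in range(n_warm):
            ops[s].append(("F", mb))
        # Steady state: one forward, then one backward.
        for mb in range(num_microbatches - n_warm):
            ops[s].append(("F", n_warm + mb))
            ops[s].append(("B", mb))
        # Cooldown: drain the remaining backwards.
        for mb in range(num_microbatches - n_warm, num_microbatches):
            ops[s].append(("B", mb))
    return ops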
```
### Megatron-LM virtual pipeline
```python
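from megatron.core.transformer.transformer_config import TransformerConfig

# PP=4, v_chunks=2 → 8 chunks of 4 layers each, assigned round-robin:
# stage0 holds {layers 0-3, 16-19}, stage1 holds {4-7, 20-23}, ...
# Sketch only: other required fields (e.g. num_attention_heads) are omitted.
config = TransformerConfig(
    num_layers=32, hidden_size=8192,
    pipeline_model_parallel_size=4,
    virtual_pipeline_model_parallel_size=2,  # interleaved chunks
)
# num_microbatches (64 here) is, as far as we know, passed to the
# forward/backward schedule function rather than set on the config.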
```
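The layer-to-stage assignment behind interleaving can be sketched in plain Python. This assumes round-robin chunk assignment (chunk `c` goes to stage `c % pp`) and a layer count divisible by `pp * v`; `interleaved_layers` is an illustrative helper, not a Megatron-LM function.

```python
def interleaved_layers(num_layers: int, pp: int, v: int) -> dict[int, list[range]]:
    """Map each pipeline stage to its layer chunks under round-robin interleaving."""
    n_chunks = pp * v
    if num_layers % n_chunks:
        raise ValueError("num_layers must be divisible by pp * v")
    per_chunk = num_layers // n_chunks
    layout: dict[int, list[range]] = {s: [] for s in range(pp)}
    for chunk in range(n_chunks):
        start = chunk * per_chunk
        layout[chunk % pp].append(range(start, start + per_chunk))
    return layout


layout = interleaved_layers(32, pp=4, v=2)
for stage, chunks in layout.items():
    # Print inclusive (first, last) layer indices for each chunk.
    print(stage, [(c.start, c.stop - 1) for c in chunks])
```

Each stage now handles several small chunks instead of one large one, so the warmup and cooldown phases are shorter, at the cost of more stage-to-stage communication.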
### Activation recompute (eases activation memory pressure)
```python
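import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedBlock(nn.Module):
    def __init__(self, d, attn):
        super().__init__()
        self.norm = nn.LayerNorm(d)
        self.attn = attn  # any module mapping (..., d) -> (..., d)

    def forward(self, x):
        # Activations inside _fwd are dropped and recomputed during backward.
        return checkpoint(self._fwd, x, use_reentrant=False)

    def _fwd(self, x):
        return self.attn(self.norm(x)) + x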
```
### DeepSpeed PipelineModule
```python
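import deepspeed
from deepspeed.pipe import PipelineModule, LayerSpec

# Block, ds_config, and data_iter are assumed to be defined elsewhere.
deepspeed.init_distributed()  # PipelineModule needs an initialized process group
specs = [LayerSpec(Block, 1024) for _ in range(8)]  # layers are built lazily per stage
model = PipelineModule(layers=specs, num_stages=4, partition_method="uniform")
engine, _, _, _ = deepspeed.initialize(model=model, config=ds_config)
loss = engine.train_batch(data_iter=data_iter)  # one full pipelined step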
```
### 3D parallelism (TP × PP × DP)
```python
from torch.distributed.device_mesh import init_device_mesh

# Conventional Megatron / NeMo layout
# world_size = TP × PP × DP
# Llama 3 405B training: TP=8, PP=16, DP=128 → 16384 GPUs
TP, PP, DP = 8, 16, 128
mesh = init_device_mesh("cuda", (DP, PP, TP), mesh_dim_names=("dp", "pp", "tp"))
```
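To see which parallel group a given rank lands in, the mesh layout can be reproduced with integer arithmetic. The sketch assumes ranks are laid out row-major over the `(DP, PP, TP)` shape, so TP is the fastest-varying dimension; `rank_to_coords` is an illustrative helper.

```python
def rank_to_coords(rank: int, dp: int, pp: int, tp: int) -> tuple[int, int, int]:
    """Return (dp_idx, pp_idx, tp_idx) for a row-major (DP, PP, TP) mesh."""
    if not 0 <= rank < dp * pp * tp:
        raise ValueError("rank out of range")
    rest, tp_idx = divmod(rank, tp)
    dp_idx, pp_idx = divmod(rest, pp)
    return dp_idx, pp_idx, tp_idx


# Ranks 0-7 share one TP group; rank 8 starts the next pipeline stage.
for r in (0, 7, 8, 128, 16383):
    print(r, rank_to_coords(r, dp=128, pp=16, tp=8))
```

With 8 GPUs per node and TP=8, each TP group stays inside one node on NVLink, while PP and DP traffic crosses nodes.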
## Decision criteria
| Situation | Approach |
|---|---|
| Single node, ≤8 GPUs | TP only (NVLink) |
| Multi-node, model > node memory | TP intra-node + PP inter-node |
| 100B+ params | TP × PP × DP (3D) |
| Inference latency matters | TP > PP (PP pays for bubbles) |
| Throughput-oriented training | PP + DP, large micro-batches |
**Default**: for LLM training, 1F1B + activation recompute + 3D parallelism.
## 🔗 Graph
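- Parent: [[Distributed Training]]
## 🤖 LLM usage
**When**: model weights exceed single-GPU memory and training is multi-node; cross-node bandwidth is too low for TP.
**When not**: the model fits inside a single node; very small batches (the bubble ratio explodes); latency-critical inference.
## ❌ Anti-patterns
- **Ignoring the bubble**: with K=1 micro-batch, each GPU sits idle for (stages-1)/stages of the step.
- **Uneven partition**: unbalanced FLOPs across stages → the slowest stage sets the throughput.
- **PP only, no DP**: raising K eventually runs into batch-size limits → DP must be used alongside.
- **Ignoring embedding placement**: with tied input/output embeddings, placing both on the same stage keeps tied-weight sync simple; splitting them across stages needs an extra gradient sync.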
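For the uneven-partition case, a minimal sketch of balancing stages: given an estimated cost per layer, choose contiguous stage boundaries that minimize the cost of the slowest stage. The per-layer costs and the name `balanced_partition` are hypothetical; real frameworks use their own partitioning heuristics.

```python
from itertools import accumulate


def balanced_partition(costs: list[float], stages: int) -> list[int]:
    """Split layers into contiguous stages, minimizing the slowest stage's cost.

    Returns the index of the first layer of each stage.
    """
    n = len(costs)
    if not 1 <= stages <= n:
        raise ValueError("need 1 <= stages <= number of layers")
    prefix = [0.0, *accumulate(costs)]
    inf = float("inf")
    # best[k][i]: minimal bottleneck when the first i layers form k stages.
    best = [[inf] * (n + 1) for _ in range(stages + 1)]
    cut = [[0] * (n + 1) for _ in range(stages + 1)]
    best[0][0] = 0.0
    for k in range(1, stages + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):
                cand = max(best[k - 1][j], prefix[i] - prefix[j])
                if cand < best[k][i]:
                    best[k][i], cut[k][i] = cand, j
    # Walk the cut table backwards to recover the stage boundaries.
    starts, i = [], n
    for k in range(stages, 0, -1):
        i = cut[k][i]
        starts.append(i)
    return starts[::-1]


# Heavy first and last layers (e.g. embedding and output head) get their own stages.
print(balanced_partition([4, 1, 1, 1, 1, 4], stages=3))
```

The dynamic program is O(stages × layers²), which is fine for the layer counts of typical models.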
## 🧪 Verification / duplicates
- Verified (Megatron-LM paper, GPipe, PipeDream, PyTorch pipelining docs 2026).
- Trust level: A.
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — PP schedules + 3D parallel patterns |