---
id: wiki-2026-0508-parallel-computing
title: Parallel Computing
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Parallel Processing, Concurrent Computing, HPC]
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: [hpc, parallelism, gpu, distributed]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python-cuda
  framework: jax-pytorch-mpi
---

# Parallel Computing

## In one line

> **"Run multiple computations simultaneously."** The topic spans Flynn's taxonomy (SISD/SIMD/MISD/MIMD), modern GPU SIMT, distributed clusters (MPI, NCCL), and the 4D parallelism used to train Llama 3.x 405B (DP/TP/PP plus context parallelism, which splits the sequence dimension). As of 2026 the default workloads, inference and training, are parallel rather than single-core sequential.

## Core concepts

### Flynn's taxonomy

- **SISD**: single instruction, single data. The classic single-core CPU.
- **SIMD**: single instruction, multiple data. AVX-512 vector units; a GPU warp behaves similarly.
- **MISD**: multiple instruction, single data. Rare in practice.
- **MIMD**: multiple instruction, multiple data. Multi-core CPUs, clusters.
- **SIMT**: single instruction, multiple threads. The NVIDIA / AMD GPU execution model; a later extension, not one of Flynn's four original classes.
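
As a rough illustration of the SISD versus SIMD distinction above, the sketch below computes the same element-wise sum twice: once with an explicit scalar loop, and once as a single NumPy array expression, which NumPy hands to compiled loops that can use SIMD instructions where the hardware allows. The function names and the choice of NumPy are assumptions made for illustration.

```python
import numpy as np


def add_scalar_loop(a, b):
    """SISD-style: one addition per loop iteration."""
    out = np.empty_like(a)
    for i in range(a.shape[0]):
        out[i] = a[i] + b[i]
    return out


def add_vectorized(a, b):
    """SIMD-style: one array expression covering all elements."""
    return a + b


a = np.arange(8, dtype=np.float32)
b = np.full(8, 2.0, dtype=np.float32)
loop_result = add_scalar_loop(a, b)
vector_result = add_vectorized(a, b)
```

Both paths produce the same values; only the way the work is expressed and dispatched differs.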

### Parallelism dimensions (modern DL)

- **Data parallel (DP)**: the same model replicated, each replica on a different batch.
- **Tensor parallel (TP)**: a single tensor split across devices.
- **Pipeline parallel (PP)**: layers split into stages.
- **Sequence parallel (SP)**: the sequence dimension split across devices (long context).
- **Expert parallel (EP)**: MoE experts placed across devices.
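
To make the first two dimensions concrete, here is a small single-process NumPy sketch in which the "devices" are just list entries (an assumption for illustration; real systems exchange shards through NCCL or similar collectives). It checks that splitting the batch (DP) or splitting the weight matrix by output columns (TP) reproduces the unsharded result.

```python
import numpy as np

rng = np.random.default_rng(0)
n_devices = 4
x = rng.standard_normal((8, 16))   # batch of 8 inputs, 16 features each
w = rng.standard_normal((16, 12))  # weight of one linear layer

full = x @ w  # reference: everything on one device

# Data parallel: every device holds all of w and a slice of the batch.
dp_shards = np.array_split(x, n_devices, axis=0)
dp_out = np.concatenate([shard @ w for shard in dp_shards], axis=0)

# Tensor parallel: every device holds the full batch and a column slice of w.
tp_shards = np.array_split(w, n_devices, axis=1)
tp_out = np.concatenate([x @ shard for shard in tp_shards], axis=1)
```

The concatenation step stands in for the gather that a real tensor-parallel layer performs across devices.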

### Applications

1. **LLM training**: Llama 3.x 405B combines DP, TP, PP, and context parallelism (a sequence-dimension split). It is a dense model, so EP does not apply.
2. **Inference**: vLLM with continuous batching plus tensor parallelism.
3. **Scientific compute**: weather, molecular dynamics (MPI).
4. **Rendering**: Pixar RenderMan, distributed.

## 💻 Patterns

### NumPy → JAX SIMD vectorization

```python
# Implicit SIMD on CPU/GPU/TPU
import jax
import jax.numpy as jnp

# Example inputs: a batch of 8 matrix pairs
key_a, key_b = jax.random.split(jax.random.PRNGKey(0))
A = jax.random.normal(key_a, (8, 4, 5))
B = jax.random.normal(key_b, (8, 5, 3))

@jax.jit
def matmul_vectorized(A, B):
    # Batched matmul expressed as a single einsum
    return jnp.einsum("bij,bjk->bik", A, B)

# vmap: auto-vectorize a per-example matmul over the batch dim
batched = jax.vmap(lambda x, y: x @ y)(A, B)
```

### CUDA kernel (SIMT)

```cpp
// Explicit thread-level parallelism: one thread per element
__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (idx < n) c[idx] = a[idx] + b[idx];            // guard the last partial block
}

// launch: vec_add<<<(n+255)/256, 256>>>(a, b, c, n);
```

### Multi-GPU data parallel (PyTorch)

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes a torchrun launch (which sets LOCAL_RANK) and that `model`,
# `loader`, and `optim` are defined elsewhere.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
model = DDP(model.cuda(), device_ids=[local_rank])

for batch in loader:
    optim.zero_grad()        # clear gradients from the previous step
    loss = model(batch).loss
    loss.backward()          # DDP all-reduces gradients over NCCL here
    optim.step()
```
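
The all-reduce in the loop above averages gradients across ranks. The following single-process NumPy sketch (illustrative names, with a least-squares loss chosen only for simplicity) shows why that is sound: with equal-sized shards, the mean of the per-shard gradients equals the gradient over the full batch.

```python
import numpy as np


def mse_grad(X, y, w):
    """Gradient of mean((X @ w - y) ** 2) with respect to w."""
    return 2.0 * X.T @ (X @ w - y) / X.shape[0]


rng = np.random.default_rng(0)
X = rng.standard_normal((32, 5))
y = rng.standard_normal(32)
w = rng.standard_normal(5)

world_size = 4                      # simulated number of ranks
X_shards = np.split(X, world_size)  # equal-sized shards, one per rank
y_shards = np.split(y, world_size)

# Each simulated rank computes a local gradient on its own shard.
local_grads = [mse_grad(Xs, ys, w) for Xs, ys in zip(X_shards, y_shards)]

# All-reduce with averaging, simulated here as a plain mean over ranks.
averaged_grad = np.mean(local_grads, axis=0)
full_grad = mse_grad(X, y, w)
```

If the shards were unequal in size, a plain mean would no longer match the full-batch gradient, which is one reason data loaders for DDP usually pad or drop to equal shard sizes.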
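
### Tensor parallel (Megatron-style)

```python
import torch
import torch.distributed as dist
import torch.nn as nn

# A single Linear split column-wise (along d_out) across N GPUs
class ColumnParallelLinear(nn.Module):
    def __init__(self, d_in, d_out, world_size):
        super().__init__()
        self.world_size = world_size
        # Each rank holds only its slice of the output features (uninitialized here).
        self.weight = nn.Parameter(torch.empty(d_out // world_size, d_in))

    def forward(self, x):
        local_out = x @ self.weight.T
        # Gather the per-rank slices across the TP group. Forward-only sketch:
        # training needs an autograd-aware gather.
        parts = [torch.empty_like(local_out) for _ in range(self.world_size)]
        dist.all_gather(parts, local_out.contiguous())
        return torch.cat(parts, dim=-1)
```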
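
### MPI scientific compute

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Placeholder helpers (illustrative stand-ins for a real solver)
def scatter_grid(grid, rank, size):
    return np.array_split(grid, size)[rank]  # this rank's contiguous slab

def compute_step(local):
    return float(np.sum(local ** 2))  # local partial result

# Domain decomposition: each rank works on its own slab, then results are summed
global_grid = np.arange(1_000, dtype=np.float64)
local_data = scatter_grid(global_grid, rank, size)
local_result = compute_step(local_data)
global_result = comm.allreduce(local_result, op=MPI.SUM)
```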

### Pipeline parallel schedule

```python
# GPipe-style fill-drain schedule, simulated sequentially
def pipeline_step(stages, micro_batches):
    """Run every forward, then every backward in reverse order.

    This is the GPipe fill-drain order. A 1F1B schedule instead alternates
    one forward and one backward per stage once the pipeline is full.
    """
    fwd_queue = []
    for mb in micro_batches:
        for s, stage in enumerate(stages):
            mb = stage.forward(mb)
            fwd_queue.append((s, mb))  # remember each stage's output
    for s, mb in reversed(fwd_queue):
        stages[s].backward(mb)
```
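
A common way to reason about such schedules is the pipeline bubble. With p stages and m equal-cost micro-batches, a fill-drain schedule leaves each stage idle for p - 1 of the m + p - 1 time slots. The helper below (a hypothetical name, and an idealized model that ignores communication and uneven stage costs) computes that fraction.

```python
def bubble_fraction(num_stages: int, num_micro_batches: int) -> float:
    """Idle fraction of an idealized fill-drain pipeline schedule."""
    if num_stages < 1 or num_micro_batches < 1:
        raise ValueError("need at least one stage and one micro-batch")
    return (num_stages - 1) / (num_micro_batches + num_stages - 1)


for m in (1, 4, 16, 64):
    print(f"stages=4 micro_batches={m:>2} bubble={bubble_fraction(4, m):.3f}")
```

More micro-batches shrink the bubble, at the cost of less work per micro-batch.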

## Decision criteria

| Workload | Parallelism |
|---|---|
| Single-machine, CPU-bound | multiprocessing / Ray |
| Single-GPU dense ops | CUDA / JAX SIMT |
| Multi-GPU, same node | NCCL DDP / FSDP |
| Multi-node training | DP×TP×PP (Megatron, DeepSpeed) |
| Long context (128K+) | + Sequence Parallel |
| MoE model | + Expert Parallel |
| Scientific HPC | MPI + domain decomposition |

**Default**: progress from SIMD (NumPy/JAX) → GPU SIMT → multi-GPU DDP → 4D parallelism.

## 🔗 Graph

- Parent: [[Distributed-Systems]]
- Variant: [[Distributed-Training]]
- Application: [[LLM_Optimization_and_Deployment_Strategies|vLLM]]
- Adjacent: [[Concurrency]] · [[Parallel-Computing|Parallel-Processing]]

## 🤖 LLM usage

**When**: selecting a parallelism strategy, analyzing communication overhead, debugging NCCL/MPI.
**When not**: inherently sequential algorithms, where Amdahl's law bounds the achievable speedup.
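
The Amdahl bound mentioned above can be worked through numerically. For a parallelizable fraction p of the runtime and n workers, the speedup is 1 / ((1 - p) + p / n), which approaches 1 / (1 - p) as n grows. The function name below is chosen for illustration.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on speedup under Amdahl's law."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if workers < 1:
        raise ValueError("workers must be >= 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


for n in (1, 8, 64, 1024):
    print(f"workers={n:>4} speedup={amdahl_speedup(0.95, n):.2f}")
```

With 95% of the work parallelizable, the speedup stays below 20× no matter how many workers are added.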

## ❌ Anti-patterns

- **Premature parallelization**: parallelizing blindly without first profiling the sequential version.
- **Communication-bound**: chunks that are too fine-grained, so NCCL overhead dominates.
- **Load imbalance**: uneven shard sizes produce stragglers.
- **Race conditions**: shared state without synchronization.
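
The communication-bound case can be illustrated with the standard latency-bandwidth (alpha-beta) cost model, in which each message pays a fixed latency plus its size divided by bandwidth. The function name and the numbers below are made-up placeholders, not measurements of NCCL or any real interconnect.

```python
def transfer_time(total_bytes: float, num_chunks: int,
                  latency_s: float, bandwidth_bytes_per_s: float) -> float:
    """Time to send total_bytes as num_chunks messages, one after another."""
    if num_chunks < 1:
        raise ValueError("num_chunks must be >= 1")
    return num_chunks * latency_s + total_bytes / bandwidth_bytes_per_s


# Placeholder figures: 64 MiB payload, 10 microsecond latency, 10 GB/s link.
payload = 64 * 1024 * 1024
for chunks in (1, 1_000, 100_000):
    t = transfer_time(payload, chunks, latency_s=10e-6, bandwidth_bytes_per_s=10e9)
    print(f"chunks={chunks:>7} time={t * 1e3:.2f} ms")
```

Under this model the bandwidth term is fixed by the payload, so splitting into more chunks only adds latency. Real systems can overlap chunks with compute, which is why some chunking still pays off.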

## 🧪 Verification / duplicates

- Verified (Hennessy & Patterson 6e; Megatron-LM paper 2019; Llama 3 paper 2024; CUDA C++ Programming Guide 12.x).
- Trust level: A.
- [[Parallel-Computing|Parallel-Processing]] is an alias / redirect.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: Flynn + 4D DL parallelism + modern stack |