2nd/10_Wiki/Topics/Other/Parallel-Computing.md
Antigravity Agent f8b21af4be Wiki cleanup: error-doc removal, dedup merge, link normalization
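Large-scale cleanup of 10_Wiki/Topics:
- Removed 227 error-capture / unfinished stub documents
- Merged 43 cross-folder duplicate clusters (63 files → redirect)
- Link-name normalization: fixed broken links, pointed links directly at redirect targets, mapped concepts (~2,400 changes)
- Created 6 new category MOCs
- Removed 10,058 unresolved related-keyword links from Graph sections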

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 23:52:15 +09:00

---
id: wiki-2026-0508-parallel-computing
title: Parallel Computing
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Parallel Processing, Concurrent Computing, HPC]
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: [hpc, parallelism, gpu, distributed]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python-cuda
  framework: jax-pytorch-mpi
---
# Parallel Computing
## One-line summary
> **"Executing multiple computations simultaneously."** The topic runs from Flynn's taxonomy (SISD/SIMD/MIMD) through modern GPU SIMT and distributed clusters (MPI, NCCL) to the 4D parallelism of Llama 3 405B (DP/TP/PP plus context/sequence parallelism). As of 2026 the default workloads, inference and training, are parallel and far outweigh single-core sequential execution.
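## Core concepts
### Flynn's taxonomy
- **SISD**: single instruction, single data. The classic single-core CPU.
- **SIMD**: single instruction, multiple data. AVX-512, GPU warp.
- **MISD**: multiple instruction, single data. Rare in practice; listed for completeness.
- **MIMD**: multiple instruction, multiple data. Multi-core CPU, cluster.
- **SIMT**: single instruction, multiple threads. The NVIDIA / AMD GPU execution model; a later extension, not one of Flynn's original four classes.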
### Parallelism dimensions (modern DL)
- **Data parallel (DP)**: the same model replicated, each replica fed different batches.
- **Tensor parallel (TP)**: a single tensor split across devices.
- **Pipeline parallel (PP)**: layers split into stages.
- **Sequence parallel (SP)**: the sequence dimension split across devices (long context); the Llama 3 paper calls its variant context parallelism (CP).
- **Expert parallel (EP)**: the experts of an MoE model placed across devices.
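
One way to picture how these dimensions combine is to give each device rank a coordinate in a DP × PP × TP grid. The sketch below assumes a hypothetical layout in which TP varies fastest (so tensor-parallel peers sit on adjacent ranks); it is not the mapping of any particular framework, and `rank_to_coords` / `tp_group` are illustrative names.

```python
def rank_to_coords(rank, dp, pp, tp):
    """Map a flat rank to (dp_idx, pp_idx, tp_idx); TP varies fastest."""
    if not 0 <= rank < dp * pp * tp:
        raise ValueError("rank out of range for this grid")
    tp_idx = rank % tp
    pp_idx = (rank // tp) % pp
    dp_idx = rank // (tp * pp)
    return dp_idx, pp_idx, tp_idx


def tp_group(rank, dp, pp, tp):
    """Ranks that share this rank's DP and PP coordinates (its TP peers)."""
    dp_idx, pp_idx, _ = rank_to_coords(rank, dp, pp, tp)
    base = (dp_idx * pp + pp_idx) * tp
    return list(range(base, base + tp))
```

With 8 devices and `dp = pp = tp = 2`, rank 5 lands at `(1, 0, 1)` and its tensor-parallel peers are ranks 4 and 5.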
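### Applications
1. **LLM training**: Llama 3 405B combines DP×TP×PP×CP (context parallelism, i.e. a sequence-dimension split); EP is added for MoE models rather than for this dense model.
2. **Inference**: vLLM uses continuous batching plus tensor parallelism.
3. **Scientific compute**: weather, molecular dynamics (MPI).
4. **Rendering**: distributed rendering with Pixar RenderMan.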
## 💻 Patterns
### NumPy → JAX SIMD vectorization
```python
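# Implicit SIMD on CPU/GPU/TPU: XLA vectorizes the compiled computation
import jax
import jax.numpy as jnp

@jax.jit
def matmul_vectorized(A, B):
    # Batched matmul: (b, i, j) x (b, j, k) -> (b, i, k)
    return jnp.einsum("bij,bjk->bik", A, B)

# Example inputs (shapes chosen only for illustration)
A = jnp.ones((8, 4, 3))
B = jnp.ones((8, 3, 5))

# vmap: auto-vectorize a per-example matmul over the batch dim
batched = jax.vmap(lambda x, y: x @ y)(A, B)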
```
### CUDA kernel (SIMT)
```cpp
// Explicit thread-level parallelism: one thread per output element
__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) c[idx] = a[idx] + b[idx];  // guard the partially filled last block
}
// launch: vec_add<<<(n + 255) / 256, 256>>>(a, b, c, n);
```
### Multi-GPU data parallel (PyTorch)
```python
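import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes a torchrun launch, which sets LOCAL_RANK for each process
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# `model`, `loader`, and `optim` are assumed to be defined elsewhere
model = DDP(model.cuda(), device_ids=[local_rank])
for batch in loader:
    optim.zero_grad()
    loss = model(batch).loss
    loss.backward()  # NCCL all-reduces gradients across ranks
    optim.step()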
```
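
The gradient all-reduce in the loop above can be pictured with the classic ring algorithm: a reduce-scatter phase followed by an all-gather phase, each taking N-1 steps for N ranks. The following is a single-process simulation on plain lists, offered as a sketch of the textbook ring scheme and not as a description of what NCCL actually runs (NCCL chooses among several algorithms). `ring_all_reduce` is an illustrative name.

```python
def ring_all_reduce(buffers):
    """Simulate a sum all-reduce over a ring; buffers[r] is rank r's vector.

    Assumes every vector has the same length, divisible by the rank count.
    """
    n = len(buffers)
    chunk = len(buffers[0]) // n
    data = [list(b) for b in buffers]  # copy so the inputs stay untouched

    def span(c):
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: after n-1 steps rank r holds the full sum
    # of chunk (r + 1) % n.
    for step in range(n - 1):
        # Build all messages first so the sends within a step are simultaneous
        sends = [(r, (r - step) % n) for r in range(n)]
        payloads = [[data[r][i] for i in span(c)] for r, c in sends]
        for (r, c), payload in zip(sends, payloads):
            dst = (r + 1) % n
            for i, v in zip(span(c), payload):
                data[dst][i] += v

    # Phase 2, all-gather: circulate the finished chunks around the ring.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n) for r in range(n)]
        payloads = [[data[r][i] for i in span(c)] for r, c in sends]
        for (r, c), payload in zip(sends, payloads):
            dst = (r + 1) % n
            for i, v in zip(span(c), payload):
                data[dst][i] = v
    return data
```

Each rank sends and receives only one chunk per step, which is why the per-rank traffic stays roughly constant as the ring grows.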
### Tensor parallel (megatron-style)
```python
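import torch
import torch.nn as nn
import torch.distributed as dist

# A single Linear layer split column-wise (along d_out) across N GPUs
class ColumnParallelLinear(nn.Module):
    def __init__(self, d_in, d_out, world_size):
        super().__init__()
        self.world_size = world_size
        self.weight = nn.Parameter(torch.empty(d_out // world_size, d_in))
        nn.init.xavier_uniform_(self.weight)  # torch.empty alone is uninitialized

    def forward(self, x):
        local_out = x @ self.weight.T
        # Gather shards across the TP group (forward only: plain
        # dist.all_gather does not propagate gradients)
        shards = [torch.empty_like(local_out) for _ in range(self.world_size)]
        dist.all_gather(shards, local_out)
        return torch.cat(shards, dim=-1)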
```
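
The reason a column split is safe is that each output column depends on only one column of the transposed weight, so per-shard results can simply be concatenated. A minimal single-process check on nested lists follows; the loop over `rank` stands in for separate GPUs, and both function names are illustrative.

```python
def matmul(x, w_t):
    """x: (n, d_in) rows; w_t: (d_in, d_out). Plain nested-list matmul."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w_t)]
            for row in x]


def column_parallel_matmul(x, w_t, world_size):
    """Split w_t by output columns, multiply per shard, concatenate."""
    d_out = len(w_t[0])
    assert d_out % world_size == 0, "d_out must divide evenly"
    width = d_out // world_size
    outputs = []
    for rank in range(world_size):  # each iteration stands in for one GPU
        shard = [row[rank * width:(rank + 1) * width] for row in w_t]
        outputs.append(matmul(x, shard))
    # "all-gather": concatenate the shard outputs along the last dim
    return [sum((out[i] for out in outputs), []) for i in range(len(x))]
```

For any input where `d_out` divides evenly, `column_parallel_matmul(x, w_t, k)` should match `matmul(x, w_t)` exactly.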
### MPI scientific compute
```python
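from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Domain decomposition: each rank works on its own slice of the grid.
# `scatter_grid`, `compute_step`, and `global_grid` are placeholders
# assumed to be defined elsewhere.
local_data = scatter_grid(global_grid, rank, size)
local_result = compute_step(local_data)
# Sum the per-rank results so every rank ends up with the global value
global_result = comm.allreduce(local_result, op=MPI.SUM)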
```
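
The `scatter_grid` helper above is left undefined. One plausible ingredient, assuming a 1-D grid, is a balanced block partition that tells each rank which index range it owns. `block_range` below is a hypothetical helper, not part of mpi4py.

```python
def block_range(n_cells, rank, size):
    """Return the half-open [start, stop) slice of a 1-D grid owned by `rank`.

    The first `n_cells % size` ranks each take one extra cell, so block
    sizes differ by at most one.
    """
    base, extra = divmod(n_cells, size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop
```

For 10 cells on 3 ranks this gives `(0, 4)`, `(4, 7)`, and `(7, 10)`; keeping block sizes within one cell of each other also limits stragglers.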
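### Pipeline parallel schedule (GPipe-style)
```python
# Fill-drain (GPipe-style) schedule: all forwards, then all backwards.
# A 1F1B schedule would instead interleave one forward with one backward
# per stage once the pipeline is full; that is not implemented here.
def pipeline_step(stages, micro_batches):
    """Run every micro-batch forward through all stages, then backward."""
    fwd_queue = []
    for mb in micro_batches:
        for s, stage in enumerate(stages):
            mb = stage.forward(mb)  # output of stage s feeds stage s + 1
            fwd_queue.append((s, mb))
    # Backward visits the (stage, activation) pairs in reverse order
    for s, mb in reversed(fwd_queue):
        stages[s].backward(mb)
```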
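
How much time a pipeline wastes while filling and draining can be estimated with the standard bubble formula, (p - 1) / (m + p - 1) for p stages and m micro-batches. The sketch assumes every stage takes the same time per micro-batch, which real models only approximate; `bubble_fraction` is an illustrative name.

```python
def bubble_fraction(num_stages, num_micro_batches):
    """Idle fraction of a fill-drain pipeline schedule.

    Assumes every stage takes the same time per micro-batch.
    """
    p, m = num_stages, num_micro_batches
    return (p - 1) / (m + p - 1)
```

With 4 stages, 4 micro-batches leave 3/7 (about 43%) of the schedule idle, while 32 micro-batches cut that to 3/35 (under 9%). More micro-batches per step is the usual lever for shrinking the bubble.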
## Decision criteria
| Workload | Parallelism |
|---|---|
| Single-machine, CPU-bound | multiprocessing / Ray |
| Single-GPU dense ops | CUDA / JAX SIMT |
| Multi-GPU, same node | NCCL DDP / FSDP |
| Multi-node training | DP×TP×PP (Megatron, DeepSpeed) |
| Long context (128K+) | + Sequence Parallel |
| MoE model | + Expert Parallel |
| Scientific HPC | MPI + domain decomposition |
**Default**: start with SIMD (NumPy/JAX) → GPU SIMT → multi-GPU DDP → 4D parallelism, in that order.
## 🔗 Graph
- Parent: [[Distributed-Systems]]
- Variant: [[Distributed-Training]]
- Application: [[LLM_Optimization_and_Deployment_Strategies|vLLM]]
- Adjacent: [[Concurrency]] · [[Parallel-Computing|Parallel-Processing]]
## 🤖 LLM usage
**When**: parallelism strategy selection, communication overhead analysis, NCCL/MPI debugging.
**When not**: inherently sequential algorithms, where adding workers cannot get past the Amdahl bound.
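
The Amdahl bound can be made concrete with the usual formula, speedup = 1 / ((1 - p) + p / n), where p is the fraction of runtime that parallelizes and n is the number of workers. A minimal sketch (`amdahl_speedup` is an illustrative name):

```python
def amdahl_speedup(parallel_fraction, workers):
    """Speedup bound when only `parallel_fraction` of the runtime scales."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / workers)
```

Even with 95% of the runtime parallelizable, the speedup stays below 20× no matter how many workers are added, because the 5% serial part never shrinks.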
## ❌ Anti-patterns
- **Premature parallelization**: parallelizing blindly without first profiling the sequential version.
- **Communication-bound**: chunks that are too fine-grained, so NCCL overhead dominates the compute.
- **Load imbalance**: uneven shard sizes produce stragglers.
- **Race conditions**: shared state accessed without synchronization.
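
For the race-condition case, the usual remedy is to guard the shared read-modify-write with a lock. A minimal sketch using only the standard library (`count_with_lock` is an illustrative name):

```python
import threading


def count_with_lock(num_threads, increments_per_thread):
    """Increment a shared counter from several threads, guarded by a lock."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(increments_per_thread):
            with lock:  # without this, the read-modify-write can interleave
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter
```

The lock serializes every increment, which is correct but is itself a fine-grained synchronization cost; accumulating per-thread totals and combining them once at the end is often the cheaper design.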
## 🧪 Verification / duplicates
- Verified (Hennessy & Patterson 6e; Megatron-LM paper 2019; Llama 3 paper 2024; CUDA C++ Programming Guide 12.x).
- Trust level A.
- [[Parallel-Computing|Parallel-Processing]] is an alias / redirect.
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — Flynn + 4D DL parallelism + modern stack |