---
id: wiki-2026-0508-parallel-computing
title: Parallel Computing
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Parallel Processing, Concurrent Computing, HPC]
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: [hpc, parallelism, gpu, distributed]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python-cuda
  framework: jax-pytorch-mpi
---
# Parallel Computing
## In One Line
> **"Run multiple computations simultaneously."** The topic spans Flynn's taxonomy (SISD/SIMD/MIMD), modern GPU SIMT, distributed clusters (MPI, NCCL), and the 4D parallelism used to train Llama 3 405B (DP/TP/PP/CP, where CP is context parallelism, a sequence-dimension split closely related to SP). As of 2026, the dominant workloads, inference and training, are parallel by default rather than single-core sequential.
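## Core Concepts
### Flynn's taxonomy
- **SISD**: single instruction, single data: the classic single-core CPU.
- **SIMD**: single instruction, multiple data: AVX-512 vector units; a GPU warp also executes in SIMD-like lock-step.
- **MISD**: multiple instruction, single data: rare in practice, listed here for completeness.
- **MIMD**: multiple instruction, multiple data: multi-core CPUs, clusters.
- **SIMT**: single instruction, multiple threads: the NVIDIA / AMD GPU execution model. It is an extension of SIMD, not one of Flynn's original four classes.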
### Parallelism dimensions (modern DL)
- **Data parallel (DP)**: the same model is replicated, and each replica processes a different batch.
- **Tensor parallel (TP)**: a single tensor is split across devices.
- **Pipeline parallel (PP)**: the layers are split into stages.
- **Sequence parallel (SP)**: the sequence dimension is split (long context).
- **Expert parallel (EP)**: the experts of an MoE model are placed across devices.
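
As a minimal sketch of how these dimensions compose, the functions below map a flat global rank onto (DP, PP, TP) coordinates. The names `rank_to_coords` and `tp_group`, and the choice to make TP the fastest-varying dimension (so that tensor-parallel peers sit on adjacent ranks), are assumptions for illustration and are not taken from any particular framework.

```python
def rank_to_coords(rank, dp, pp, tp):
    """Map a flat rank to (dp_idx, pp_idx, tp_idx) with TP varying fastest."""
    world_size = dp * pp * tp
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} outside world of size {world_size}")
    tp_idx = rank % tp
    pp_idx = (rank // tp) % pp
    dp_idx = rank // (tp * pp)
    return dp_idx, pp_idx, tp_idx


def tp_group(rank, dp, pp, tp):
    """Return the ranks that share this rank's DP and PP coordinates."""
    dp_idx, pp_idx, _ = rank_to_coords(rank, dp, pp, tp)
    base = (dp_idx * pp + pp_idx) * tp
    return list(range(base, base + tp))
```

With `dp=2, pp=2, tp=2`, rank 5 lands in DP replica 1, pipeline stage 0, and shares its tensor-parallel group with rank 4.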
### Applications
1. **LLM training**: Llama 3 405B combines DP (FSDP), TP, PP, and context parallelism (CP); MoE models add EP on top.
2. **Inference**: vLLM uses continuous batching plus tensor parallelism.
3. **Scientific compute**: weather, molecular dynamics (MPI).
4. **Rendering**: distributed rendering with Pixar RenderMan.
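## 💻 Patterns
### NumPy → JAX SIMD vectorization
```python
# Implicit SIMD: XLA compiles this for CPU/GPU/TPU
import jax
import jax.numpy as jnp

@jax.jit
def matmul_vectorized(A, B):
    # Batched matmul: (b, i, j) x (b, j, k) -> (b, i, k)
    return jnp.einsum("bij,bjk->bik", A, B)

# Example inputs (shapes are illustrative)
A = jnp.ones((8, 4, 3))
B = jnp.ones((8, 3, 5))

# vmap: auto-vectorize a per-example matmul over the batch dim
batched = jax.vmap(lambda x, y: x @ y)(A, B)
```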
### CUDA kernel (SIMT)
```cpp
// Explicit thread-level parallelism: one thread per element
__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) c[idx] = a[idx] + b[idx];  // guard: the last block may overshoot n
}
// launch: vec_add<<<(n + 255) / 256, 256>>>(a, b, c, n);
```
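
The launch configuration and index arithmetic of the kernel can be emulated on the host to see why the bounds guard is needed. This is a sequential sketch in Python, not GPU code; `launch_config` and `simulate_vec_add` are hypothetical helper names, and the nested loops stand in for threads that a GPU would run in parallel.

```python
def launch_config(n, block_size=256):
    """Return (num_blocks, block_size) using ceiling division, as in (n + 255) / 256."""
    num_blocks = (n + block_size - 1) // block_size
    return num_blocks, block_size


def simulate_vec_add(a, b, block_size=256):
    """Emulate the kernel's indexing sequentially, one loop iteration per thread."""
    n = len(a)
    c = [0.0] * n
    num_blocks, block_dim = launch_config(n, block_size)
    for block_idx in range(num_blocks):
        for thread_idx in range(block_dim):
            idx = block_idx * block_dim + thread_idx
            if idx < n:  # bounds guard: the last block may overshoot n
                c[idx] = a[idx] + b[idx]
    return c
```

For `n = 1025` the grid needs 5 blocks of 256 threads, so 255 threads in the last block do nothing; without the `idx < n` check they would write out of bounds.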
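### Multi-GPU data parallel (PyTorch)
```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Launched with torchrun, which sets LOCAL_RANK for each process
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# `model`, `loader`, and `optim` are assumed to be defined elsewhere
model = DDP(model.cuda(), device_ids=[local_rank])
for batch in loader:
    optim.zero_grad()
    loss = model(batch).loss  # assumes the model output exposes `.loss`
    loss.backward()  # DDP all-reduces gradients over NCCL during backward
    optim.step()
```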
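
The gradient all-reduce that happens during `loss.backward()` is commonly implemented as a ring. The sketch below simulates a ring all-reduce (sum) in plain Python to show the two phases; it is an illustration under simplifying assumptions (one element per chunk, vector length equal to the number of ranks), not how NCCL is actually implemented, and `ring_all_reduce` is a hypothetical name.

```python
def ring_all_reduce(grads):
    """Sum-reduce per-rank vectors over a simulated ring; each chunk is one element.

    grads[r] is rank r's vector; its length must equal the number of ranks.
    Returns the per-rank vectors after the all-reduce (all identical).
    """
    n = len(grads)
    if any(len(g) != n for g in grads):
        raise ValueError("this sketch needs one element per rank in every vector")
    data = [list(g) for g in grads]

    # Reduce-scatter: after n-1 steps, rank r holds the full sum of chunk (r+1) % n
    for step in range(n - 1):
        # Snapshot all sends first so every rank sends its pre-step value
        sends = [(r, (r - step) % n, data[r][(r - step) % n]) for r in range(n)]
        for r, chunk, value in sends:
            data[(r + 1) % n][chunk] += value

    # All-gather: circulate the finished chunks around the ring
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, data[r][(r + 1 - step) % n]) for r in range(n)]
        for r, chunk, value in sends:
            data[(r + 1) % n][chunk] = value

    return data
```

Each rank sends and receives only one chunk per step, which is why the per-rank traffic stays roughly constant as the ring grows. DDP additionally divides the summed gradients by the world size to obtain an average.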
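### Tensor parallel (Megatron-style)
```python
import torch
import torch.nn as nn
import torch.distributed as dist

# A single Linear layer split column-wise across the GPUs of a TP group
class ColumnParallelLinear(nn.Module):
    def __init__(self, d_in, d_out, world_size):
        super().__init__()
        self.world_size = world_size
        # Each rank holds d_out // world_size output features (initialize before use)
        self.weight = nn.Parameter(torch.empty(d_out // world_size, d_in))

    def forward(self, x):
        local_out = x @ self.weight.T
        # Gather shards over the default group (forward-only: no gradient flows through it)
        shards = [torch.empty_like(local_out) for _ in range(self.world_size)]
        dist.all_gather(shards, local_out)
        return torch.cat(shards, dim=-1)
```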
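
The reason a column split works is that each shard produces a disjoint slice of the output features, so concatenating the partial outputs reproduces the full matmul. The following single-process sketch checks that with plain Python lists; the helper names (`matmul`, `split_columns`, `column_parallel_forward`) are hypothetical, and the weight is stored as `d_in x d_out` for readability.

```python
def matmul(x, w_t):
    """Multiply x (rows x d_in) by w_t (d_in x d_out) using plain lists."""
    return [[sum(xi * wi for xi, wi in zip(row, col)) for col in zip(*w_t)] for row in x]


def split_columns(w_t, world_size):
    """Split a (d_in x d_out) weight into world_size column shards."""
    d_out = len(w_t[0])
    if d_out % world_size != 0:
        raise ValueError("d_out must be divisible by world_size")
    width = d_out // world_size
    return [[row[r * width:(r + 1) * width] for row in w_t] for r in range(world_size)]


def column_parallel_forward(x, w_t, world_size):
    """Compute each shard's partial output, then concatenate along the last dim."""
    partials = [matmul(x, shard) for shard in split_columns(w_t, world_size)]
    return [sum((p[i] for p in partials), []) for i in range(len(x))]
```

No reduction is needed for a column split; only a gather. A row split would instead require summing the partial outputs across ranks.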
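### MPI scientific compute
```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Domain decomposition: each rank works on its own slice of the grid.
# scatter_grid, compute_step, and global_grid are application-specific
# placeholders that this snippet does not define.
local_data = scatter_grid(global_grid, rank, size)
local_result = compute_step(local_data)
# Sum the partial results so that every rank holds the global value
global_result = comm.allreduce(local_result, op=MPI.SUM)
```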
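
One common way to perform the domain decomposition step is a block partition that spreads any remainder over the lowest ranks, so shard sizes differ by at most one. The sketch below is one possible index calculation for a 1D domain; `block_bounds` is a hypothetical helper, not part of mpi4py.

```python
def block_bounds(n, size, rank):
    """Return the half-open [start, end) range owned by `rank` for n items."""
    base, extra = divmod(n, size)
    # The first `extra` ranks each take one additional item
    start = rank * base + min(rank, extra)
    end = start + base + (1 if rank < extra else 0)
    return start, end
```

For 10 items on 3 ranks this yields the ranges `[0, 4)`, `[4, 7)`, and `[7, 10)`, which keeps the load nearly even.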
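### Pipeline parallel (GPipe-style schedule)
```python
# GPipe-style schedule, simulated sequentially on one process
def pipeline_step(stages, micro_batches):
    """Run every forward pass, then every backward pass in reverse order.

    A 1F1B schedule would instead interleave one forward with one backward
    per stage after warm-up to reduce peak activation memory.
    """
    fwd_queue = []
    for mb in micro_batches:
        act = mb
        for s, stage in enumerate(stages):
            act = stage.forward(act)
            fwd_queue.append((s, act))
    # `stage.backward` is assumed to accept the stage's saved forward output
    for s, act in reversed(fwd_queue):
        stages[s].backward(act)
```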
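
The cost of a pipeline schedule is usually summarized by its bubble, the fraction of time stages sit idle while the pipeline fills and drains. For a GPipe-style schedule, which runs all forward passes before any backward pass, with `p` stages and `m` micro-batches of equal cost, the idle fraction is `(p - 1) / (m + p - 1)`. The sketch below computes it under that idealized assumption; `gpipe_bubble_fraction` is a hypothetical name.

```python
def gpipe_bubble_fraction(num_stages, num_micro_batches):
    """Idle fraction of an idealized pipeline with uniform per-stage cost."""
    if num_stages < 1 or num_micro_batches < 1:
        raise ValueError("need at least one stage and one micro-batch")
    p, m = num_stages, num_micro_batches
    return (p - 1) / (m + p - 1)
```

With 4 stages, going from 4 to 32 micro-batches shrinks the bubble from 3/7 to 3/35, which is why pipeline parallelism relies on many micro-batches per step.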
## Decision Criteria
| Workload | Parallelism |
|---|---|
| Single-machine, CPU-bound | multiprocessing / Ray |
| Single-GPU dense ops | CUDA / JAX (SIMT on GPU) |
| Multi-GPU, same node | DDP / FSDP over NCCL |
| Multi-node training | DP×TP×PP (Megatron, DeepSpeed) |
| Long context (128K+) | + Sequence Parallel |
| MoE model | + Expert Parallel |
| Scientific HPC | MPI + domain decomposition |
**Default**: start with SIMD (NumPy/JAX), then move to GPU SIMT, then multi-GPU DDP, then 4D parallelism.
## 🔗 Graph
- Parent: [[Computer-Architecture]] · [[Distributed-Systems]]
- Variants: [[GPU-Computing]] · [[Distributed-Training]]
- Applications: [[LLM-Training]] · [[Scientific-Computing]] · [[vLLM]]
- Adjacent: [[Concurrency]] · [[Parallel-Processing]]
## 🤖 LLM Use
**When**: parallelism strategy selection, communication overhead analysis, NCCL/MPI debugging.
**When not**: inherently sequential algorithms, where Amdahl's law bounds the speedup and adding workers cannot help.
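
Amdahl's law makes the bound concrete: if a fraction `f` of the runtime can be parallelized over `N` workers, the speedup is `1 / ((1 - f) + f / N)`. A minimal sketch (the function name `amdahl_speedup` is chosen here for illustration):

```python
def amdahl_speedup(parallel_fraction, workers):
    """Upper bound on speedup when only part of the work can be parallelized."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)
```

Even with 90% of the work parallelizable, the speedup can never exceed 10x regardless of worker count, and a fully sequential algorithm stays at 1x.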
## ❌ Anti-patterns
- **Premature parallelization**: parallelizing blindly without first profiling the sequential version.
- **Communication-bound**: chunks that are too fine-grained, so communication overhead (e.g., NCCL) dominates compute.
- **Load imbalance**: uneven shard sizes create stragglers.
- **Race conditions**: shared state without synchronization.
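
As an illustration of the last item, the sketch below guards a shared counter with a lock so that concurrent read-modify-write updates cannot be lost. It uses Python threads only to keep the example small; `count_with_lock` is a hypothetical name, and the same principle applies to shared state in any threading model.

```python
import threading


def count_with_lock(num_threads=4, increments=10_000):
    """Increment a shared counter from several threads, serialized by a lock."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(increments):
            with lock:  # without the lock, `counter += 1` is not atomic
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter
```

The lock also illustrates the communication-bound anti-pattern: taking it on every increment is far too fine-grained, and accumulating locally before a single locked update would be the usual fix.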
## 🧪 Verification / Duplicates
- Verified (Hennessy & Patterson 6e; Megatron-LM paper 2019; Llama 3 paper 2024; CUDA C++ Programming Guide 12.x).
- Trust level: A.
- [[Parallel-Processing]] is an alias / redirect to this page.
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — Flynn + 4D DL parallelism + modern stack |