---
id: wiki-2026-0508-parallel-computing
title: Parallel Computing
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Parallel Processing, Concurrent Computing, HPC]
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: [hpc, parallelism, gpu, distributed]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python-cuda
  framework: jax-pytorch-mpi
---

# Parallel Computing

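## One-line summary
> **"Run multiple computations at the same time."** The topic spans Flynn's taxonomy (SISD/SIMD/MISD/MIMD), modern GPU SIMT execution, distributed clusters (MPI, NCCL), and the 4D parallelism used to train Llama 3.x 405B (DP/TP/PP plus a sequence-dimension split, which the Llama 3 paper calls context parallelism). For the default workloads of 2026, training and inference, parallel execution far outweighs single-core sequential execution.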

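## Core concepts

### Flynn's taxonomy
- **SISD**: single instruction, single data. The classic single-core scalar CPU.
- **SIMD**: single instruction, multiple data. AVX-512 vector units; a GPU warp behaves similarly.
- **MISD**: multiple instruction, single data. The fourth class in the taxonomy, rarely seen in practice.
- **MIMD**: multiple instruction, multiple data. Multi-core CPUs, clusters.
- **SIMT**: single instruction, multiple threads. The execution model of NVIDIA and AMD GPUs; a later extension of SIMD rather than one of Flynn's original classes.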

### Parallelism dimensions (modern DL)
- **Data parallel (DP)**: the same model replicated on every device, each replica processing a different batch.
- **Tensor parallel (TP)**: a single tensor (for example a weight matrix) split across devices.
- **Pipeline parallel (PP)**: layers split into stages.
- **Sequence parallel (SP)**: the sequence dimension split across devices (long context).
- **Expert parallel (EP)**: MoE experts distributed across devices.
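
When several of these dimensions are combined, the world size is their product and every global rank gets one coordinate per dimension. The sketch below maps a rank to (TP, DP, PP) coordinates. The ordering (TP varying fastest so that TP peers sit on adjacent ranks, typically inside one node) is one common convention assumed here, and the function names are hypothetical.

```python
def rank_to_coords(rank, tp, dp, pp):
    """Map a global rank to (tp, dp, pp) coordinates.

    Assumed layout: TP varies fastest, then DP, then PP.
    """
    world_size = tp * dp * pp
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} outside world of size {world_size}")
    return {
        "tp": rank % tp,
        "dp": (rank // tp) % dp,
        "pp": rank // (tp * dp),
    }


def tp_group(rank, tp, dp, pp):
    """Global ranks that share this rank's DP and PP coordinates."""
    rank_to_coords(rank, tp, dp, pp)  # validates the rank
    first = (rank // tp) * tp
    return list(range(first, first + tp))


if __name__ == "__main__":
    # 16 devices factored as TP=2 x DP=4 x PP=2
    for r in (0, 1, 2, 8, 15):
        print(r, rank_to_coords(r, tp=2, dp=4, pp=2))
```

Keeping TP peers contiguous matters because tensor parallelism communicates on every layer, so it benefits most from the fastest interconnect.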

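### Applications
1. **LLM training**: Llama 3.x 405B combines DP × TP × PP with a sequence-dimension (context-parallel) split. It is a dense model, so EP applies only to MoE models.
2. **Inference**: vLLM combines continuous batching with tensor parallelism.
3. **Scientific compute**: weather simulation, molecular dynamics (MPI).
4. **Rendering**: distributed rendering with Pixar RenderMan.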

## 💻 Patterns

### NumPy → JAX SIMD vectorization
```python
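# Implicit data parallelism: XLA compiles this to vectorized code on CPU/GPU/TPU
import jax
import jax.numpy as jnp

# Example inputs (shapes are illustrative): a batch of 8 matrix pairs
A = jnp.ones((8, 4, 5))
B = jnp.ones((8, 5, 3))

@jax.jit
def matmul_vectorized(A, B):
    # Batched matmul: (b, i, j) x (b, j, k) -> (b, i, k)
    return jnp.einsum("bij,bjk->bik", A, B)

# vmap: auto-vectorize a per-example function over the leading batch dim
batched = jax.vmap(lambda x, y: x @ y)(A, B)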
```

### CUDA kernel (SIMT)
```cpp
// Explicit thread-level parallelism: one thread per output element
__global__ void vec_add(float* a, float* b, float* c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) c[idx] = a[idx] + b[idx];
}

// launch: vec_add<<<(n+255)/256, 256>>>(a, b, c, n);
```

### Multi-GPU data parallel (PyTorch)
```python
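import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes launch via torchrun, which sets LOCAL_RANK for each process
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# `model`, `loader`, and `optim` are assumed to be defined by the caller;
# in practice, build `optim` from the wrapped model's parameters
model = DDP(model.cuda(local_rank), device_ids=[local_rank])

for batch in loader:
    optim.zero_grad()  # clear gradients left over from the previous step
    loss = model(batch).loss
    loss.backward()  # DDP all-reduces (averages) gradients over NCCL here
    optim.step()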
```
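
The gradient all-reduce in the loop above can be pictured as a ring: a reduce-scatter pass followed by an all-gather pass, each taking N-1 steps for N workers. NCCL selects among several algorithms, so the following is only an illustration of the ring variant, written in pure Python with lists standing in for GPU buffers. The function name and the chunk layout are assumptions made for this sketch.

```python
def ring_all_reduce(worker_chunks):
    """Sum-all-reduce over a simulated ring.

    worker_chunks[r][c] is chunk c (a list of numbers) held by worker r.
    Each of the n workers must hold exactly n chunks.
    Returns a new structure where every worker holds the summed chunks.
    """
    n = len(worker_chunks)
    if any(len(w) != n for w in worker_chunks):
        raise ValueError("each worker needs exactly n chunks")
    data = [[list(chunk) for chunk in w] for w in worker_chunks]

    # Reduce-scatter: at each step, worker r sends chunk (r - step) % n to
    # its right neighbour, which accumulates it. After n - 1 steps,
    # worker r owns the fully reduced chunk (r + 1) % n.
    for step in range(n - 1):
        sends = [(r, (r - step) % n) for r in range(n)]
        payloads = [list(data[r][c]) for r, c in sends]  # snapshot before writes
        for (r, c), payload in zip(sends, payloads):
            dst = (r + 1) % n
            data[dst][c] = [x + y for x, y in zip(data[dst][c], payload)]

    # All-gather: circulate the reduced chunks so every worker has all of them.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n) for r in range(n)]
        payloads = [list(data[r][c]) for r, c in sends]
        for (r, c), payload in zip(sends, payloads):
            data[(r + 1) % n][c] = payload

    return data


if __name__ == "__main__":
    grads = [
        [[1, 2], [3, 4], [5, 6]],
        [[10, 20], [30, 40], [50, 60]],
        [[100, 200], [300, 400], [500, 600]],
    ]
    print(ring_all_reduce(grads)[0])
```

Each worker sends and receives only one chunk per step, which is why per-worker traffic stays roughly constant as the ring grows. DDP additionally divides the sum by the world size to obtain the average gradient.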

### Tensor parallel (megatron-style)
```python
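# A single Linear layer split column-wise (output dim) across N GPUs
import torch
import torch.nn as nn
import torch.distributed as dist

class ColumnParallelLinear(nn.Module):
    def __init__(self, d_in, d_out, world_size):
        super().__init__()
        self.world_size = world_size
        # Each rank owns d_out // world_size output features (left uninitialized here)
        self.weight = nn.Parameter(torch.empty(d_out // world_size, d_in))
    def forward(self, x):
        local_out = x @ self.weight.T
        # Gather output shards across the TP group (default process group assumed).
        # Forward-only sketch: dist.all_gather does not propagate gradients.
        shards = [torch.empty_like(local_out) for _ in range(self.world_size)]
        dist.all_gather(shards, local_out)
        return torch.cat(shards, dim=-1)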
```

### MPI scientific compute
```python
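import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Domain decomposition: rank 0 splits the global grid into one chunk per rank
# (the grid size and the sum-of-squares step are illustrative placeholders)
chunks = np.array_split(np.arange(1000.0), size) if rank == 0 else None
local_data = comm.scatter(chunks, root=0)

local_result = float(np.sum(local_data ** 2))  # per-rank compute step
global_result = comm.allreduce(local_result, op=MPI.SUM)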
```

### Pipeline parallel (micro-batch schedule)
```python
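# GPipe-style schedule: all forwards, then all backwards.
# (1F1B instead interleaves one forward with one backward per stage in steady state.)
def pipeline_step(stages, micro_batches):
    """Run every micro-batch forward through all stages, then backward in reverse.

    Serial sketch: a real pipeline overlaps stages on different devices.
    `stage.forward` / `stage.backward` are assumed stage-object methods.
    """
    fwd_queue = []
    for mb in micro_batches:
        for s, stage in enumerate(stages):
            mb = stage.forward(mb)
            fwd_queue.append((s, mb))  # record (stage index, stage output)
    # Backward visits stages in reverse order of the forward passes
    for s, mb in reversed(fwd_queue):
        stages[s].backward(mb)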
```
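
The cost of a pipeline schedule is usually summarized by its bubble: the share of time stages sit idle while the pipeline fills and drains. For a GPipe-style schedule with `p` stages and `m` micro-batches, assuming every stage takes the same time per micro-batch, the idle fraction is `(p - 1) / (m + p - 1)`. A minimal sketch (the function name is hypothetical):

```python
from fractions import Fraction


def pipeline_bubble_fraction(num_stages, num_micro_batches):
    """Idle fraction of a GPipe-style schedule.

    Assumes uniform per-stage, per-micro-batch time and ignores
    communication cost.
    """
    if num_stages < 1 or num_micro_batches < 1:
        raise ValueError("need at least one stage and one micro-batch")
    return Fraction(num_stages - 1, num_micro_batches + num_stages - 1)


if __name__ == "__main__":
    for m in (4, 8, 32):
        frac = pipeline_bubble_fraction(4, m)
        print(f"p=4, m={m}: bubble = {frac} ({float(frac):.1%})")
```

More micro-batches shrink the bubble but increase the activations held in flight. Under the same assumptions a non-interleaved 1F1B schedule has the same bubble; its benefit is a lower peak number of in-flight activations.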

## Decision criteria
| Workload | Parallelism |
|---|---|
| Single-machine, CPU-bound | multiprocessing / Ray |
| Single-GPU dense ops | CUDA / JAX SIMT |
| Multi-GPU, same node | NCCL DDP / FSDP |
| Multi-node training | DP×TP×PP (Megatron, DeepSpeed) |
| Long context (128K+) | + Sequence Parallel |
| MoE model | + Expert Parallel |
| Scientific HPC | MPI + domain decomposition |

**Default**: start with SIMD vectorization (NumPy/JAX), then move to GPU SIMT, then multi-GPU DDP, and only then to 4D parallelism.

## 🔗 Graph
- Parent: [[Distributed-Systems]]
- Variant: [[Distributed-Training]]
- Application: [[LLM_Optimization_and_Deployment_Strategies|vLLM]]
- Adjacent: [[Concurrency]] · [[Parallel-Computing|Parallel-Processing]]

## 🤖 LLM usage
**When to use**: parallelism strategy selection, communication overhead analysis, NCCL/MPI debugging.
**When not to use**: inherently sequential algorithms, where Amdahl's law bounds the achievable speedup.
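
Amdahl's law makes that bound concrete: if a fraction `p` of the runtime can be parallelized over `n` workers, the speedup is `1 / ((1 - p) + p / n)`, which can never exceed `1 / (1 - p)`. A small sketch with hypothetical function names:

```python
def amdahl_speedup(parallel_fraction, workers):
    """Speedup predicted by Amdahl's law for a fixed problem size."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if workers < 1:
        raise ValueError("workers must be >= 1")
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / workers)


def amdahl_limit(parallel_fraction):
    """Upper bound on speedup as the worker count grows without limit."""
    if parallel_fraction >= 1.0:
        return float("inf")
    return 1.0 / (1.0 - parallel_fraction)


if __name__ == "__main__":
    for n in (2, 8, 64, 1024):
        print(f"p=0.90, n={n}: {amdahl_speedup(0.90, n):.2f}x")
    print(f"limit for p=0.90: {amdahl_limit(0.90):.1f}x")
```

With 90% of the work parallelizable, no amount of hardware yields more than a 10x speedup, which is why profiling the serial fraction comes before choosing a parallelism strategy.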

## ❌ Anti-patterns
- **Premature parallelization**: parallelizing blindly without first profiling the sequential code.
- **Communication-bound**: chunks so fine-grained that NCCL communication overhead dominates compute.
- **Load imbalance**: uneven shard sizes produce stragglers.
- **Race conditions**: shared state accessed without synchronization.
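
For the load-imbalance case, a simple guard is to compute shard boundaries so that sizes differ by at most one item. The helper below is a hypothetical illustration. It assumes every item costs about the same to process, which does not hold for workloads such as variable-length sequences.

```python
def shard_bounds(n_items, n_shards):
    """Return (start, end) index pairs whose sizes differ by at most 1."""
    if n_shards < 1:
        raise ValueError("n_shards must be >= 1")
    if n_items < 0:
        raise ValueError("n_items must be >= 0")
    base, extra = divmod(n_items, n_shards)
    bounds = []
    start = 0
    for i in range(n_shards):
        # The first `extra` shards each take one leftover item
        size = base + (1 if i < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


if __name__ == "__main__":
    print(shard_bounds(10, 3))
```

A naive split of `ceil(n / k)` items per shard can leave the last worker nearly empty, while this layout keeps every worker within one item of the others.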

## 🧪 Verification / duplicates
- Verified (Hennessy & Patterson 6e; Megatron-LM paper 2019; Llama 3 paper 2024; CUDA C++ Programming Guide 12.x).
- Trust level: A.
- [[Parallel-Computing|Parallel-Processing]] is an alias / redirect.

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — Flynn + 4D DL parallelism + modern stack |