---
id: wiki-2026-0508-parallel-computing
title: Parallel Computing
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Parallel Processing, Concurrent Computing, HPC]
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: [hpc, parallelism, gpu, distributed]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python-cuda
  framework: jax-pytorch-mpi
---

# Parallel Computing

## One-line summary

> **"Run multiple computations simultaneously."** The topic spans Flynn's taxonomy (SISD/SIMD/MIMD), modern GPU SIMT, distributed clusters (MPI, NCCL), and the 4D parallelism of Llama 3.x 405B (DP/TP/PP/SP, where the Llama 3 paper calls the sequence dimension context parallelism). As of 2026, the default workloads (inference and training) are dominated by parallel execution rather than single-core sequential execution.

## Core concepts

### Flynn's taxonomy

- **SISD**: single instruction, single data. The classic single-core CPU.
- **SIMD**: single instruction, multiple data. Examples: AVX-512, and (loosely) a GPU warp.
- **MIMD**: multiple instruction, multiple data. Examples: multi-core CPUs, clusters.
- **SIMT**: single instruction, multiple threads. The execution model of NVIDIA and AMD GPUs; it is a later extension of the SIMD idea rather than one of Flynn's original classes.

### Parallelism dimensions (modern DL)

- **Data parallel (DP)**: the same model is replicated, and each replica processes different batches.
- **Tensor parallel (TP)**: a single tensor (for example, a weight matrix) is split across devices.
- **Pipeline parallel (PP)**: the layers are split into stages, one group of layers per device.
- **Sequence parallel (SP)**: the sequence dimension is split across devices (for long context).
- **Expert parallel (EP)**: the experts of an MoE model are placed on different devices.

### Applications

1. **LLM training**: Llama 3.x 405B combines DP×TP×PP×SP. EP applies only to MoE models, and the 405B model is dense.
2. **Inference**: vLLM uses continuous batching plus tensor parallelism.
3. **Scientific compute**: weather simulation, molecular dynamics (MPI).
4. **Rendering**: Pixar RenderMan, distributed rendering.

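How the parallelism dimensions listed above compose can be sketched as a mapping from a flat process rank to a coordinate in a DP×PP×TP grid. This is a minimal illustration, not the layout of any particular framework: the class name `ParallelLayout`, its methods, and the choice to let the TP index vary fastest are assumptions made here. TP-fastest is a common choice because it keeps the ranks of a tensor-parallel group adjacent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelLayout:
    """Hypothetical helper: sizes of the DP, PP, and TP dimensions."""
    dp: int
    pp: int
    tp: int

    @property
    def world_size(self) -> int:
        # Total number of processes is the product of all dimensions.
        return self.dp * self.pp * self.tp

    def coords(self, rank: int) -> tuple[int, int, int]:
        """Map a flat rank to (dp_idx, pp_idx, tp_idx), TP varying fastest."""
        if not 0 <= rank < self.world_size:
            raise ValueError(f"rank {rank} outside world of size {self.world_size}")
        tp_idx = rank % self.tp
        pp_idx = (rank // self.tp) % self.pp
        dp_idx = rank // (self.tp * self.pp)
        return dp_idx, pp_idx, tp_idx

    def tp_group(self, rank: int) -> list[int]:
        """Ranks that share this rank's DP and PP coordinates."""
        dp_idx, pp_idx, _ = self.coords(rank)
        base = (dp_idx * self.pp + pp_idx) * self.tp
        return list(range(base, base + self.tp))


layout = ParallelLayout(dp=2, pp=2, tp=4)
print(layout.world_size)    # 16
print(layout.coords(13))    # (1, 1, 1)
print(layout.tp_group(13))  # [12, 13, 14, 15]
```

With 16 processes, rank 13 belongs to data-parallel replica 1, pipeline stage 1, and tensor shard 1, and it exchanges tensor-parallel traffic only with ranks 12 to 15. Adding an SP or EP dimension would extend the same index arithmetic by one more factor.
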
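## 💻 Patterns

### NumPy → JAX SIMD vectorization

```python
# Implicit SIMD on CPU/GPU/TPU: XLA vectorizes the compiled computation
import jax
import jax.numpy as jnp

@jax.jit
def matmul_vectorized(A, B):
    # Batched matmul: (b, i, j) x (b, j, k) -> (b, i, k)
    return jnp.einsum("bij,bjk->bik", A, B)

# Example inputs (shapes are illustrative)
A = jnp.ones((8, 4, 3))
B = jnp.ones((8, 3, 5))

# vmap: auto-vectorize a per-example matmul over the batch dim
batched = jax.vmap(lambda x, y: x @ y)(A, B)  # same result as matmul_vectorized(A, B)
```

### CUDA kernel (SIMT)

```cpp
// Explicit thread-level parallelism: one thread per element
__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) c[idx] = a[idx] + b[idx];  // guard the tail when n is not a multiple of the block size
}
// launch: vec_add<<<(n + 255) / 256, 256>>>(a, b, c, n);
```

### Multi-GPU data parallel (PyTorch)

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Launched with torchrun, which sets LOCAL_RANK for each process.
# `model`, `loader`, and `optim` are assumed to be defined elsewhere.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
model = DDP(model.cuda(local_rank), device_ids=[local_rank])

for batch in loader:
    optim.zero_grad()
    loss = model(batch).loss  # assumes the model output exposes `.loss`
    loss.backward()           # NCCL all-reduces gradients across ranks
    optim.step()
```

### Tensor parallel (Megatron-style)

```python
# A single Linear layer split column-wise across N GPUs
import torch
import torch.nn as nn
import torch.distributed as dist

class ColumnParallelLinear(nn.Module):
    def __init__(self, d_in, d_out, world_size):
        super().__init__()
        self.world_size = world_size
        self.weight = nn.Parameter(torch.empty(d_out // world_size, d_in))
        nn.init.xavier_uniform_(self.weight)  # torch.empty leaves values uninitialized

    def forward(self, x):
        local_out = x @ self.weight.T  # (..., d_out // world_size)
        # Forward-only gather: dist.all_gather does not propagate gradients
        shards = [torch.empty_like(local_out) for _ in range(self.world_size)]
        dist.all_gather(shards, local_out)
        return torch.cat(shards, dim=-1)
```

### MPI scientific compute

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Domain decomposition: each rank takes one contiguous slab of the grid
global_grid = np.arange(1_000_000, dtype=np.float64)  # placeholder data
local_data = np.array_split(global_grid, size)[rank]
local_result = float(np.sum(local_data ** 2))  # placeholder for the per-rank compute step
global_result = comm.allreduce(local_result, op=MPI.SUM)
```

### Pipeline parallel (GPipe-style schedule)

```python
# Sequential simulation of the order in which a GPipe-style schedule runs work
def pipeline_step(stages, micro_batches):
    """Fill-drain schedule: run every forward, then every backward in reverse.

    1F1B differs: once the pipeline is full, each stage alternates one forward
    with one backward to cap activation memory. `stages` is assumed to expose
    hypothetical `forward(x)` and `backward(x)` methods.
    """
    fwd_queue = []
    for mb in micro_batches:
        for s, stage in enumerate(stages):
            mb = stage.forward(mb)
            fwd_queue.append((s, mb))
    for s, mb in reversed(fwd_queue):
        stages[s].backward(mb)
```

## Decision criteria

| Workload | Parallelism |
|---|---|
| Single-machine, CPU-bound | multiprocessing / Ray |
| Single-GPU dense ops | CUDA / JAX (SIMT) |
| Multi-GPU, same node | NCCL DDP / FSDP |
| Multi-node training | DP×TP×PP (Megatron, DeepSpeed) |
| Long context (128K+) | + Sequence Parallel |
| MoE model | + Expert Parallel |
| Scientific HPC | MPI + domain decomposition |

**Default**: start with SIMD (NumPy/JAX) → GPU SIMT → multi-GPU DDP → 4D parallelism, moving to the next step only when the current one is the bottleneck.

## 🔗 Graph

- Parent: [[Distributed-Systems]]
- Variant: [[Distributed-Training]]
- Application: [[LLM_Optimization_and_Deployment_Strategies|vLLM]]
- Adjacent: [[Concurrency]] · [[Parallel-Computing|Parallel-Processing]]

## 🤖 LLM usage

**When to use**: selecting a parallelism strategy, analyzing communication overhead, debugging NCCL/MPI.

**When not to use**: inherently sequential algorithms, where Amdahl's law caps the achievable speedup.

## ❌ Anti-patterns

- **Premature parallelization**: parallelizing blindly without first profiling the sequential version.
- **Communication-bound**: chunks that are too fine-grained, so NCCL overhead dominates the compute.
- **Load imbalance**: uneven shard sizes, which produce stragglers.
- **Race conditions**: shared state accessed without synchronization.

## 🧪 Verification / duplicates

- Verified (Hennessy & Patterson 6e; Megatron-LM paper 2019; Llama 3 paper 2024; CUDA C++ Programming Guide 12.x).
- Trust level A.
- [[Parallel-Computing|Parallel-Processing]] is an alias / redirect.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: Flynn + 4D DL parallelism + modern stack |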