2nd/10_Wiki/Topics/Other/Parallel-Computing.md
Antigravity Agent f8b21af4be Wiki cleanup: error-doc removal, dedup merge, link normalization
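Large-scale cleanup of 10_Wiki/Topics:
- Removed 227 error-capture / unfinished stub documents
- Merged 43 clusters of cross-folder duplicates (63 files → redirect)
- Normalized link names: fixed broken links, pointed redirects directly at their targets, and mapped concepts (~2,400 cases)
- Created 6 new category MOCs
- Removed 10,058 unresolved related-keyword links from Graph sections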

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 23:52:15 +09:00


id: wiki-2026-0508-parallel-computing
title: Parallel Computing
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: Parallel Processing, Concurrent Computing, HPC
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: hpc, parallelism, gpu, distributed
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language: python-cuda; framework: jax-pytorch-mpi

Parallel Computing

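One-line summary

"Executing multiple computations simultaneously." The topic runs from Flynn's taxonomy (SISD/SIMD/MIMD) through modern GPU SIMT and distributed clusters (MPI, NCCL) to the 4D parallelism of Llama 3.x 405B (DP/TP/PP/SP; the Llama 3 paper calls its sequence-dimension split context parallelism, CP). As of 2026, the default workloads, inference and training, are parallel and far outweigh single-core sequential execution.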

Key points

Flynn's taxonomy

  • SISD: single instruction, single data. The classic single-core CPU.
  • SIMD: single instruction, multiple data. Examples: AVX-512, a GPU warp.
  • MIMD: multiple instruction, multiple data. Examples: multi-core CPUs, clusters.
  • SIMT: single instruction, multiple threads. The execution model of NVIDIA / AMD GPUs (a later extension, not one of Flynn's original classes).

Parallelism dimensions (modern DL)

  • Data parallel (DP): the same model replicated, each replica processing different batches.
  • Tensor parallel (TP): a single tensor split across devices.
  • Pipeline parallel (PP): layers split into stages.
  • Sequence parallel (SP): the sequence dimension split (long context).
  • Expert parallel (EP): MoE experts distributed across devices.
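
To make the way these dimensions compose concrete, here is a minimal sketch that maps a flat process rank onto a (DP, PP, TP) grid. The function names `rank_to_coords` and `tp_group` and the layout (TP index varying fastest, so tensor-parallel peers sit on adjacent ranks) are assumptions for illustration, not the API of any framework.

```python
def rank_to_coords(rank, dp, pp, tp):
    """Map a flat rank to (dp_idx, pp_idx, tp_idx).

    Assumed layout: TP varies fastest, then PP, then DP.
    """
    if not 0 <= rank < dp * pp * tp:
        raise ValueError("rank out of range for the given grid")
    tp_idx = rank % tp
    pp_idx = (rank // tp) % pp
    dp_idx = rank // (tp * pp)
    return dp_idx, pp_idx, tp_idx


def tp_group(rank, dp, pp, tp):
    """Ranks sharing this rank's DP and PP index: its tensor-parallel peers."""
    dp_idx, pp_idx, _ = rank_to_coords(rank, dp, pp, tp)
    base = (dp_idx * pp + pp_idx) * tp
    return list(range(base, base + tp))


if __name__ == "__main__":
    # 8 processes arranged as DP=2, PP=2, TP=2
    for r in range(8):
        print(r, rank_to_coords(r, 2, 2, 2), tp_group(r, 2, 2, 2))
```

The world size is the product of the dimension sizes, so adding a dimension multiplies the number of devices rather than adding to it.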

Applications

  1. LLM training: Llama 3.x 405B combines DP×TP×PP×SP; it is a dense model, so EP applies only to MoE models.
  2. Inference: vLLM uses continuous batching plus tensor parallelism.
  3. Scientific compute: weather simulation, molecular dynamics (MPI).
  4. Rendering: Pixar RenderMan, distributed rendering.

💻 Patterns

NumPy → JAX SIMD vectorization

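# Implicit SIMD on CPU/GPU/TPU: XLA vectorizes the compiled function
import jax
import jax.numpy as jnp

@jax.jit
def matmul_vectorized(A, B):
    # Batched matmul: (b, i, j) x (b, j, k) -> (b, i, k)
    return jnp.einsum("bij,bjk->bik", A, B)

# Example inputs (shapes chosen for illustration only)
A = jnp.ones((8, 4, 5))
B = jnp.ones((8, 5, 3))

# vmap: auto-vectorize a per-example matmul over the batch dim
batched = jax.vmap(lambda x, y: x @ y)(A, B)  # shape (8, 4, 3)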

CUDA kernel (SIMT)

// Explicit thread-level parallelism: one thread per output element
__global__ void vec_add(float* a, float* b, float* c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) c[idx] = a[idx] + b[idx];
}

// launch: vec_add<<<(n+255)/256, 256>>>(a, b, c, n);
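
The index arithmetic and the launch configuration above can be checked without a GPU. The following is a sequential Python emulation of the grid, assuming a 1D launch; `launch_config` and `emulate_vec_add` are hypothetical helper names, and the loops only mimic which (block, thread) pair touches which element, not the actual parallel execution.

```python
def launch_config(n, threads_per_block=256):
    """Ceiling division: enough blocks so that blocks * threads >= n."""
    blocks = (n + threads_per_block - 1) // threads_per_block
    return blocks, threads_per_block


def emulate_vec_add(a, b, threads_per_block=256):
    """Visit every (block, thread) pair in turn, as the kernel grid would."""
    n = len(a)
    c = [0.0] * n
    blocks, threads = launch_config(n, threads_per_block)
    for block_idx in range(blocks):
        for thread_idx in range(threads):
            idx = block_idx * threads + thread_idx
            if idx < n:  # same bounds guard as the kernel
                c[idx] = a[idx] + b[idx]
    return c


if __name__ == "__main__":
    print(launch_config(1000))
    print(emulate_vec_add([1.0, 2.0, 3.0], [10.0, 20.0, 30.0], threads_per_block=2))
```

The bounds guard matters because the last block usually has threads whose index falls past the end of the arrays.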

Multi-GPU data parallel (PyTorch)

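import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes launch via torchrun, which sets LOCAL_RANK for each process
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
# model, loader, and optim are assumed to be defined elsewhere
model = DDP(model.cuda(), device_ids=[local_rank])

for batch in loader:
    optim.zero_grad()
    loss = model(batch).loss
    loss.backward()  # NCCL all-reduce averages gradients across ranks
    optim.step()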
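
The gradient all-reduce in the loop above can be pictured with a pure-Python simulation of the classic ring algorithm (a reduce-scatter phase followed by an all-gather phase). This is a sketch for intuition only: `ring_all_reduce` is a hypothetical name, workers are plain lists, and it is not NCCL's implementation. It returns the element-wise sum on every worker; DDP additionally divides by the world size to obtain the average.

```python
def ring_all_reduce(vectors):
    """Simulate a ring all-reduce (sum) over p workers with equal-length vectors."""
    p = len(vectors)
    n = len(vectors[0])
    if n % p != 0:
        raise ValueError("vector length must be divisible by the worker count")
    size = n // p
    data = [list(v) for v in vectors]  # one private copy per worker

    def chunk(c):
        return range(c * size, (c + 1) * size)

    # Phase 1, reduce-scatter: after p-1 steps worker r owns the full sum of chunk (r+1) % p
    for step in range(p - 1):
        sends = [((r - step) % p, [data[r][i] for i in chunk((r - step) % p)])
                 for r in range(p)]  # snapshot so all sends happen "simultaneously"
        for r, (c, payload) in enumerate(sends):
            dst = (r + 1) % p
            for i, v in zip(chunk(c), payload):
                data[dst][i] += v

    # Phase 2, all-gather: circulate the finished chunks around the ring
    for step in range(p - 1):
        sends = [((r + 1 - step) % p, [data[r][i] for i in chunk((r + 1 - step) % p)])
                 for r in range(p)]
        for r, (c, payload) in enumerate(sends):
            dst = (r + 1) % p
            for i, v in zip(chunk(c), payload):
                data[dst][i] = v
    return data


if __name__ == "__main__":
    grads = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
    print(ring_all_reduce(grads))
```

Each worker sends and receives only one chunk per step, which is why the per-worker traffic stays roughly constant as the ring grows.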

Tensor parallel (megatron-style)

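import torch
import torch.nn as nn
import torch.distributed as dist

# A single Linear layer split column-wise (along d_out) across N GPUs
class ColumnParallelLinear(nn.Module):
    def __init__(self, d_in, d_out, world_size):
        super().__init__()
        self.world_size = world_size
        self.weight = nn.Parameter(torch.empty(d_out // world_size, d_in))
        nn.init.xavier_uniform_(self.weight)  # torch.empty leaves values uninitialized
    def forward(self, x):
        local_out = x @ self.weight.T
        # Gather shards over the default group (forward-only: dist.all_gather is not autograd-aware)
        parts = [torch.empty_like(local_out) for _ in range(self.world_size)]
        dist.all_gather(parts, local_out)
        return torch.cat(parts, dim=-1)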
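
Why the column-wise split gives the same answer as the unsplit layer can be verified with plain lists: each device computes the output columns that belong to its weight shard, and concatenating the shards along the last dimension reproduces the full product. The sketch below assumes the weight is stored transposed as `[d_in][d_out]`; `matmul` and `column_parallel_matmul` are illustrative helpers, with a loop standing in for the separate devices.

```python
def matmul(x, w_t):
    """x: [rows][d_in], w_t: [d_in][d_out] -> [rows][d_out]."""
    d_out = len(w_t[0])
    return [[sum(xv * w_t[k][j] for k, xv in enumerate(row)) for j in range(d_out)]
            for row in x]


def column_parallel_matmul(x, w_t, world_size):
    """Split w_t by output columns, multiply per shard, concatenate the results."""
    d_out = len(w_t[0])
    if d_out % world_size != 0:
        raise ValueError("d_out must be divisible by world_size")
    shard = d_out // world_size
    partial_outputs = []
    for rank in range(world_size):  # each iteration plays the role of one device
        w_shard = [w_row[rank * shard:(rank + 1) * shard] for w_row in w_t]
        partial_outputs.append(matmul(x, w_shard))
    # The all-gather step: concatenate shard outputs along the last dimension
    return [sum((out[i] for out in partial_outputs), []) for i in range(len(x))]


if __name__ == "__main__":
    x = [[1, 2], [3, 4]]
    w_t = [[1, 0, 2, 1], [0, 1, 1, 3]]
    print(matmul(x, w_t))
    print(column_parallel_matmul(x, w_t, world_size=2))
```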

MPI scientific compute

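from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Domain decomposition: rank 0 splits the grid, every rank receives one chunk
global_grid = np.arange(1_000_000, dtype=np.float64) if rank == 0 else None
chunks = np.array_split(global_grid, size) if rank == 0 else None
local_data = comm.scatter(chunks, root=0)
# Placeholder for the real per-domain computation
local_result = float(np.sum(local_data ** 2))
global_result = comm.allreduce(local_result, op=MPI.SUM)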
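
The decomposition step can also be computed locally on each rank instead of scattering from rank 0. A minimal sketch, assuming a 1D domain of `n` cells: `block_range` (a hypothetical helper) gives each rank a contiguous slice whose size differs by at most one cell from any other, and summing the per-rank partial results mirrors what the all-reduce with `MPI.SUM` produces.

```python
def block_range(n, size, rank):
    """Half-open [start, stop) slice of an n-cell domain owned by `rank` of `size`."""
    base, extra = divmod(n, size)
    start = rank * base + min(rank, extra)  # the first `extra` ranks get one more cell
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def simulated_allreduce_sum(values, size):
    """Serial stand-in for allreduce: every rank sums its slice, then partials are added."""
    partials = []
    for rank in range(size):
        start, stop = block_range(len(values), size, rank)
        partials.append(sum(values[start:stop]))
    return sum(partials)


if __name__ == "__main__":
    print([block_range(10, 3, r) for r in range(3)])
    print(simulated_allreduce_sum(list(range(10)), 3))
```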

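Pipeline parallel (GPipe / 1F1B)

# GPipe-style schedule; stage objects with forward()/backward() are placeholders
def pipeline_step(stages, micro_batches):
    """Run every micro-batch forward through all stages, then backward in reverse.

    1F1B instead alternates one forward and one backward per stage once the
    pipeline is full, which caps the number of activations held in memory.
    """
    fwd_queue = []
    for mb in micro_batches:
        for s, stage in enumerate(stages):
            mb = stage.forward(mb)
            fwd_queue.append((s, mb))
    for s, mb in reversed(fwd_queue):
        stages[s].backward(mb)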
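
How many micro-batches a pipeline needs can be estimated with the standard idealized bubble formula, (p - 1) / (m + p - 1) for p stages and m micro-batches. The sketch below assumes every stage takes the same time per micro-batch and ignores communication; `bubble_fraction` and `peak_in_flight` are illustrative names. The second function reflects that non-interleaved 1F1B has the same bubble as GPipe but bounds the activations stored on the first stage by the stage count rather than the micro-batch count.

```python
from fractions import Fraction


def bubble_fraction(num_stages, num_micro_batches):
    """Idle share of the pipeline's time under the equal-cost assumption."""
    p, m = num_stages, num_micro_batches
    if p < 1 or m < 1:
        raise ValueError("need at least one stage and one micro-batch")
    return Fraction(p - 1, m + p - 1)


def peak_in_flight(num_stages, num_micro_batches, schedule):
    """Micro-batches whose activations the first stage holds at the peak."""
    if schedule == "gpipe":
        return num_micro_batches
    if schedule == "1f1b":
        return min(num_stages, num_micro_batches)
    raise ValueError("schedule must be 'gpipe' or '1f1b'")


if __name__ == "__main__":
    for m in (4, 8, 32):
        print(m, float(bubble_fraction(4, m)),
              peak_in_flight(4, m, "gpipe"), peak_in_flight(4, m, "1f1b"))
```

Raising the number of micro-batches shrinks the bubble, at the cost of more stored activations under the GPipe-style schedule.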

Decision criteria

| Workload | Parallelism |
|---|---|
| Single-machine, CPU-bound | multiprocessing / Ray |
| Single-GPU dense ops | CUDA / JAX SIMT |
| Multi-GPU, same node | NCCL DDP / FSDP |
| Multi-node training | DP×TP×PP (Megatron, DeepSpeed) |
| Long context (128K+) | + Sequence Parallel |
| MoE model | + Expert Parallel |
| Scientific HPC | MPI + domain decomposition |

Default progression: start with SIMD (NumPy/JAX), then GPU SIMT, then multi-GPU DDP, then 4D parallelism.

🔗 Graph

🤖 LLM usage

When to use: parallelism strategy selection, communication overhead analysis, NCCL/MPI debugging. When not to use: inherently sequential algorithms, where Amdahl's law bounds the achievable speedup.
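
The Amdahl bound can be made concrete with a small calculation. With a parallelizable fraction f of the runtime and n workers, the speedup is 1 / ((1 - f) + f / n), which approaches 1 / (1 - f) as n grows. The function names below are illustrative.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Speedup predicted by Amdahl's law for a given parallel fraction and worker count."""
    if not 0.0 <= parallel_fraction <= 1.0 or workers < 1:
        raise ValueError("need 0 <= parallel_fraction <= 1 and workers >= 1")
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / workers)


def amdahl_limit(parallel_fraction):
    """Upper bound on speedup as the worker count goes to infinity."""
    if parallel_fraction >= 1.0:
        return float("inf")
    return 1.0 / (1.0 - parallel_fraction)


if __name__ == "__main__":
    for n in (2, 8, 64, 1024):
        print(n, round(amdahl_speedup(0.95, n), 2))
    print("limit:", round(amdahl_limit(0.95), 2))
```

Even with 95% of the work parallelizable, the serial 5% caps the speedup at about 20x no matter how many workers are added.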

Anti-patterns

  • Premature parallelization: parallelizing blindly without first profiling the sequential version.
  • Communication-bound: chunks too fine-grained, so NCCL overhead dominates compute.
  • Load imbalance: uneven shard sizes lead to stragglers.
  • Race conditions: shared state accessed without synchronization.
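
For the last item, the usual remedy is to guard the shared state with a lock. The sketch below uses Python threads as a stand-in for any shared-memory workers; `count_with_lock` is an illustrative name. The `total += 1` update is a read-modify-write sequence, so without the lock concurrent threads could interleave and lose increments.

```python
import threading


def count_with_lock(num_threads=4, increments=10_000):
    """Increment one shared counter from several threads, serialized by a lock."""
    total = 0
    lock = threading.Lock()

    def worker():
        nonlocal total
        for _ in range(increments):
            with lock:  # only one thread at a time performs the read-modify-write
                total += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


if __name__ == "__main__":
    print(count_with_lock())
```

Locking on every increment is itself a fine-grained synchronization cost; accumulating a private count per thread and combining once at the end avoids both the race and most of the contention.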

🧪 Verification / duplicates

  • Verified (Hennessy & Patterson 6e; Megatron-LM paper 2019; Llama 3 paper 2024; CUDA C++ Programming Guide 12.x).
  • Trust level: A.
  • Parallel-Computing: alias / redirect.

🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: Flynn + 4D DL parallelism + modern stack |