"Executing multiple computations simultaneously." The scope runs from Flynn's taxonomy (SISD/SIMD/MIMD) through modern GPU SIMT and distributed clusters (MPI, NCCL) to the 4D parallelism of Llama 3.x 405B (DP/TP/PP/CP; the Llama 3 paper calls its sequence-dimension split context parallelism). As of 2026 the default workloads, inference and training, are parallel and far outweigh single-core sequential execution.
Key points
Flynn's taxonomy
SISD: single instruction, single data (classic CPU).
SIMD: single instruction, multiple data (AVX-512, GPU warp).
MIMD: multiple instruction, multiple data (multi-core CPU, cluster).
SIMT: single instruction, multiple threads (NVIDIA / AMD GPUs); a later GPU-era addition rather than one of Flynn's original classes.
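The practical difference between lockstep SIMD/SIMT execution and independent MIMD threads shows up when lanes disagree on a branch. The following is a toy model only, assuming an 8-lane warp and a made-up kernel (`run_warp` and `WARP_SIZE` are illustrative names, not a real GPU API): when lanes diverge, the warp steps through both branch bodies with the inactive lanes masked off.

```python
# Toy model of one SIMT warp: all lanes share one instruction stream,
# and a per-lane mask decides which lanes commit results.
WARP_SIZE = 8  # illustrative; real NVIDIA warps have 32 lanes


def run_warp(values):
    """Apply `x * 2 if x is even else x + 1` across one warp.

    Returns (results, passes), where passes counts how many branch
    bodies the warp had to step through.
    """
    assert len(values) == WARP_SIZE
    results = list(values)
    even_mask = [v % 2 == 0 for v in values]
    odd_mask = [not m for m in even_mask]
    passes = 0
    # A branch body is issued once for the whole warp if any lane needs it.
    if any(even_mask):
        passes += 1
        for lane, active in enumerate(even_mask):
            if active:
                results[lane] = values[lane] * 2
    if any(odd_mask):
        passes += 1
        for lane, active in enumerate(odd_mask):
            if active:
                results[lane] = values[lane] + 1
    return results, passes


uniform_results, uniform_passes = run_warp([2, 4, 6, 8, 10, 12, 14, 16])
mixed_results, mixed_passes = run_warp([1, 2, 3, 4, 5, 6, 7, 8])
```

In this sketch the all-even input needs one pass while the mixed input needs two, which is the cost usually described as warp divergence.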
Parallelism dimensions (modern DL)
Data parallel (DP): the same model replicated, each replica on a different batch.
Tensor parallel (TP): a single tensor split across devices.
Pipeline parallel (PP): layers split into stages.
Sequence parallel (SP): split along the sequence dimension (long context).
Expert parallel (EP): MoE experts distributed across devices.
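When several of these dimensions are combined, every device needs a coordinate in each of them. A minimal sketch of one possible mapping, assuming a flat rank numbering with TP varying fastest (a common convention, since TP peers communicate most and benefit from sitting on adjacent ranks). The names `Coord`, `rank_to_coord`, and `tp_group` are hypothetical and the sizes are illustrative.

```python
from typing import NamedTuple


class Coord(NamedTuple):
    dp: int
    pp: int
    tp: int


def rank_to_coord(rank: int, dp: int, pp: int, tp: int) -> Coord:
    """Map a flat rank to (dp, pp, tp) indices, with TP varying fastest."""
    if not 0 <= rank < dp * pp * tp:
        raise ValueError("rank out of range")
    return Coord(dp=rank // (pp * tp), pp=(rank // tp) % pp, tp=rank % tp)


def tp_group(rank: int, tp: int) -> list:
    """Ranks that share this rank's dp and pp index (its tensor-parallel peers)."""
    base = rank - rank % tp
    return list(range(base, base + tp))


# 16 devices arranged as DP=2 x PP=2 x TP=4 (illustrative sizes)
coord = rank_to_coord(13, dp=2, pp=2, tp=4)
peers = tp_group(13, tp=4)
```

Here rank 13 lands in data-parallel replica 1, pipeline stage 1, tensor shard 1, and its TP peers are ranks 12 to 15.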
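Implicit SIMD (JAX)

    # Implicit SIMD on CPU/GPU/TPU
    import jax
    import jax.numpy as jnp

    # Example inputs: a batch of 8 matrices (shapes are illustrative)
    A = jnp.ones((8, 4, 3))
    B = jnp.ones((8, 3, 5))

    @jax.jit
    def matmul_vectorized(A, B):
        # Batched matmul: (b, i, j) x (b, j, k) -> (b, i, k)
        return jnp.einsum("bij,bjk->bik", A, B)

    # vmap: auto-vectorize a per-example matmul over the batch dim
    batched = jax.vmap(lambda x, y: x @ y)(A, B)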
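Data parallel (PyTorch DDP)

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    # Assumes a torchrun launch, which sets LOCAL_RANK; model, loader and
    # optim are defined elsewhere.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")
    model = DDP(model.cuda(), device_ids=[local_rank])

    for batch in loader:
        optim.zero_grad()
        loss = model(batch).loss
        loss.backward()  # NCCL all-reduces (averages) gradients across ranks
        optim.step()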
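What the gradient all-reduce buys can be checked without any GPU. The sketch below is a single-process simulation, not DDP itself: it assumes a scalar linear model with MSE loss and equal-sized shards, and `all_reduce_mean` is a hypothetical stand-in for the collective. Averaging the per-rank gradients then reproduces the full-batch gradient.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to the scalar weight w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def all_reduce_mean(values):
    """Stand-in for an all-reduce that averages one value per rank."""
    return sum(values) / len(values)


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]

world_size = 4
shard = len(xs) // world_size  # equal shards, so the mean of means is exact

# Each simulated rank computes a gradient on its own slice of the batch.
local_grads = [
    grad_mse(w, xs[r * shard:(r + 1) * shard], ys[r * shard:(r + 1) * shard])
    for r in range(world_size)
]
synced_grad = all_reduce_mean(local_grads)
full_grad = grad_mse(w, xs, ys)
```

With unequal shard sizes the plain mean of per-rank means would no longer match the full-batch gradient, which is one reason data loaders for data-parallel training usually keep per-rank batch sizes equal.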
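Tensor parallel (Megatron-style)

    import torch
    import torch.nn as nn
    import torch.distributed as dist

    # A single Linear layer split column-wise across N GPUs
    class ColumnParallelLinear(nn.Module):
        def __init__(self, d_in, d_out, world_size):
            super().__init__()
            self.world_size = world_size
            self.weight = nn.Parameter(torch.empty(d_out // world_size, d_in))
            nn.init.xavier_uniform_(self.weight)  # torch.empty is uninitialized

        def forward(self, x):
            local_out = x @ self.weight.T
            # Gather output shards across the TP group (default group here;
            # forward only, dist.all_gather does not backpropagate)
            shards = [torch.empty_like(local_out) for _ in range(self.world_size)]
            dist.all_gather(shards, local_out)
            return torch.cat(shards, dim=-1)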
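The reason a column split is safe can be shown with plain lists: each output column depends on only one weight column, so per-shard products concatenated side by side equal the full product. This is a single-process sketch under that assumption; the helper names are hypothetical, and the weight is stored as d_in x d_out here (the transpose of the layout in the layer above).

```python
def matmul(a, b):
    """Plain-Python matrix product of a (m x k) and b (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(w, parts):
    """Split matrix w (k x n) into `parts` equal column blocks."""
    width = len(w[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in w] for p in range(parts)]


def concat_columns(blocks):
    """Concatenate matrices side by side (the role all-gather plays)."""
    return [sum(rows, []) for rows in zip(*blocks)]


x = [[1, 2], [3, 4]]              # batch of 2, d_in = 2
w = [[1, 0, 2, 1], [0, 1, 1, 3]]  # d_in x d_out, d_out = 4

shards = split_columns(w, parts=2)  # one column block per simulated rank
local_outs = [matmul(x, shard) for shard in shards]
gathered = concat_columns(local_outs)
reference = matmul(x, w)
```

No communication is needed until the concatenation step, which is why the forward pass of a column-parallel layer costs a single gather.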
MPI scientific compute
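
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    # Domain decomposition. scatter_grid, compute_step and global_grid are
    # application-defined placeholders, not part of mpi4py.
    local_data = scatter_grid(global_grid, rank, size)
    local_result = compute_step(local_data)
    # Sum the per-rank results; every rank receives the total
    global_result = comm.allreduce(local_result, op=MPI.SUM)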
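The source does not define `scatter_grid`. One plausible building block for it, shown here as a sketch only, is a 1D block decomposition that gives each rank a contiguous slice and spreads any remainder over the first ranks. `block_range` is a hypothetical helper name.

```python
def block_range(n_items, rank, size):
    """Half-open [start, stop) slice owned by `rank` in a 1D block decomposition.

    The first n_items % size ranks get one extra item, so shard sizes
    differ by at most one.
    """
    base, extra = divmod(n_items, size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


# 10 grid cells over 4 ranks (illustrative sizes)
ranges = [block_range(10, r, 4) for r in range(4)]
sizes = [stop - start for start, stop in ranges]
```

For 10 cells on 4 ranks this yields shard sizes 3, 3, 2, 2, so no rank carries more than one cell above the minimum.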
When to use: parallelism strategy selection, communication overhead analysis, NCCL/MPI debugging.
When not to use: inherently sequential algorithms, where the Amdahl bound caps any speedup.
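The Amdahl bound mentioned above is easy to evaluate before committing to a parallel design. A minimal sketch, assuming the parallelizable fraction p of the runtime is known from profiling (the function names are illustrative):

```python
def amdahl_speedup(parallel_fraction, workers):
    """Amdahl's law: speedup = 1 / ((1 - p) + p / n)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if workers < 1:
        raise ValueError("workers must be >= 1")
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / workers)


def amdahl_limit(parallel_fraction):
    """Upper bound on speedup as the worker count grows without limit."""
    serial = 1.0 - parallel_fraction
    return float("inf") if serial == 0.0 else 1.0 / serial


speedup_8 = amdahl_speedup(0.9, 8)
speedup_1024 = amdahl_speedup(0.9, 1024)
ceiling = amdahl_limit(0.9)
```

With 90% of the work parallelizable, 8 workers give roughly 4.7x and even 1024 workers stay below the 10x ceiling set by the serial 10%.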
❌ Anti-patterns
Premature parallelization: parallelizing blindly without first profiling the sequential version.
Communication-bound: chunks too fine-grained, so NCCL overhead dominates.
Load imbalance: uneven shard sizes produce stragglers.
Race conditions: shared state without synchronization.
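The communication-bound case can be estimated with the usual latency-plus-bandwidth (alpha-beta) model of a ring all-reduce. The sketch below assumes that model and uses made-up hardware numbers, not measurements; the function names are hypothetical. It compares sending one payload in a single call against splitting it into many small calls.

```python
def ring_allreduce_time(message_bytes, world_size, latency_s, bandwidth_bytes_per_s):
    """Alpha-beta estimate for one ring all-reduce.

    A ring all-reduce takes 2 * (N - 1) steps, each moving message_bytes / N.
    """
    if world_size < 2:
        return 0.0
    steps = 2 * (world_size - 1)
    per_step = latency_s + (message_bytes / world_size) / bandwidth_bytes_per_s
    return steps * per_step


def chunked_time(total_bytes, n_chunks, world_size, latency_s, bandwidth_bytes_per_s):
    """Total time when the payload is sent as n_chunks separate all-reduces."""
    return n_chunks * ring_allreduce_time(
        total_bytes / n_chunks, world_size, latency_s, bandwidth_bytes_per_s
    )


# Illustrative numbers only, not measurements
TOTAL = 256 * 1024 * 1024  # 256 MiB of gradients
LATENCY = 10e-6            # 10 microseconds per step
BANDWIDTH = 100e9          # 100 GB/s per link

one_call = chunked_time(TOTAL, 1, 8, LATENCY, BANDWIDTH)
many_calls = chunked_time(TOTAL, 10_000, 8, LATENCY, BANDWIDTH)
```

The bandwidth term is identical in both cases; only the per-step latency is paid 10,000 times over, which under these assumed numbers makes the fine-grained version more than a hundred times slower. Batching small tensors into larger buckets before the collective is the usual remedy.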
🧪 Verification / duplicates
Verified (Hennessy & Patterson 6e; Megatron-LM paper 2019; Llama 3 paper 2024; CUDA C++ Programming Guide 12.x).