---
id: wiki-2026-0508-compute-shaders
title: Compute Shaders
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [GPU Compute, GPGPU Shaders, WebGPU Compute]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [gpu, shaders, webgpu, parallel, wgsl, cuda]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: wgsl-glsl-cuda
  framework: webgpu-vulkan
---

# Compute Shaders

## In one line

> **"A GPU program that runs outside the graphics pipeline: arbitrary parallel computation."**

Unlike vertex and fragment shaders, a compute shader does not render anything; it exposes the GPU's raw SIMT execution directly. As of 2026, compute shaders (through WebGPU for browser GPU compute, CUDA, Vulkan compute, and Metal compute) dominate ML inference, particle simulations, and image processing. ML attention and matmul kernels are, at their core, compute shaders.

## Core concepts

### Execution model

- **Workgroup**: a group of threads that execute the same shader (e.g. 64 or 256 threads).
- **Invocation**: a single thread.
- **Shared memory** (workgroup): fast, visible only within one workgroup.
- **Storage buffer**: GPU global memory (read/write).
- **Uniform buffer**: small, read-only constants.
- **Dispatch**: the CPU launches N workgroups.

### Hardware mapping

- NVIDIA: warp (32 threads), SM (streaming multiprocessor).
- AMD: wave (64 threads; 32 on RDNA), CU (compute unit).
- Apple: simdgroup (32), GPU core.
- Intel: subgroup, EU (execution unit).

### Languages in 2026

- **WGSL** (WebGPU): cross-platform, modern.
- **HLSL** (DirectX, Vulkan via DXC).
- **GLSL** (OpenGL, Vulkan).
- **MSL** (Metal Shading Language).
- **CUDA C++**: NVIDIA only, but mature.
- **Triton** (OpenAI): Python-like DSL for ML kernels.

### Applications

1. ML inference (matmul, attention, conv).
2. Image filters (blur, edge detection, color grading).
3. Particle systems / fluid simulation.
4. Physics (cloth, soft body, mass-spring).
5. Cryptography (proof-of-work, hash-collision search).
6. Video encode/decode preparation.

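The workgroup, invocation, and dispatch terms from the execution model can be made concrete with a small CPU-side sketch. It is not GPU code: it only mimics how a 1D dispatch assigns a global index to each invocation and why an out-of-range guard is needed when the item count is not a multiple of the workgroup size. The names `workgroup_count`, `global_invocation_id`, and `simulate_dispatch` are hypothetical helpers for illustration, and the workgroup size of 64 is an assumption.

```python
WORKGROUP_SIZE = 64  # assumed size, as in a shader declared with @workgroup_size(64)


def workgroup_count(n_items: int, workgroup_size: int = WORKGROUP_SIZE) -> int:
    """Workgroups to dispatch so that every item gets an invocation (ceiling division)."""
    return (n_items + workgroup_size - 1) // workgroup_size


def global_invocation_id(workgroup_id: int, local_id: int,
                         workgroup_size: int = WORKGROUP_SIZE) -> int:
    """1D global index of an invocation: workgroup offset plus index inside the workgroup."""
    return workgroup_id * workgroup_size + local_id


def simulate_dispatch(a, b, workgroup_size: int = WORKGROUP_SIZE):
    """Sequentially replay a 1D vector-add dispatch; a real GPU runs invocations in parallel."""
    n = len(a)
    out = [0.0] * n
    for wg in range(workgroup_count(n, workgroup_size)):
        for lid in range(workgroup_size):
            gid = global_invocation_id(wg, lid, workgroup_size)
            if gid >= n:
                continue  # the last workgroup may overshoot the data; skip those invocations
            out[gid] = a[gid] + b[gid]
    return out


if __name__ == "__main__":
    print(workgroup_count(100))            # 100 items need 2 workgroups of 64
    print(global_invocation_id(2, 5))      # invocation 5 of workgroup 2
    print(simulate_dispatch([1.0] * 100, [2.0] * 100)[:3])
```

With 100 items and 64 threads per workgroup, two workgroups are dispatched and the last 28 invocations do nothing, which is the same role the early-return bounds check plays in a real shader.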
## 💻 Patterns

### WGSL compute shader: vector add

```wgsl
// Element type f32 matches the Float32Array data uploaded by the host code.
@group(0) @binding(0) var<storage, read> a : array<f32>;
@group(0) @binding(1) var<storage, read> b : array<f32>;
@group(0) @binding(2) var<storage, read_write> c : array<f32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid : vec3u) {
  let i = gid.x;
  // The last workgroup may extend past the end of the data.
  if (i >= arrayLength(&a)) { return; }
  c[i] = a[i] + b[i];
}
```

### WebGPU dispatch (TypeScript)

```typescript
// WGSL_SOURCE is assumed to hold the vector-add shader above as a string.
const adapter = await navigator.gpu.requestAdapter();
const device = await adapter!.requestDevice();

const module = device.createShaderModule({ code: WGSL_SOURCE });
const pipeline = device.createComputePipeline({
  layout: 'auto',
  compute: { module, entryPoint: 'main' },
});

const N = 1_000_000;

// Create a storage buffer and upload the initial data.
const buf = (data: Float32Array) => {
  const b = device.createBuffer({
    size: data.byteLength,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST | GPUBufferUsage.COPY_SRC,
  });
  device.queue.writeBuffer(b, 0, data);
  return b;
};

const a = buf(new Float32Array(N).fill(1));
const b = buf(new Float32Array(N).fill(2));
// Output buffer: N f32 values, 4 bytes each.
const c = device.createBuffer({
  size: N * 4,
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC,
});

const bindGroup = device.createBindGroup({
  layout: pipeline.getBindGroupLayout(0),
  entries: [
    { binding: 0, resource: { buffer: a } },
    { binding: 1, resource: { buffer: b } },
    { binding: 2, resource: { buffer: c } },
  ],
});

const enc = device.createCommandEncoder();
const pass = enc.beginComputePass();
pass.setPipeline(pipeline);
pass.setBindGroup(0, bindGroup);
// One workgroup per 64 elements, rounded up (64 matches @workgroup_size).
pass.dispatchWorkgroups(Math.ceil(N / 64));
pass.end();
device.queue.submit([enc.finish()]);
```

### Shared memory reduction (workgroup-local)

```wgsl
// Binding indices are illustrative; the original snippet did not show them.
@group(0) @binding(0) var<storage, read> input : array<f32>;
@group(0) @binding(1) var<storage, read_write> output : array<f32>;

// Named `scratch` because `shared` is a reserved word in WGSL.
var<workgroup> scratch : array<f32, 256>;

@compute @workgroup_size(256)
fn reduce(
  @builtin(local_invocation_id) lid : vec3u,
  @builtin(global_invocation_id) gid : vec3u,
) {
  // Pad with 0.0 past the end of the input so partial workgroups sum correctly.
  var v = 0.0;
  if (gid.x < arrayLength(&input)) { v = input[gid.x]; }
  scratch[lid.x] = v;
  workgroupBarrier();

  // Tree reduction: halve the number of active threads each step.
  var stride : u32 = 128u;
  loop {
    if (stride == 0u) { break; }
    if (lid.x < stride) {
      scratch[lid.x] = scratch[lid.x] + scratch[lid.x + stride];
    }
    workgroupBarrier();
    stride = stride / 2u;
  }

  // Thread 0 writes one partial sum per workgroup.
  if (lid.x == 0u) {
    output[gid.x / 256u] = scratch[0];
  }
}
```

### CUDA matmul kernel

```cuda
// Naive N x N matrix multiply: one thread computes one element of C.
__global__ void matmul(const float* A, const float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= N || col >= N) return;

    float sum = 0.0f;
    for (int k = 0; k < N; ++k) {
        sum += A[row * N + k] * B[k * N + col];
    }
    C[row * N + col] = sum;
}

// Launch (grid and block sizes below are an example, not from the original note):
//   dim3 block(16, 16);
//   dim3 grid((N + 15) / 16, (N + 15) / 16);
//   matmul<<<grid, block>>>(A, B, C, N);
```

### Triton kernel (Python ML)

```python
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n  # guard the tail block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


# Launch: x, y, out are GPU tensors of length N.
add_kernel[(triton.cdiv(N, 1024),)](x, y, out, N, BLOCK=1024)
```

### Image blur compute (storage texture)

```wgsl
@group(0) @binding(0) var src : texture_2d<f32>;
// rgba8unorm is an assumed format; use whatever format the destination texture has.
@group(0) @binding(1) var dst : texture_storage_2d<rgba8unorm, write>;

@compute @workgroup_size(8, 8)
fn blur(@builtin(global_invocation_id) gid : vec3u) {
  let dims = vec2i(textureDimensions(src));
  let center = vec2i(gid.xy);
  // Skip invocations outside the image when its size is not a multiple of 8.
  if (any(center >= dims)) { return; }

  // 3x3 box blur; neighbor coordinates are clamped at the image border.
  var sum = vec4f(0);
  for (var dy = -1; dy <= 1; dy++) {
    for (var dx = -1; dx <= 1; dx++) {
      let p = clamp(center + vec2i(dx, dy), vec2i(0), dims - vec2i(1));
      sum = sum + textureLoad(src, p, 0);
    }
  }
  textureStore(dst, center, sum / 9.0);
}
```

## Decision criteria

| Situation | Approach |
|---|---|
| Browser, cross-platform | WebGPU + WGSL |
| Native cross-platform | Vulkan compute + GLSL/HLSL |
| NVIDIA-only, max perf | CUDA |
| Apple ecosystem | Metal compute (MSL) |
| ML kernel research | Triton (Python) |
| Production ML inference | Pre-built (cuDNN, MLX, vLLM kernels) |

**Defaults**: for web or cross-platform work, WebGPU + WGSL. For ML research, Triton. For production ML on NVIDIA, CUDA + cuDNN/cuBLAS.

## 🔗 Graph

- Parents: [[GPU Programming]] · [[Parallel Computing]]
- Variants: [[Vertex Shader]] · [[Fragment Shader]] · [[CUDA]]
- Applications: [[WebGPU]]
- Adjacent: [[Triton]] · [[Vulkan]]

## 🤖 LLM usage

**When to use**: heavy parallel data (image, ML, simulation), browser GPU compute, custom ML kernels.

**When not to use**: small data (<10k items, where the CPU is faster once transfer cost is counted), branch-heavy serial logic, very small kernels (launch overhead dominates).

## ❌ Anti-patterns

- **Divergent branching in a warp**: threads that take different paths are serialized, which defeats SIMT.
- **Uncoalesced memory access**: random access patterns waste bandwidth; adjacent threads should read adjacent memory.
- **Tiny dispatch**: with around 100 threads, launch overhead exceeds the work; batch instead.
- **Forgetting workgroupBarrier**: race condition on shared memory.
- **CPU↔GPU ping-pong**: copying results back after every step; keep data on the GPU.

## 🧪 Verification / duplicates

- Verified (WebGPU spec 2026 W3C / CUDA Programming Guide 12.x / Triton docs / Apple MSL).
- Trust level: A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: WGSL/CUDA/Triton + workgroup model |