2nd/10_Wiki/Topics/Architecture/Compute_Shaders.md

---
id: wiki-2026-0508-compute-shaders
title: Compute Shaders
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [GPU Compute, GPGPU Shaders, WebGPU Compute]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [gpu, shaders, webgpu, parallel, wgsl, cuda]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: wgsl-glsl-cuda
  framework: webgpu-vulkan
---

# Compute Shaders

## One-line summary
> **"A GPU program that is not part of the graphics pipeline: arbitrary parallel computation."** Unlike vertex/fragment shaders, a compute shader is not tied to rendering; it exposes the raw power of SIMT execution. As of 2026 it is the dominant model for ML inference, particle simulations, and image processing, exposed through WebGPU (browser GPU compute), CUDA, Vulkan compute, and Metal compute. ML attention and matmul kernels are compute shaders at heart.

## Core concepts

### Execution model
- **Workgroup**: a group of threads that run the same shader together (e.g. 64 or 256 threads).
- **Invocation**: single thread.
- **Shared memory** (workgroup): fast, intra-group.
- **Storage buffer**: GPU global memory (read/write).
- **Uniform buffer**: small, read-only constants.
- **Dispatch**: the CPU launches N workgroups.
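
To make the workgroup / invocation / dispatch relationship concrete, here is a CPU-side sketch in plain Python (not GPU code) that enumerates the IDs a 1-D dispatch would produce. The helper names `workgroups_needed` and `global_ids` are illustrative only; the formula `global_id = workgroup_id * workgroup_size + local_id` matches how WGSL defines `global_invocation_id` for the x dimension.

```python
def workgroups_needed(n_items: int, workgroup_size: int) -> int:
    """Ceil-divide so every item is covered by at least one invocation."""
    return (n_items + workgroup_size - 1) // workgroup_size


def global_ids(n_workgroups: int, workgroup_size: int):
    """Yield (workgroup_id, local_id, global_id) for a 1-D dispatch."""
    for wg in range(n_workgroups):
        for local in range(workgroup_size):
            yield wg, local, wg * workgroup_size + local


n_items = 150          # hypothetical problem size
size = 64              # threads per workgroup
groups = workgroups_needed(n_items, size)
ids = list(global_ids(groups, size))
# Invocations whose global id is past the end must exit via a bounds check.
idle = sum(1 for _, _, g in ids if g >= n_items)
print(groups, len(ids), idle)
```

Because the dispatch is rounded up to whole workgroups, the last workgroup usually contains invocations with nothing to do, which is why kernels start with a bounds check.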

### Hardware mapping
- NVIDIA: warp (32 threads), SM (streaming multiprocessor).
- AMD: wavefront (64 threads; RDNA: 32), CU (compute unit).
- Apple: simdgroup (32), GPU core.
- Intel: subgroup, EU (execution unit).
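
Workgroup size interacts with these hardware widths: a workgroup is split into warps/waves/simdgroups, and a partially filled one still occupies a full hardware unit. The sketch below is a deliberately simplified packing model (it ignores scheduling, occupancy, and register pressure), and `subgroup_usage` is a hypothetical helper, not a real API.

```python
def subgroup_usage(workgroup_size: int, subgroup_width: int):
    """Return (subgroups_per_workgroup, idle_lanes) under a simple packing model."""
    subgroups = -(-workgroup_size // subgroup_width)  # ceil division
    idle_lanes = subgroups * subgroup_width - workgroup_size
    return subgroups, idle_lanes


for width in (32, 64):
    for wg_size in (48, 64, 256):
        subgroups, idle = subgroup_usage(wg_size, width)
        print(f"width={width:2d} workgroup={wg_size:3d} -> "
              f"{subgroups} subgroup(s), {idle} idle lane(s)")
```

Under this model, workgroup sizes that are multiples of 64 leave no idle lanes on either 32-wide or 64-wide hardware, which is one reason 64 and 256 are common choices.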

### Languages (2026)
- **WGSL** (WebGPU): cross-platform, modern.
- **HLSL** (DirectX, Vulkan via DXC).
- **GLSL** (OpenGL, Vulkan).
- **MSL** (Metal Shading Language).
- **CUDA C++**: NVIDIA only, but mature.
- **Triton** (OpenAI): Python-like ML kernel DSL.

### Applications
1. ML inference (matmul, attention, conv).
2. Image filters (blur, edge, color grading).
3. Particle systems / fluid sim.
4. Physics (cloth, soft body, mass-spring).
5. Cryptography (proof-of-work, hash collisions).
6. Video encode/decode prep.

## 💻 Patterns

### WGSL compute shader — vector add
```wgsl
@group(0) @binding(0) var<storage, read>      a : array<f32>;
@group(0) @binding(1) var<storage, read>      b : array<f32>;
@group(0) @binding(2) var<storage, read_write> c : array<f32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid : vec3u) {
  let i = gid.x;
  if (i >= arrayLength(&a)) { return; }
  c[i] = a[i] + b[i];
}
```

### WebGPU dispatch (TypeScript)
```typescript
const adapter = await navigator.gpu.requestAdapter();
const device = await adapter!.requestDevice();

const module = device.createShaderModule({ code: WGSL_SOURCE });
const pipeline = device.createComputePipeline({
  layout: 'auto',
  compute: { module, entryPoint: 'main' },
});

const N = 1_000_000;
const buf = (data: Float32Array) => {
  const b = device.createBuffer({
    size: data.byteLength,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST | GPUBufferUsage.COPY_SRC,
  });
  device.queue.writeBuffer(b, 0, data);
  return b;
};

const a = buf(new Float32Array(N).fill(1));
const b = buf(new Float32Array(N).fill(2));
const c = device.createBuffer({
  size: N * 4,
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC,
});

const bindGroup = device.createBindGroup({
  layout: pipeline.getBindGroupLayout(0),
  entries: [
    { binding: 0, resource: { buffer: a } },
    { binding: 1, resource: { buffer: b } },
    { binding: 2, resource: { buffer: c } },
  ],
});

const enc = device.createCommandEncoder();
const pass = enc.beginComputePass();
pass.setPipeline(pipeline);
pass.setBindGroup(0, bindGroup);
pass.dispatchWorkgroups(Math.ceil(N / 64));
pass.end();
device.queue.submit([enc.finish()]);
```
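
The dispatch above needs 15,625 workgroups, which fits in one dimension. WebGPU caps each dispatch dimension at `maxComputeWorkgroupsPerDimension` (65,535 by default), so larger inputs have to be spread over a second dimension. The following is a rough sketch of that arithmetic in Python; `dispatch_grid` is a hypothetical helper, and it assumes the shader flattens its index itself (for example `gid.y * (num_workgroups.x * 64u) + gid.x`).

```python
MAX_PER_DIM = 65_535  # WebGPU default for maxComputeWorkgroupsPerDimension


def ceil_div(a: int, b: int) -> int:
    return (a + b - 1) // b


def dispatch_grid(n_items: int, workgroup_size: int = 64,
                  max_per_dim: int = MAX_PER_DIM):
    """Return (x, y) workgroup counts that cover n_items within the limit."""
    total = ceil_div(n_items, workgroup_size)
    if total <= max_per_dim:
        return total, 1
    y = ceil_div(total, max_per_dim)   # rows needed to stay under the cap
    x = ceil_div(total, y)             # spread workgroups evenly across rows
    if y > max_per_dim:
        raise ValueError("input too large for a 2-D dispatch")
    return x, y


print(dispatch_grid(1_000_000))    # fits in one dimension
print(dispatch_grid(10_000_000))   # needs a second dimension
```

The grid may overshoot the exact workgroup count by a few groups, so the in-shader bounds check is still required.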

### Shared memory reduction (workgroup-local)
```wgsl
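@group(0) @binding(0) var<storage, read>       src_data : array<f32>;
@group(0) @binding(1) var<storage, read_write> partials : array<f32>;

// `shared` is a reserved word in WGSL, so the scratch array uses another name.
var<workgroup> scratch : array<f32, 256>;

@compute @workgroup_size(256)
fn reduce(
  @builtin(local_invocation_id) lid : vec3u,
  @builtin(global_invocation_id) gid : vec3u,
) {
  // Pad out-of-range lanes with 0 instead of returning early, so every
  // invocation still reaches the barriers below.
  var v = 0.0;
  if (gid.x < arrayLength(&src_data)) { v = src_data[gid.x]; }
  scratch[lid.x] = v;
  workgroupBarrier();

  // Tree reduction: halve the active range on every step.
  for (var stride = 128u; stride > 0u; stride = stride / 2u) {
    if (lid.x < stride) {
      scratch[lid.x] = scratch[lid.x] + scratch[lid.x + stride];
    }
    workgroupBarrier();
  }

  // One partial sum per workgroup; a second pass reduces `partials`.
  if (lid.x == 0u) { partials[gid.x / 256u] = scratch[0]; }
}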
```
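
The shader above only produces one partial sum per workgroup, so a full reduction repeats the pass until a single value is left. The Python model below mirrors that structure on the CPU, which can be handy as a reference when checking GPU output. It is a sketch: padding a short final chunk with zeros is an assumption of this model, and `reduce_workgroup` / `reduce_all` are made-up names.

```python
def reduce_workgroup(chunk, size=256):
    """Mirror the in-workgroup tree reduction: pad with zeros, halve the stride."""
    scratch = list(chunk) + [0.0] * (size - len(chunk))
    stride = size // 2
    while stride > 0:
        for lid in range(stride):          # lanes with lid < stride are active
            scratch[lid] += scratch[lid + stride]
        stride //= 2                       # a barrier would sit here on the GPU
    return scratch[0]


def reduce_all(values, size=256):
    """Repeat the pass on a non-empty list until one value remains.

    Returns (total, number_of_passes).
    """
    passes = 0
    while len(values) > 1:
        values = [reduce_workgroup(values[i:i + size], size)
                  for i in range(0, len(values), size)]
        passes += 1
    return values[0], passes


data = [1.0] * 100_000
total, passes = reduce_all(data)
print(total, passes)
```

Each pass shrinks the input by a factor of 256, so the number of dispatches grows only logarithmically with the input size.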

### CUDA matmul kernel
```cuda
__global__ void matmul(const float* A, const float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= N || col >= N) return;

    float sum = 0.0f;
    for (int k = 0; k < N; ++k) {
        sum += A[row * N + k] * B[k * N + col];
    }
    C[row * N + col] = sum;
}
// Launch: matmul<<<dim3((N+15)/16, (N+15)/16), dim3(16,16)>>>(A, B, C, N);
```

### Triton kernel (Python ML)
```python
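import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(0)                       # one program per block of elements
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n                           # guard the ragged final block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

# Launch (assumes a CUDA device; tensors are allocated with PyTorch)
N = 1 << 20
x = torch.rand(N, device="cuda")
y = torch.rand(N, device="cuda")
out = torch.empty_like(x)
add_kernel[(triton.cdiv(N, 1024),)](x, y, out, N, BLOCK=1024)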
```

### Image blur compute (storage texture)
```wgsl
@group(0) @binding(0) var src : texture_2d<f32>;
@group(0) @binding(1) var dst : texture_storage_2d<rgba8unorm, write>;

@compute @workgroup_size(8, 8)
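fn blur(@builtin(global_invocation_id) gid : vec3u) {
  let dims = vec2i(textureDimensions(src));
  let center = vec2i(gid.xy);
  // Skip invocations outside the image (the dispatch is rounded up to 8x8 tiles).
  if (center.x >= dims.x || center.y >= dims.y) { return; }

  var sum = vec4f(0);
  for (var dy = -1; dy <= 1; dy++) {
    for (var dx = -1; dx <= 1; dx++) {
      // Clamp to the edge so border pixels never sample out of bounds.
      let p = clamp(center + vec2i(dx, dy), vec2i(0), dims - vec2i(1));
      sum = sum + textureLoad(src, p, 0);
    }
  }
  // 3x3 box filter: average of the nine samples.
  textureStore(dst, center, sum / 9.0);
}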
```
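
When validating a blur kernel like this, a slow CPU reference is useful for comparison. The sketch below computes the same 3x3 box filter on a single-channel 2-D list. It clamps coordinates at the border, which is one possible edge policy and an assumption here; interior pixels do not depend on that choice. `box_blur_3x3` is an illustrative name.

```python
def box_blur_3x3(img):
    """3x3 box blur over a 2-D list of floats, clamping coordinates at the edges."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    py = min(max(y + dy, 0), h - 1)   # clamp row index
                    px = min(max(x + dx, 0), w - 1)   # clamp column index
                    total += img[py][px]
            out[y][x] = total / 9.0
    return out


# A single bright pixel in the middle of a 5x5 image spreads to its 3x3 neighborhood.
impulse = [[0.0] * 5 for _ in range(5)]
impulse[2][2] = 9.0
blurred = box_blur_3x3(impulse)
print(blurred[2][2], blurred[1][1], blurred[0][0])
```

Comparing GPU output against such a reference with a small tolerance accounts for the `rgba8unorm` quantization of the storage texture.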

## Decision criteria
| Situation | Approach |
|---|---|
| Browser, cross-platform | WebGPU + WGSL |
| Native cross-platform | Vulkan compute + GLSL/HLSL |
| NVIDIA-only, max perf | CUDA |
| Apple ecosystem | Metal compute (MSL) |
| ML kernel research | Triton (Python) |
| Production ML inference | Pre-built (cuDNN, MLX, vLLM kernels) |

**Defaults**: web/cross-platform → WebGPU + WGSL. ML research → Triton. Production NVIDIA ML → CUDA + cuDNN/cuBLAS.

## 🔗 Graph
- Parent: [[GPU Programming]] · [[Parallel Computing]]
- Variants: [[Vertex Shader]] · [[Fragment Shader]] · [[CUDA]]
- Applications: [[WebGPU]]
- Adjacent: [[Triton]] · [[Vulkan]]

## 🤖 LLM usage
**When to use**: heavy parallel data (image, ML, sim), browser GPU compute, custom ML kernels.
**When not to use**: small data (<10k items; the CPU is faster once transfer cost is counted), branch-heavy serial logic, very small kernels (launch overhead dominates).
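
The small-data rule of thumb can be sanity-checked with a back-of-the-envelope cost model. The sketch below is only that: every throughput and latency figure is a made-up placeholder rather than a measurement, and the function names are hypothetical. It models GPU time as launch overhead plus a round-trip transfer plus kernel time.

```python
def gpu_time_s(n_items, bytes_per_item, *, bus_bytes_per_s,
               launch_overhead_s, gpu_items_per_s):
    """Upload + download over the bus, one launch, then the kernel itself."""
    transfer = 2 * n_items * bytes_per_item / bus_bytes_per_s
    return launch_overhead_s + transfer + n_items / gpu_items_per_s


def cpu_time_s(n_items, *, cpu_items_per_s):
    return n_items / cpu_items_per_s


# Placeholder figures only; measure your own hardware before deciding.
params = dict(bus_bytes_per_s=10e9, launch_overhead_s=50e-6, gpu_items_per_s=50e9)
for n in (1_000, 10_000, 1_000_000, 100_000_000):
    gpu = gpu_time_s(n, 4, **params)
    cpu = cpu_time_s(n, cpu_items_per_s=1e9)
    winner = "GPU" if gpu < cpu else "CPU"
    print(f"n={n:>11,}  cpu={cpu * 1e6:10.1f} us  gpu={gpu * 1e6:10.1f} us  -> {winner}")
```

With these placeholder numbers the fixed launch overhead dominates for small inputs, and the transfer term dominates for large ones, which is why keeping data resident on the GPU matters more than raw kernel speed for simple element-wise work.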

## ❌ Anti-patterns
- **Divergent branching in a warp**: threads in the same warp take different paths → the paths run one after another → SIMT parallelism is lost.
- **Uncoalesced memory access**: random access patterns waste bandwidth; adjacent threads should read adjacent memory.
- **Tiny dispatch**: with ~100 threads, launch overhead exceeds the work done; batch small jobs together.
- **Forgetting workgroupBarrier**: race condition on shared (workgroup) memory.
- **CPU↔GPU ping-pong**: copying results back after every step; keep data on the GPU between passes.
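
The coalescing point can be illustrated by counting how many aligned memory segments one warp touches for a given access stride. This is a simplified model: the 128-byte segment size and 32-lane width are assumptions chosen for illustration, and `segments_touched` is a hypothetical helper.

```python
def segments_touched(stride_elems: int, *, lanes: int = 32,
                     elem_bytes: int = 4, segment_bytes: int = 128) -> int:
    """Count distinct aligned segments read when lane i loads element i * stride_elems."""
    return len({(lane * stride_elems * elem_bytes) // segment_bytes
                for lane in range(lanes)})


for stride in (1, 2, 8, 32):
    print(f"stride={stride:2d} -> {segments_touched(stride)} segment(s) per warp")
```

In this model a unit stride serves the whole warp from a single segment, while a stride of 32 elements needs one segment per lane, so most of the fetched bytes are thrown away.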

## 🧪 Verification / Duplicates
- Verified (WebGPU spec 2026 W3C / CUDA Programming Guide 12.x / Triton docs / Apple MSL).
- Trust level: A.

## 🕓 Changelog
| 날짜 | 변경 |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — WGSL/CUDA/Triton + workgroup model |