---
id: wiki-2026-0508-compute-shaders
title: Compute Shaders
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [GPU Compute, GPGPU Shaders, WebGPU Compute]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [gpu, shaders, webgpu, parallel, wgsl, cuda]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: wgsl-glsl-cuda
  framework: webgpu-vulkan
---

# Compute Shaders

## In one line

> **"A GPU program that runs outside the graphics pipeline: arbitrary parallel computation."** Unlike vertex and fragment shaders, a compute shader does no rendering; it exposes the GPU's raw SIMT power directly. As of 2026 it is the dominant model for GPU compute across WebGPU (browser GPU compute), CUDA, Vulkan compute, and Metal compute, and for workloads such as ML inference, particle simulation, and image processing. The attention and matmul kernels in ML are compute shaders at heart.

## Core concepts

### Execution model

- **Workgroup**: a group of threads that run the same shader together (e.g. 64 or 256 threads).
- **Invocation**: a single thread.
- **Shared memory** (workgroup memory): fast, visible only within one workgroup.
- **Storage buffer**: GPU global memory (read/write).
- **Uniform buffer**: small, read-only constants.
- **Dispatch**: the CPU launches N workgroups.
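
The relationship between dispatch size, workgroups, and invocations can be sketched on the CPU. This is a minimal one-dimensional model, not an API: the function names `workgroup_count`, `global_invocation_id`, and `simulate_dispatch` are hypothetical and exist only for illustration.

```python
def workgroup_count(n_items, workgroup_size):
    """Workgroups to dispatch so that every item gets one invocation (ceil division)."""
    return (n_items + workgroup_size - 1) // workgroup_size


def global_invocation_id(workgroup_id, local_id, workgroup_size):
    """1-D global index of an invocation, derived from its workgroup and local index."""
    return workgroup_id * workgroup_size + local_id


def simulate_dispatch(n_items, workgroup_size):
    """Return the item indices that survive the bounds check, in launch order."""
    touched = []
    for wg in range(workgroup_count(n_items, workgroup_size)):
        for lid in range(workgroup_size):
            gid = global_invocation_id(wg, lid, workgroup_size)
            if gid >= n_items:
                # Same role as an early return in the shader: the last
                # workgroup usually has invocations past the end of the data.
                continue
            touched.append(gid)
    return touched
```

The bounds check matters because the dispatch count is rounded up: with 100 items and a workgroup size of 64, two workgroups launch 128 invocations and 28 of them must do nothing.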

### Hardware mapping

- NVIDIA: warp (32 threads), SM (streaming multiprocessor).
- AMD: wavefront (64 threads; RDNA also runs 32-wide), CU (compute unit).
- Apple: simdgroup (32 threads), GPU core.
- Intel: subgroup (width varies by hardware and compiler), EU (execution unit).
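
One practical consequence of these widths: a workgroup is executed as whole warps/waves, so a workgroup size that is not a multiple of the hardware width leaves lanes idle. The sketch below only does the arithmetic; `subgroups_per_workgroup` and `idle_lanes` are hypothetical helper names, and the widths passed in are the ones listed above.

```python
def subgroups_per_workgroup(workgroup_size, subgroup_width):
    """How many warps/waves the hardware needs to cover one workgroup."""
    return (workgroup_size + subgroup_width - 1) // subgroup_width


def idle_lanes(workgroup_size, subgroup_width):
    """Lanes that are allocated but have no invocation to run."""
    allocated = subgroups_per_workgroup(workgroup_size, subgroup_width) * subgroup_width
    return allocated - workgroup_size
```

For example, a workgroup of 256 maps to 8 warps of 32 with no waste, while a workgroup of 100 needs 4 warps and leaves 28 lanes idle.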

### Languages (2026)

- **WGSL** (WebGPU): cross-platform, modern.
- **HLSL** (DirectX; Vulkan via DXC).
- **GLSL** (OpenGL, Vulkan).
- **MSL** (Metal Shading Language).
- **CUDA C++**: NVIDIA only, but mature.
- **Triton** (OpenAI): Python-like DSL for ML kernels.

### Applications

1. ML inference (matmul, attention, conv).
2. Image filters (blur, edge detection, color grading).
3. Particle systems / fluid simulation.
4. Physics (cloth, soft body, mass-spring).
5. Cryptography (proof-of-work, hash collisions).
6. Video encode/decode prep.

## 💻 Patterns

### WGSL compute shader: vector add

```wgsl
@group(0) @binding(0) var<storage, read> a : array<f32>;
@group(0) @binding(1) var<storage, read> b : array<f32>;
@group(0) @binding(2) var<storage, read_write> c : array<f32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid : vec3u) {
  let i = gid.x;
  // The last workgroup may run past the end of the arrays.
  if (i >= arrayLength(&a)) { return; }
  c[i] = a[i] + b[i];
}
```

### WebGPU dispatch (TypeScript)

```typescript
// WGSL_SOURCE is assumed to hold the vector-add shader source as a string.
const adapter = await navigator.gpu.requestAdapter();
if (!adapter) throw new Error('WebGPU is not available');
const device = await adapter.requestDevice();

const module = device.createShaderModule({ code: WGSL_SOURCE });
const pipeline = device.createComputePipeline({
  layout: 'auto',
  compute: { module, entryPoint: 'main' },
});

const N = 1_000_000;

// Create a storage buffer and upload `data` into it.
const buf = (data: Float32Array) => {
  const b = device.createBuffer({
    size: data.byteLength,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST | GPUBufferUsage.COPY_SRC,
  });
  device.queue.writeBuffer(b, 0, data);
  return b;
};

const a = buf(new Float32Array(N).fill(1));
const b = buf(new Float32Array(N).fill(2));
// Output buffer: 4 bytes per f32.
const c = device.createBuffer({
  size: N * 4,
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC,
});

const bindGroup = device.createBindGroup({
  layout: pipeline.getBindGroupLayout(0),
  entries: [
    { binding: 0, resource: { buffer: a } },
    { binding: 1, resource: { buffer: b } },
    { binding: 2, resource: { buffer: c } },
  ],
});

const enc = device.createCommandEncoder();
const pass = enc.beginComputePass();
pass.setPipeline(pipeline);
pass.setBindGroup(0, bindGroup);
// One workgroup per 64 elements, rounded up; must match @workgroup_size(64).
pass.dispatchWorkgroups(Math.ceil(N / 64));
pass.end();
device.queue.submit([enc.finish()]);
// The result stays in `c` on the GPU; copy it to a MAP_READ buffer to read it on the CPU.
```

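### Shared memory reduction (workgroup-local)

```wgsl
// Binding slots here are illustrative.
@group(0) @binding(0) var<storage, read> input : array<f32>;
@group(0) @binding(1) var<storage, read_write> output : array<f32>;

// Named `scratch` because `shared` is a reserved word in WGSL.
var<workgroup> scratch : array<f32, 256>;

@compute @workgroup_size(256)
fn reduce(
  @builtin(local_invocation_id) lid : vec3u,
  @builtin(global_invocation_id) gid : vec3u,
) {
  // Pad with 0.0 when the input length is not a multiple of 256.
  var v = 0.0;
  if (gid.x < arrayLength(&input)) { v = input[gid.x]; }
  scratch[lid.x] = v;
  workgroupBarrier();

  // Tree reduction: halve the number of active threads each round.
  var stride : u32 = 128u;
  loop {
    if (stride == 0u) { break; }
    if (lid.x < stride) {
      scratch[lid.x] = scratch[lid.x] + scratch[lid.x + stride];
    }
    workgroupBarrier();
    stride = stride / 2u;
  }

  // One partial sum per workgroup.
  if (lid.x == 0u) { output[gid.x / 256u] = scratch[0]; }
}
```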
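
To check the stride logic of the reduction without a GPU, the same rounds can be replayed sequentially. This is a sketch under two assumptions: the workgroup size is a power of two, and a short final chunk is padded with zeros. The names `workgroup_reduce` and `reduce_array` are hypothetical.

```python
def workgroup_reduce(values):
    """Tree-reduce one workgroup's worth of values; len(values) must be a power of two."""
    scratch = list(values)  # stands in for workgroup (shared) memory
    n = len(scratch)
    assert n > 0 and n & (n - 1) == 0, "workgroup size must be a power of two"
    stride = n // 2
    while stride > 0:
        # Within a round, thread `lid` writes scratch[lid] (lid < stride) and reads
        # scratch[lid + stride] (index >= stride), so no thread reads a slot that
        # another thread writes. The end of each round plays the role of the barrier.
        for lid in range(stride):
            scratch[lid] += scratch[lid + stride]
        stride //= 2
    return scratch[0]


def reduce_array(data, workgroup_size=256):
    """One partial sum per workgroup, padding the last chunk with zeros."""
    partials = []
    for start in range(0, len(data), workgroup_size):
        chunk = list(data[start:start + workgroup_size])
        chunk += [0.0] * (workgroup_size - len(chunk))
        partials.append(workgroup_reduce(chunk))
    return partials
```

The output is one value per workgroup, so a full sum needs a second pass over the partials (another dispatch, or a short loop on the CPU).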

### CUDA matmul kernel

```cuda
__global__ void matmul(const float* A, const float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= N || col >= N) return;

    float sum = 0.0f;
    for (int k = 0; k < N; ++k) {
        sum += A[row * N + k] * B[k * N + col];
    }
    C[row * N + col] = sum;
}
// Launch: matmul<<<dim3((N+15)/16, (N+15)/16), dim3(16,16)>>>(A, B, C, N);
```

### Triton kernel (Python ML)

```python
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n  # guard the tail block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

# Launch (x, y, out are assumed to be GPU tensors of length N)
add_kernel[(triton.cdiv(N, 1024),)](x, y, out, N, BLOCK=1024)
```

### Image blur compute (storage texture)

```wgsl
@group(0) @binding(0) var src : texture_2d<f32>;
@group(0) @binding(1) var dst : texture_storage_2d<rgba8unorm, write>;

@compute @workgroup_size(8, 8)
fn blur(@builtin(global_invocation_id) gid : vec3u) {
  // Skip invocations outside the image (sizes that are not a multiple of 8).
  let dims = textureDimensions(src);
  if (gid.x >= dims.x || gid.y >= dims.y) { return; }
  let max_xy = vec2i(dims) - vec2i(1);

  // 3x3 box filter; neighbours past the border are clamped to the edge.
  var sum = vec4f(0);
  for (var dy = -1; dy <= 1; dy++) {
    for (var dx = -1; dx <= 1; dx++) {
      let p = clamp(vec2i(gid.xy) + vec2i(dx, dy), vec2i(0), max_xy);
      sum = sum + textureLoad(src, p, 0);
    }
  }
  textureStore(dst, vec2i(gid.xy), sum / 9.0);
}
```
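
A CPU reference is a convenient way to validate a blur shader's output on a small image. The sketch below implements a 3x3 box blur on a single-channel image stored as a list of rows. Clamp-to-edge border handling is an assumption of this sketch, and `box_blur_3x3` is a hypothetical name.

```python
def box_blur_3x3(image):
    """3x3 box blur of a 2-D list of floats, clamping sample coordinates to the edge."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    sy = min(max(y + dy, 0), height - 1)  # clamp row
                    sx = min(max(x + dx, 0), width - 1)   # clamp column
                    total += image[sy][sx]
            out[y][x] = total / 9.0
    return out
```

Each `(x, y)` iteration corresponds to one GPU invocation, so comparing this output against a readback of the destination texture checks both the filter and the border handling.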

## Decision criteria

| Situation | Approach |
|---|---|
| Browser, cross-platform | WebGPU + WGSL |
| Native cross-platform | Vulkan compute + GLSL/HLSL |
| NVIDIA-only, max perf | CUDA |
| Apple ecosystem | Metal compute (MSL) |
| ML kernel research | Triton (Python) |
| Production ML inference | Pre-built (cuDNN, MLX, vLLM kernels) |

**Defaults**: web or cross-platform → WebGPU + WGSL. ML research → Triton. Production ML on NVIDIA → CUDA + cuDNN/cuBLAS.

## 🔗 Graph

- Parent: [[GPU Programming]] · [[Parallel Computing]]
- Variants: [[Vertex Shader]] · [[Fragment Shader]] · [[CUDA]]
- Applications: [[WebGPU]]
- Adjacent: [[Triton]] · [[Vulkan]]

## 🤖 LLM usage

**When**: heavy parallel data (image, ML, simulation), browser GPU compute, custom ML kernels.

**When not**: small data (<10k items, where the CPU is faster once transfer cost is counted), branch-heavy serial logic, very small kernels (launch overhead dominates).
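
The small-data rule of thumb follows from a simple cost model: the GPU pays a fixed launch overhead plus the transfer in both directions before its faster per-item rate can win. The sketch below makes that explicit. Every default number in it (per-item times, bandwidth, launch overhead) is a hypothetical placeholder chosen for illustration, not a measurement; substitute figures profiled on the target machine.

```python
def gpu_time_s(n_bytes, kernel_s, bandwidth_bytes_per_s, launch_overhead_s):
    """Estimated wall time on the GPU: launch + upload and download + kernel."""
    transfer_s = 2 * n_bytes / bandwidth_bytes_per_s
    return launch_overhead_s + transfer_s + kernel_s


def should_offload(
    n_items,
    bytes_per_item=4,
    cpu_s_per_item=1e-9,          # hypothetical: 1 ns per item on the CPU
    gpu_s_per_item=1e-11,         # hypothetical: 100x faster per item on the GPU
    bandwidth_bytes_per_s=10e9,   # hypothetical: 10 GB/s host<->device
    launch_overhead_s=50e-6,      # hypothetical: 50 us per dispatch
):
    """True when the modelled GPU time beats the modelled CPU time."""
    cpu_s = n_items * cpu_s_per_item
    gpu_s = gpu_time_s(
        n_items * bytes_per_item,
        n_items * gpu_s_per_item,
        bandwidth_bytes_per_s,
        launch_overhead_s,
    )
    return gpu_s < cpu_s
```

Under these placeholder numbers, 10,000 items stay on the CPU because the launch overhead alone exceeds the CPU time, while 10 million items favour the GPU. Keeping data resident on the GPU removes the transfer term and moves the break-even point lower.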

## ❌ Anti-patterns

- **Divergent branching within a warp**: threads take different paths, the paths are serialized, and the SIMT advantage is lost.
- **Uncoalesced memory access**: a random access pattern wastes bandwidth; adjacent threads should read adjacent memory.
- **Tiny dispatch**: with ~100 threads, launch overhead exceeds the work; batch instead.
- **Forgetting `workgroupBarrier`**: race condition on shared memory.
- **CPU↔GPU ping-pong**: copying results back after every step; keep data on the GPU.
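
The coalescing point can be made concrete by counting how many memory segments one warp touches for a given access stride. This is a simplified model: the 128-byte segment size and the 32-thread warp are assumptions, real hardware has more nuance (caches, alignment, sector granularity), and `warp_addresses` and `memory_transactions` are hypothetical names.

```python
def warp_addresses(stride_elems, warp_size=32, elem_bytes=4, base=0):
    """Byte address read by each lane when lane i accesses element i * stride_elems."""
    return [base + lane * stride_elems * elem_bytes for lane in range(warp_size)]


def memory_transactions(addresses, segment_bytes=128):
    """Number of distinct memory segments (assumed 128 bytes) the warp touches."""
    return len({addr // segment_bytes for addr in addresses})
```

With a stride of 1, thirty-two 4-byte reads fall into a single 128-byte segment. With a stride of 32 elements each lane lands in its own segment, so the same 128 bytes of useful data cost 32 transactions.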

## 🧪 Verification / duplicates

- Verified (WebGPU spec 2026 W3C / CUDA Programming Guide 12.x / Triton docs / Apple MSL).
- Trust level A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: WGSL/CUDA/Triton + workgroup model |