| id | title | category | status | canonical_id | aliases | duplicate_of | source_trust_level | confidence_score | verification_status | tags | raw_sources | last_reinforced | github_commit | tech_stack |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| wiki-2026-0508-compute-shaders | Compute Shaders | 10_Wiki/Topics | verified | self | | none | A | 0.9 | applied | | | 2026-05-10 | pending | |
Compute Shaders
In one line
"A GPU program that is not part of the graphics pipeline: arbitrary parallel computation." Unlike vertex and fragment shaders, a compute shader is not tied to rendering; it exposes the raw power of SIMT execution. As of 2026 it is the dominant approach across WebGPU (browser GPU compute), CUDA, Vulkan compute, Metal compute, ML inference, particle simulations, and image processing. ML attention and matmul kernels are compute shaders at heart.
Core concepts
Execution model
- Workgroup: a group of threads that execute the same shader (e.g. 64 or 256 threads).
- Invocation: single thread.
- Shared memory (workgroup): fast, intra-group.
- Storage buffer: GPU global memory (read/write).
- Uniform buffer: small, read-only constants.
- Dispatch: the CPU launches N workgroups.
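The relationship between dispatch, workgroup, and invocation can be sketched on the CPU. This is a minimal emulation of 1D indexing only; the helper names `workgroups_needed` and `global_ids` are hypothetical and do not belong to any GPU API.

```python
def workgroups_needed(n_items: int, workgroup_size: int) -> int:
    """Workgroups to dispatch so that every item gets one invocation (ceiling division)."""
    return (n_items + workgroup_size - 1) // workgroup_size


def global_ids(n_workgroups: int, workgroup_size: int):
    """Yield (workgroup_id, local_id, global_id) for a 1D dispatch."""
    for wg in range(n_workgroups):
        for local in range(workgroup_size):
            yield wg, local, wg * workgroup_size + local


n_items, size = 1000, 64
groups = workgroups_needed(n_items, size)
ids = [g for _, _, g in global_ids(groups, size)]
# Invocations past the end of the data must exit early (the bounds check in a kernel).
idle = sum(1 for g in ids if g >= n_items)
print(groups, len(ids), idle)
```

Because the dispatch is rounded up to whole workgroups, the last workgroup usually contains invocations with no data to process, which is why kernels guard on the array length.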
Hardware mapping
- NVIDIA: warp (32 threads), SM (streaming multiprocessor).
- AMD: wavefront (64 threads; 32 on RDNA), CU (compute unit).
- Apple: simdgroup (32), GPU core.
- Intel: subgroup, EU.
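How a workgroup size maps onto these hardware groups can be estimated with simple arithmetic. The sketch below uses the widths from the list above; the dictionary keys and function names are illustrative assumptions, and real drivers may choose widths differently.

```python
# Subgroup widths taken from the list above (AMD depends on the architecture).
SUBGROUP_WIDTH = {
    "nvidia_warp": 32,
    "amd_wave64": 64,
    "amd_wave32_rdna": 32,
    "apple_simdgroup": 32,
}


def subgroups_per_workgroup(workgroup_size: int, width: int) -> int:
    """Number of hardware subgroups needed to run one workgroup."""
    return (workgroup_size + width - 1) // width


def idle_lanes(workgroup_size: int, width: int) -> int:
    """Lanes left unused when the workgroup size is not a multiple of the width."""
    return subgroups_per_workgroup(workgroup_size, width) * width - workgroup_size


for name, width in SUBGROUP_WIDTH.items():
    print(name, subgroups_per_workgroup(256, width), idle_lanes(48, width))
```

Choosing a workgroup size that is a multiple of the subgroup width avoids idle lanes, which is one reason 64 and 256 are common choices.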
Languages (2026)
- WGSL (WebGPU): cross-platform, modern.
- HLSL (DirectX, Vulkan via DXC).
- GLSL (OpenGL, Vulkan).
- MSL (Metal Shading Language).
- CUDA C++: NVIDIA only, but mature.
- Triton (OpenAI): Python-like ML kernel DSL.
Applications
- ML inference (matmul, attention, conv).
- Image filters (blur, edge, color grading).
- Particle systems / fluid sim.
- Physics (cloth, soft body, mass-spring).
- Cryptography (proof-of-work, hash collisions).
- Video encode/decode prep.
💻 Patterns
WGSL compute shader — vector add
@group(0) @binding(0) var<storage, read> a : array<f32>;
@group(0) @binding(1) var<storage, read> b : array<f32>;
@group(0) @binding(2) var<storage, read_write> c : array<f32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid : vec3u) {
    let i = gid.x;
    if (i >= arrayLength(&a)) { return; }
    c[i] = a[i] + b[i];
}
WebGPU dispatch (TypeScript)
// WGSL_SOURCE is assumed to hold the vector-add shader text shown above.
const adapter = await navigator.gpu.requestAdapter();
const device = await adapter!.requestDevice();

const module = device.createShaderModule({ code: WGSL_SOURCE });
const pipeline = device.createComputePipeline({
  layout: 'auto',
  compute: { module, entryPoint: 'main' },
});

const N = 1_000_000;

// Create a storage buffer and upload the initial data.
const buf = (data: Float32Array) => {
  const b = device.createBuffer({
    size: data.byteLength,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST | GPUBufferUsage.COPY_SRC,
  });
  device.queue.writeBuffer(b, 0, data);
  return b;
};

const a = buf(new Float32Array(N).fill(1));
const b = buf(new Float32Array(N).fill(2));
// Output buffer: N f32 values, 4 bytes each.
const c = device.createBuffer({
  size: N * 4,
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC,
});

const bindGroup = device.createBindGroup({
  layout: pipeline.getBindGroupLayout(0),
  entries: [
    { binding: 0, resource: { buffer: a } },
    { binding: 1, resource: { buffer: b } },
    { binding: 2, resource: { buffer: c } },
  ],
});

const enc = device.createCommandEncoder();
const pass = enc.beginComputePass();
pass.setPipeline(pipeline);
pass.setBindGroup(0, bindGroup);
// One workgroup per 64 elements, rounded up; must match @workgroup_size(64).
pass.dispatchWorkgroups(Math.ceil(N / 64));
pass.end();
device.queue.submit([enc.finish()]);
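With N = 1,000,000 and 64 threads per workgroup, the dispatch above needs 15,625 workgroups, which fits in one dimension. WebGPU's default `maxComputeWorkgroupsPerDimension` limit is 65,535, so larger inputs need a 2D dispatch. The sketch below shows one way to split the count; `dispatch_dims` is a hypothetical helper, and the shader would then have to rebuild a linear index from `gid.x` and `gid.y`, which the kernel above does not do.

```python
MAX_PER_DIM = 65_535  # WebGPU default for maxComputeWorkgroupsPerDimension


def ceil_div(a: int, b: int) -> int:
    return (a + b - 1) // b


def dispatch_dims(n_items: int, workgroup_size: int, max_per_dim: int = MAX_PER_DIM):
    """Return (x, y) workgroup counts covering n_items, each within the per-dimension limit."""
    total = ceil_div(n_items, workgroup_size)
    if total <= max_per_dim:
        return (total, 1)
    y = ceil_div(total, max_per_dim)   # fewest rows that can hold all workgroups
    x = ceil_div(total, y)             # balance the columns across those rows
    if y > max_per_dim:
        raise ValueError("input too large for a 2D dispatch")
    return (x, y)


print(dispatch_dims(1_000_000, 64))
print(dispatch_dims(10_000_000, 64))
```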
Shared memory reduction (workgroup-local)
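// The `input`/`output` bindings are assumed; the original snippet left them undeclared.
@group(0) @binding(0) var<storage, read> input : array<f32>;
@group(0) @binding(1) var<storage, read_write> output : array<f32>;
// Named `partials` because `shared` is a reserved word in WGSL.
var<workgroup> partials : array<f32, 256>;

@compute @workgroup_size(256)
fn reduce(
    @builtin(local_invocation_id) lid : vec3u,
    @builtin(global_invocation_id) gid : vec3u,
) {
    // Out-of-range invocations contribute 0 so every thread still reaches the barriers.
    var v = 0.0;
    if (gid.x < arrayLength(&input)) { v = input[gid.x]; }
    partials[lid.x] = v;
    workgroupBarrier();

    // Halve the stride each step: 128, 64, ..., 1.
    var stride : u32 = 128u;
    loop {
        if (stride == 0u) { break; }
        if (lid.x < stride) {
            partials[lid.x] = partials[lid.x] + partials[lid.x + stride];
        }
        workgroupBarrier();
        stride = stride / 2u;
    }

    // Thread 0 writes one partial sum per workgroup.
    if (lid.x == 0u) { output[gid.x / 256u] = partials[0]; }
}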
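The stride-halving loop can be checked against a CPU emulation. This is a sketch under the assumption that the workgroup size is a power of two; the function names are hypothetical, and each pass of the `while` loop stands in for one `workgroupBarrier()` interval.

```python
def workgroup_reduce(values):
    """Emulate the stride-halving sum for one workgroup; returns (sum, barrier_steps)."""
    scratch = list(values)
    n = len(scratch)
    if n == 0 or n & (n - 1):
        raise ValueError("workgroup size must be a power of two")
    stride, steps = n // 2, 0
    while stride > 0:
        # Every "thread" with lid < stride adds its partner at lid + stride.
        for lid in range(stride):
            scratch[lid] += scratch[lid + stride]
        stride //= 2
        steps += 1
    return scratch[0], steps


def reduce_all(data, workgroup_size=256):
    """One partial sum per workgroup; the tail is zero-padded like out-of-range threads."""
    padded = list(data) + [0.0] * (-len(data) % workgroup_size)
    return [
        workgroup_reduce(padded[i:i + workgroup_size])[0]
        for i in range(0, len(padded), workgroup_size)
    ]


print(workgroup_reduce(range(256)))
print(reduce_all([1.0] * 600))
```

A workgroup of 256 threads finishes in 8 steps, and the per-workgroup partial sums still need a second pass (or a CPU sum) to produce the final total.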
CUDA matmul kernel
__global__ void matmul(const float* A, const float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= N || col >= N) return;

    float sum = 0.0f;
    for (int k = 0; k < N; ++k) {
        sum += A[row * N + k] * B[k * N + col];
    }
    C[row * N + col] = sum;
}
// Launch: matmul<<<dim3((N+15)/16, (N+15)/16), dim3(16,16)>>>(A, B, C, N);
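The 2D launch configuration and the row/column mapping can be verified with a slow CPU emulation. This is only a reference sketch for small matrices; `launch_grid` and `matmul_emulated` are hypothetical names, and the nested loops stand in for blocks and threads that a GPU would run in parallel.

```python
def launch_grid(n: int, block: int = 16):
    """Blocks per grid dimension, matching (N + 15) / 16 for block = 16."""
    blocks = (n + block - 1) // block
    return blocks, blocks


def matmul_emulated(A, B, n: int, block: int = 16):
    """Row-major n x n matmul, visiting every (block, thread) pair like the kernel."""
    grid_x, grid_y = launch_grid(n, block)
    C = [0.0] * (n * n)
    for by in range(grid_y):
        for bx in range(grid_x):
            for ty in range(block):
                for tx in range(block):
                    row = by * block + ty
                    col = bx * block + tx
                    if row >= n or col >= n:
                        continue  # same guard as the kernel's early return
                    C[row * n + col] = sum(A[row * n + k] * B[k * n + col] for k in range(n))
    return C


A = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
identity = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]
print(matmul_emulated(A, identity, 3, block=2))
```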
Triton kernel (Python ML)
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

# Launch
add_kernel[(triton.cdiv(N, 1024),)](x, y, out, N, BLOCK=1024)
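Triton programs operate on a whole block of offsets at once, with a mask to protect the tail. The pure-Python sketch below mimics that block-plus-mask pattern without Triton or a GPU; `cdiv` and `add_program` are stand-ins written for this illustration, not Triton APIs.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division, the same arithmetic as triton.cdiv."""
    return (a + b - 1) // b


def add_program(pid: int, x, y, out, n: int, block: int) -> None:
    """One 'program instance': handles a block of offsets, masked against n."""
    offsets = [pid * block + i for i in range(block)]
    mask = [o < n for o in offsets]
    for o, keep in zip(offsets, mask):
        if keep:  # masked-off lanes neither load nor store
            out[o] = x[o] + y[o]


n, block = 10, 4
x = list(range(n))
y = [10] * n
out = [None] * n
for pid in range(cdiv(n, block)):  # the launch grid
    add_program(pid, x, y, out, n, block)
print(out)
```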
Image blur compute (storage texture)
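@group(0) @binding(0) var src : texture_2d<f32>;
@group(0) @binding(1) var dst : texture_storage_2d<rgba8unorm, write>;

@compute @workgroup_size(8, 8)
fn blur(@builtin(global_invocation_id) gid : vec3u) {
    let dims = vec2i(textureDimensions(src));
    let center = vec2i(gid.xy);
    // Skip invocations outside the texture (the dispatch is rounded up to 8x8 tiles).
    if (center.x >= dims.x || center.y >= dims.y) { return; }

    var sum = vec4f(0);
    for (var dy = -1; dy <= 1; dy++) {
        for (var dx = -1; dx <= 1; dx++) {
            // Clamp to the edge so border pixels do not sample out of bounds.
            let p = clamp(center + vec2i(dx, dy), vec2i(0), dims - vec2i(1));
            sum = sum + textureLoad(src, p, 0);
        }
    }
    textureStore(dst, center, sum / 9.0);
}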
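A CPU reference is useful for checking a blur kernel's output. The sketch below is a single-channel 3x3 box blur; it handles borders by clamping coordinates to the edge, which is one possible policy and an assumption of this example. `box_blur_3x3` is a hypothetical name.

```python
def box_blur_3x3(img):
    """3x3 box blur over a 2D list of floats, clamping sample coordinates at the borders."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    sy = min(max(y + dy, 0), h - 1)  # clamp row
                    sx = min(max(x + dx, 0), w - 1)  # clamp column
                    total += img[sy][sx]
            out[y][x] = total / 9.0
    return out


impulse = [
    [0.0, 0.0, 0.0],
    [0.0, 9.0, 0.0],
    [0.0, 0.0, 0.0],
]
print(box_blur_3x3(impulse))
```

On this 3x3 impulse every output pixel samples the bright center exactly once, so the energy is spread evenly across the image.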
Decision criteria
| Situation | Approach |
|---|---|
| Browser, cross-platform | WebGPU + WGSL |
| Native cross-platform | Vulkan compute + GLSL/HLSL |
| NVIDIA-only, max perf | CUDA |
| Apple ecosystem | Metal compute (MSL) |
| ML kernel research | Triton (Python) |
| Production ML inference | Pre-built (cuDNN, MLX, vLLM kernels) |
Defaults: web/cross-platform → WebGPU + WGSL. ML research → Triton. Production NVIDIA ML → CUDA + cuDNN/cuBLAS.
🔗 Graph
- Parent: GPU Programming · Parallel Computing
- Variants: Vertex Shader · Fragment Shader · CUDA
- Applications: WebGPU
- Adjacent: Triton · Vulkan
🤖 LLM usage
When to use: heavy parallel data (images, ML, simulation), browser GPU compute, custom ML kernels. When not to use: small data (<10k items, where the CPU is faster once transfer cost is counted), branch-heavy serial logic, very small kernels (launch overhead dominates).
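The trade-off between transfer cost, launch overhead, and compute time can be put into a rough cost model. Every number below (per-item costs, link bandwidth, launch overhead) is a made-up placeholder for illustration, not a measurement; `estimate_us` is a hypothetical helper, and real decisions should rest on profiling.

```python
def estimate_us(n_items, bytes_per_item, cpu_ns_per_item, gpu_ns_per_item,
                link_gb_per_s, launch_overhead_us):
    """Return (cpu_time_us, gpu_time_us) under a simple linear cost model."""
    cpu_us = n_items * cpu_ns_per_item / 1_000
    # Upload plus readback over the CPU-GPU link.
    transfer_us = 2 * n_items * bytes_per_item / (link_gb_per_s * 1e9) * 1e6
    gpu_us = launch_overhead_us + transfer_us + n_items * gpu_ns_per_item / 1_000
    return cpu_us, gpu_us


# Placeholder parameters: 4-byte items, 1 ns/item on CPU, 0.01 ns/item on GPU,
# 10 GB/s link, 20 microseconds of launch overhead.
params = dict(bytes_per_item=4, cpu_ns_per_item=1.0, gpu_ns_per_item=0.01,
              link_gb_per_s=10.0, launch_overhead_us=20.0)

small_cpu, small_gpu = estimate_us(1_000, **params)
large_cpu, large_gpu = estimate_us(100_000_000, **params)
print(small_cpu < small_gpu, large_gpu < large_cpu)
```

Under these placeholder values the fixed launch overhead dominates the small case, while the large case is bounded mostly by transfer time, which is why keeping data resident on the GPU matters.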
❌ Anti-patterns
- Divergent branching within a warp: threads take different paths, execution serializes, and the SIMT advantage is lost.
- Uncoalesced memory access: random access patterns waste bandwidth; adjacent threads should read adjacent memory.
- Tiny dispatch: with around 100 threads, launch overhead exceeds the work; batch instead.
- Forgetting workgroupBarrier: race condition on shared memory.
- CPU↔GPU ping-pong: copying data back after every step; keep data on the GPU.
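The first two anti-patterns can be illustrated with a deliberately simplified model. It assumes a 32-lane warp, 4-byte elements, and 128-byte memory segments, and it counts one serialized pass per distinct branch path; real hardware is more nuanced, and both function names are hypothetical.

```python
def divergent_passes(branch_of_lane):
    """Serialized passes for one warp: one per distinct branch path taken by its lanes."""
    return len(set(branch_of_lane))


def memory_transactions(byte_addresses, segment_bytes=128):
    """Distinct memory segments touched by one warp-wide load."""
    return len({addr // segment_bytes for addr in byte_addresses})


lanes = range(32)

uniform = divergent_passes([0 for _ in lanes])          # all lanes take the same path
two_way = divergent_passes([lane % 2 for lane in lanes])  # odd/even lanes split

coalesced = memory_transactions([lane * 4 for lane in lanes])       # adjacent f32 values
strided = memory_transactions([lane * 32 * 4 for lane in lanes])    # stride of 32 floats

print(uniform, two_way, coalesced, strided)
```

In this model the strided pattern touches 32 segments to fetch the same amount of useful data that the coalesced pattern gets from one.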
🧪 Verification / duplicates
- Verified (WebGPU spec 2026 W3C / CUDA Programming Guide 12.x / Triton docs / Apple MSL).
- Trust level: A.
🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — WGSL/CUDA/Triton + workgroup model |