Antigravity Agent f8b21af4be Wiki cleanup: error-doc removal, dedup merge, link normalization

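10_Wiki/Topics large-scale cleanup:
- Removed 227 error-capture / unfinished stub documents
- Merged 43 cross-folder duplicate clusters (63 files → redirect)
- Link-name normalization: fixed broken links, pointed links directly at redirect targets, mapped concepts (~2,400 changes)
- Created 6 new category MOCs
- Removed 10,058 unresolved related-keyword links from Graph sections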

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-20 23:52:15 +09:00


Compute Shaders

One-liner

"A GPU program that runs outside the graphics pipeline and performs arbitrary parallel computation." Unlike vertex and fragment shaders, a compute shader is not tied to rendering; it exposes raw SIMT execution directly. As of 2026 this model underlies WebGPU (browser GPU compute), CUDA, Vulkan compute, and Metal compute, and it dominates workloads such as ML inference, particle simulation, and image processing. ML attention and matmul kernels are, at their core, compute shaders as well.

Key concepts

Execution model

Workgroup: a group of threads that run the same shader together (e.g. 64 or 256 threads).
Invocation: a single thread.
Shared memory (workgroup): fast, visible only within one workgroup.
Storage buffer: GPU global memory (read/write).
Uniform buffer: small, read-only constants.
Dispatch: the CPU launches N workgroups.
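
The relation between these terms can be sketched on the CPU. The snippet below is a plain-Python illustration, not GPU code; `dispatch_ids` is a hypothetical helper name, and it assumes a 1D dispatch where each invocation handles one item.

```python
import math

def dispatch_ids(n_items: int, workgroup_size: int):
    """Yield (workgroup_id, local_id, global_id) for a 1D dispatch covering n_items."""
    num_workgroups = math.ceil(n_items / workgroup_size)  # what the CPU dispatches
    for workgroup_id in range(num_workgroups):
        for local_id in range(workgroup_size):
            # Same relation a shader sees between its built-in IDs.
            global_id = workgroup_id * workgroup_size + local_id
            yield workgroup_id, local_id, global_id

ids = list(dispatch_ids(n_items=10, workgroup_size=4))
active = [g for _, _, g in ids if g < 10]   # invocations that pass the bounds check
idle = [g for _, _, g in ids if g >= 10]    # padding invocations in the last workgroup

print(len(ids), len(active), idle)  # 12 10 [10, 11]
```

Because the dispatch is rounded up to whole workgroups, the last workgroup usually contains padding invocations, which is why kernels start with a bounds check.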

Hardware mapping

NVIDIA: warp (32 threads), SM (streaming multiprocessor).
AMD: wave (64 threads; RDNA: natively 32), CU (compute unit).
Apple: simdgroup (32 threads), GPU core.
Intel: subgroup, EU (execution unit).
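
One practical consequence of these widths is that a workgroup size which is not a multiple of the hardware width leaves lanes idle. The sketch below is a simplified model that assumes the hardware packs a workgroup's invocations into fixed-width waves in order; `wave_occupancy` is a hypothetical helper, and real schedulers involve more factors (registers, shared memory, and so on).

```python
import math

def wave_occupancy(workgroup_size: int, wave_width: int):
    """Return (waves needed, fraction of lanes doing useful work)."""
    waves = math.ceil(workgroup_size / wave_width)
    utilization = workgroup_size / (waves * wave_width)
    return waves, utilization

for size in (64, 100, 256):
    for width in (32, 64):
        waves, util = wave_occupancy(size, width)
        print(f"workgroup={size:3d} width={width}: {waves} waves, {util:.0%} lanes used")
```

Under this model a workgroup of 100 invocations on 32-wide hardware occupies 4 waves (128 lanes) with 28 lanes idle, which is one reason sizes such as 64 or 256 are common choices.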

Languages (2026)

WGSL (WebGPU): cross-platform, modern.
HLSL (DirectX, Vulkan via DXC).
GLSL (OpenGL, Vulkan).
MSL (Metal Shading Language).
CUDA C++: NVIDIA only, but mature.
Triton (OpenAI): Python-like ML kernel DSL.

Applications

ML inference (matmul, attention, conv).
Image filters (blur, edge, color grading).
Particle systems / fluid sim.
Physics (cloth, soft body, mass-spring).
Cryptography (proof-of-work, hash collisions).
Video encode/decode prep.

💻 Patterns

WGSL compute shader — vector add

@group(0) @binding(0) var<storage, read>      a : array<f32>;
@group(0) @binding(1) var<storage, read>      b : array<f32>;
@group(0) @binding(2) var<storage, read_write> c : array<f32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid : vec3u) {
  let i = gid.x;
  if (i >= arrayLength(&a)) { return; }
  c[i] = a[i] + b[i];
}

WebGPU dispatch (TypeScript)

const adapter = await navigator.gpu.requestAdapter();
const device = await adapter!.requestDevice();

const module = device.createShaderModule({ code: WGSL_SOURCE });
const pipeline = device.createComputePipeline({
  layout: 'auto',
  compute: { module, entryPoint: 'main' },
});

const N = 1_000_000;
const buf = (data: Float32Array) => {
  const b = device.createBuffer({
    size: data.byteLength,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST | GPUBufferUsage.COPY_SRC,
  });
  device.queue.writeBuffer(b, 0, data);
  return b;
};

const a = buf(new Float32Array(N).fill(1));
const b = buf(new Float32Array(N).fill(2));
const c = device.createBuffer({
  size: N * 4,
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC,
});

const bindGroup = device.createBindGroup({
  layout: pipeline.getBindGroupLayout(0),
  entries: [
    { binding: 0, resource: { buffer: a } },
    { binding: 1, resource: { buffer: b } },
    { binding: 2, resource: { buffer: c } },
  ],
});

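// Storage buffers cannot be mapped directly, so results are copied
// into a separate mappable staging buffer for readback.
const readback = device.createBuffer({
  size: N * 4,
  usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST,
});

const enc = device.createCommandEncoder();
const pass = enc.beginComputePass();
pass.setPipeline(pipeline);
pass.setBindGroup(0, bindGroup);
pass.dispatchWorkgroups(Math.ceil(N / 64)); // 64 must match @workgroup_size in the shader
pass.end();
enc.copyBufferToBuffer(c, 0, readback, 0, N * 4);
device.queue.submit([enc.finish()]);

// Wait for the GPU, then copy the mapped range into a CPU-side array.
await readback.mapAsync(GPUMapMode.READ);
const result = new Float32Array(readback.getMappedRange().slice(0));
readback.unmap();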
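
The dispatch above fits in one dimension because 1,000,000 / 64 = 15,625 workgroups. WebGPU caps the workgroup count per dimension (the default limit is 65,535), so larger inputs need a 2D dispatch. The sketch below shows one way to pick the counts; `dispatch_dims` is a hypothetical helper, and the real limit should be queried from the device rather than hard-coded.

```python
import math

MAX_WORKGROUPS_PER_DIM = 65_535   # WebGPU default limit; query the device for the real value

def dispatch_dims(n_items, workgroup_size=64, limit=MAX_WORKGROUPS_PER_DIM):
    """Return (x, y) workgroup counts that cover n_items within the per-dimension limit."""
    total = math.ceil(n_items / workgroup_size)
    if total <= limit:
        return total, 1
    y = math.ceil(total / limit)   # fewest rows that keep x within the limit
    x = math.ceil(total / y)
    return x, y

print(dispatch_dims(1_000_000))     # (15625, 1)
print(dispatch_dims(10_000_000))    # (52084, 3)
```

With a 2D dispatch the shader would have to rebuild a linear index from both components of its global ID and keep the bounds check, since x * y can exceed the exact number of workgroups needed.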

Shared memory reduction (workgroup-local)

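@group(0) @binding(0) var<storage, read>       input  : array<f32>;
@group(0) @binding(1) var<storage, read_write> output : array<f32>;

// `shared` is a reserved word in WGSL, so the scratch array needs another name.
var<workgroup> scratch : array<f32, 256>;

@compute @workgroup_size(256)
fn reduce(
  @builtin(local_invocation_id) lid : vec3u,
  @builtin(global_invocation_id) gid : vec3u,
  @builtin(workgroup_id) wid : vec3u,
) {
  // Out-of-range invocations contribute 0 instead of returning early,
  // so every invocation still reaches the barriers below.
  var v = 0.0;
  if (gid.x < arrayLength(&input)) { v = input[gid.x]; }
  scratch[lid.x] = v;
  workgroupBarrier();

  // Tree reduction: halve the active range each step.
  var stride : u32 = 128u;
  loop {
    if (stride == 0u) { break; }
    if (lid.x < stride) {
      scratch[lid.x] = scratch[lid.x] + scratch[lid.x + stride];
    }
    workgroupBarrier();
    stride = stride / 2u;
  }

  // One partial sum per workgroup; a second pass (or the CPU) sums these.
  if (lid.x == 0u) { output[wid.x] = scratch[0]; }
}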
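
The stride-halving loop can be traced on the CPU to see why it needs log2(256) = 8 steps and why the kernel produces only per-workgroup partial sums. The following is a plain-Python emulation, not GPU code; `workgroup_reduce` and `reduce_all` are hypothetical names, and the sequential inner loop stands in for threads that would run in parallel between barriers.

```python
def workgroup_reduce(values):
    """Sum one workgroup's values with the stride-halving tree reduction."""
    scratch = list(values)          # stands in for workgroup shared memory
    stride = len(scratch) // 2      # length must be a power of two
    steps = 0
    while stride > 0:
        # All "threads" with lid < stride run this step; the end of the
        # inner loop plays the role of workgroupBarrier().
        for lid in range(stride):
            scratch[lid] += scratch[lid + stride]
        stride //= 2
        steps += 1
    return scratch[0], steps

def reduce_all(data, workgroup_size=256):
    """First pass: one partial sum per workgroup. Second pass: sum the partials."""
    partials = []
    for start in range(0, len(data), workgroup_size):
        chunk = data[start:start + workgroup_size]
        chunk += [0.0] * (workgroup_size - len(chunk))   # pad the tail with the identity
        partials.append(workgroup_reduce(chunk)[0])
    return sum(partials)

total, steps = workgroup_reduce([1.0] * 256)
print(total, steps)                      # 256.0 8
print(reduce_all([1.0] * 1000))          # 1000.0
```

Within one step, writes go to indices below `stride` and reads come from indices at or above it, so the threads of a step never conflict; the barrier is only needed between steps.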

CUDA matmul kernel

__global__ void matmul(const float* A, const float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= N || col >= N) return;

    float sum = 0.0f;
    for (int k = 0; k < N; ++k) {
        sum += A[row * N + k] * B[k * N + col];
    }
    C[row * N + col] = sum;
}
// Launch: matmul<<<dim3((N+15)/16, (N+15)/16), dim3(16,16)>>>(A, B, C, N);
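
To check the index mapping and the bounds guard without a GPU, the launch can be emulated on the CPU. This is a plain-Python sketch under the same 16×16 block shape as the launch comment; `emulate_matmul_launch` is a hypothetical helper, and it runs the "threads" sequentially rather than in parallel.

```python
def emulate_matmul_launch(A, B, n, block=16):
    """Run the naive kernel body once per (block, thread) index, like the launch comment."""
    grid = (n + block - 1) // block            # same rounding as (N+15)/16
    C = [0.0] * (n * n)
    launched = guarded = 0
    for by in range(grid):
        for bx in range(grid):
            for ty in range(block):
                for tx in range(block):
                    launched += 1
                    row = by * block + ty
                    col = bx * block + tx
                    if row >= n or col >= n:   # the kernel's early return
                        guarded += 1
                        continue
                    C[row * n + col] = sum(A[row * n + k] * B[k * n + col] for k in range(n))
    return C, launched, guarded

n = 20
A = [float(i % 7) for i in range(n * n)]
identity = [1.0 if r == c else 0.0 for r in range(n) for c in range(n)]
C, launched, guarded = emulate_matmul_launch(A, identity, n)
print(C == A, launched, guarded)   # True 1024 624
```

For N = 20 the grid is 2×2 blocks, so 1,024 threads are launched for 400 output elements and 624 exit at the guard. Without that guard those threads would read and write out of bounds.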

Triton kernel (Python ML)

import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

# Launch
add_kernel[(triton.cdiv(N, 1024),)](x, y, out, N, BLOCK=1024)

Image blur compute (storage texture)

@group(0) @binding(0) var src : texture_2d<f32>;
@group(0) @binding(1) var dst : texture_storage_2d<rgba8unorm, write>;

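@compute @workgroup_size(8, 8)
fn blur(@builtin(global_invocation_id) gid : vec3u) {
  let size = vec2i(textureDimensions(src));
  let center = vec2i(gid.xy);
  // Skip padding invocations when the image size is not a multiple of 8.
  if (center.x >= size.x || center.y >= size.y) { return; }

  var sum = vec4f(0.0);
  for (var dy = -1; dy <= 1; dy++) {
    for (var dx = -1; dx <= 1; dx++) {
      // Clamp to the edge so border pixels reuse their nearest neighbours.
      let p = clamp(center + vec2i(dx, dy), vec2i(0), size - vec2i(1));
      sum += textureLoad(src, p, 0);
    }
  }
  textureStore(dst, center, sum / 9.0);
}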
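
A CPU reference makes it easier to validate a blur kernel's output on small images. The sketch below is plain Python on a single-channel image and assumes clamp-to-edge border handling, which is one choice among several; `box_blur_3x3` is a hypothetical name.

```python
def box_blur_3x3(image):
    """3x3 box blur with clamp-to-edge borders; image is a list of rows of floats."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):            # one (x, y) pair per GPU invocation
        for x in range(width):
            total = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    sy = min(max(y + dy, 0), height - 1)   # clamp row
                    sx = min(max(x + dx, 0), width - 1)    # clamp column
                    total += image[sy][sx]
            out[y][x] = total / 9.0
    return out

flat = [[5.0] * 4 for _ in range(3)]
spike = [[0.0, 0.0, 0.0], [0.0, 9.0, 0.0], [0.0, 0.0, 0.0]]
print(box_blur_3x3(flat) == flat)   # True: a constant image is unchanged
print(box_blur_3x3(spike)[1][1])    # 1.0
```

A constant image is a handy first test because any correct normalized blur must leave it unchanged, including at the borders.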

Decision criteria

Situation	Approach
Browser, cross-platform	WebGPU + WGSL
Native cross-platform	Vulkan compute + GLSL/HLSL
NVIDIA-only, max perf	CUDA
Apple ecosystem	Metal compute (MSL)
ML kernel research	Triton (Python)
Production ML inference	Pre-built (cuDNN, MLX, vLLM kernels)

Defaults: web/cross-platform → WebGPU + WGSL. ML research → Triton. Production NVIDIA ML → CUDA + cuDNN/cuBLAS.

🔗 Graph

Parent: GPU Programming · Parallel Computing
Variants: Vertex Shader · Fragment Shader · CUDA
Applications: WebGPU
Adjacent: Triton · Vulkan

🤖 LLM usage

When to use: heavy parallel data (image, ML, sim), browser GPU compute, custom ML kernels. When not to use: small data (<10k items; the CPU is faster once transfer cost is counted), branch-heavy serial logic, very small kernels (launch overhead dominates).
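
The small-data rule of thumb can be made concrete with a rough cost model. Every number below is a made-up illustrative assumption, not a measurement, and `gpu_worth_it` is a hypothetical helper; real break-even points depend on the device, the bus, and the kernel.

```python
def gpu_worth_it(n_items, cpu_ns_per_item, gpu_ns_per_item,
                 launch_overhead_us, bytes_per_item, bandwidth_gb_s):
    """Compare estimated CPU time with GPU time including transfer and launch (microseconds)."""
    cpu_us = n_items * cpu_ns_per_item / 1_000
    transfer_us = n_items * bytes_per_item / (bandwidth_gb_s * 1_000)  # 1 GB/s == 1000 bytes/us
    gpu_us = launch_overhead_us + transfer_us + n_items * gpu_ns_per_item / 1_000
    return cpu_us, gpu_us, gpu_us < cpu_us

# Hypothetical figures: 1 ns/item on CPU, 0.02 ns/item on GPU, 50 us launch cost,
# 8 bytes moved per item (upload + download), 10 GB/s effective transfer bandwidth.
for n in (10_000, 1_000_000):
    cpu_us, gpu_us, use_gpu = gpu_worth_it(n, 1.0, 0.02, 50.0, 8, 10.0)
    print(f"n={n}: cpu={cpu_us:.1f}us gpu={gpu_us:.1f}us use_gpu={use_gpu}")
```

Under these assumed figures the fixed launch cost dominates at 10,000 items and the GPU loses, while at 1,000,000 items the transfer becomes the main GPU cost, which is why keeping data resident on the GPU matters.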

❌ Anti-patterns

Divergent branching in a warp: threads that take different paths are serialized, which defeats SIMT.
Uncoalesced memory access: random access patterns waste bandwidth; adjacent threads should read adjacent memory.
Tiny dispatch: with ~100 threads, launch overhead exceeds the work; batch small jobs together.
Forgetting workgroupBarrier: causes race conditions on shared memory.
CPU↔GPU ping-pong: copying results back after every step; keep data on the GPU.
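
The coalescing point can be illustrated by counting how many memory segments one warp touches for different access strides. The sketch below is a simplified model that assumes 32 threads, 4-byte elements, 128-byte segments, and an aligned base address; `segments_touched` is a hypothetical helper, and actual transaction sizes vary by GPU.

```python
def segments_touched(stride_elems, threads=32, elem_bytes=4, segment_bytes=128):
    """Count distinct memory segments one warp touches when thread t reads element t * stride."""
    addresses = [t * stride_elems * elem_bytes for t in range(threads)]
    return len({addr // segment_bytes for addr in addresses})

for stride in (1, 2, 32):
    print(stride, segments_touched(stride))
# 1 1
# 2 2
# 32 32
```

In this model a stride of 1 serves the whole warp from a single segment, while a stride of 32 elements needs 32 separate segments for the same amount of useful data.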

🧪 Verification / duplicates

Verified (WebGPU spec 2026 W3C / CUDA Programming Guide 12.x / Triton docs / Apple MSL).
Confidence: A.

🕓 Changelog

Date	Change
2026-05-08	Phase 1
2026-05-10	Manual cleanup — WGSL/CUDA/Triton + workgroup model
