Antigravity Agent 504fd5fb42 [G1-Sync] Manual knowledge update

2026-05-10 22:08:15 +09:00

id, title, category, status, canonical_id, aliases, duplicate_of, source_trust_level, confidence_score, verification_status, tags, raw_sources, last_reinforced, github_commit, tech_stack

title

Matrix Operations and AI

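One-line summary

"Every model is ultimately a matrix multiplication." Transformers, CNNs, and RNNs are all graphs of GEMM (General Matrix Multiply) calls, and performance is determined by how well BLAS, cuBLAS, and Tensor Cores are utilized.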

Core

Core operations

MatMul (GEMM): C = A @ B. FLOPs = 2·M·N·K for A of shape (M, K) and B of shape (K, N). The essence of every dense layer.
Element-wise: ReLU, add, multiply. Memory-bound.
Reduction: sum/mean/max. The core of Softmax and LayerNorm.
Broadcasting: automatic shape expansion (NumPy/PyTorch convention).
Einsum: einsum('bij,bjk->bik') - expresses a batched matmul.
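
The FLOPs formula and the "memory-bound" label above can be made concrete with a little arithmetic. The sketch below is an illustration only: the helper names are hypothetical, the byte count assumes each operand is read or written exactly once (an idealized lower bound), and 2 bytes per element assumes bf16/fp16 storage.

```python
def gemm_flops(m: int, n: int, k: int) -> int:
    """FLOPs for C = A @ B with A of shape (m, k) and B of shape (k, n)."""
    return 2 * m * n * k  # one multiply and one add per (i, j, p) triple


def gemm_bytes(m: int, n: int, k: int, bytes_per_elem: int = 2) -> int:
    """Idealized memory traffic: read A and B once, write C once."""
    return bytes_per_elem * (m * k + k * n + m * n)


def arithmetic_intensity(flops: int, bytes_moved: int) -> float:
    """FLOPs per byte moved; low values indicate a memory-bound op."""
    return flops / bytes_moved


# GEMM with M=128, K=256, N=512 (hypothetical sizes for illustration)
flops = gemm_flops(128, 512, 256)
gemm_ai = arithmetic_intensity(flops, gemm_bytes(128, 512, 256))

# Element-wise ReLU on the (128, 512) output: about 1 op per element,
# one read and one write per element.
relu_elems = 128 * 512
relu_ai = arithmetic_intensity(relu_elems, 2 * 2 * relu_elems)

print(flops, round(gemm_ai, 2), relu_ai)
```

Under these assumptions the GEMM performs tens of FLOPs per byte while the element-wise op performs a fraction of one, which is why the latter is limited by memory bandwidth rather than compute.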

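Applications

Attention: softmax(QK^T / √d) V. The formula itself is two matmuls (QK^T, then the product with V); a standard attention layer adds four projection matmuls (Q, K, V, output).
Conv2d: converted to a matmul via im2col, or computed with Winograd/FFT.
Embedding lookup: mathematically a sparse matmul (one-hot @ W), executed as a row gather.
LayerNorm/RMSNorm: reduction + element-wise.
Mixture of Experts: grouped matmul (with distributed routing).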
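
Two items from the list above can be checked directly on small arrays. This is a minimal NumPy sketch (NumPy is used here instead of PyTorch so it runs on CPU); the function names, shapes, and random inputs are assumptions made for illustration.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    # Subtract the max along the axis (a reduction) for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    d = Q.shape[-1]
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d)  # batched matmul: (..., q, k)
    weights = softmax(scores, axis=-1)                 # reduction + element-wise
    return weights @ V                                 # batched matmul: (..., q, d_v)


rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 4, 16, 64))  # (batch, heads, seq_q, head_dim)
K = rng.standard_normal((2, 4, 32, 64))  # (batch, heads, seq_k, head_dim)
V = rng.standard_normal((2, 4, 32, 64))
out = attention(Q, K, V)

# Embedding lookup: one-hot @ W selects the same rows as indexing W.
W = rng.standard_normal((10, 8))         # (vocab, dim)
ids = np.array([3, 7, 3])
one_hot = np.eye(10)[ids]                # (3, vocab)
same = np.allclose(one_hot @ W, W[ids])
```

The one-hot product is only a way to reason about the lookup; indexing avoids multiplying by a matrix that is almost entirely zeros.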

💻 Patterns

Pattern 1 — Basic MatMul (PyTorch)

import torch
A = torch.randn(128, 256, device='cuda')
B = torch.randn(256, 512, device='cuda')
C = A @ B  # or torch.matmul(A, B)
# Batched: (B, M, K) @ (B, K, N) -> (B, M, N)

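Pattern 2 — Einsum (explicit)

import torch
Q = torch.randn(2, 4, 16, 64)  # (batch, heads, seq_q, head_dim), example sizes
K = torch.randn(2, 4, 32, 64)  # (batch, heads, seq_k, head_dim)
# Attention scores: batch, heads, seq_q, seq_k
scores = torch.einsum('bhqd,bhkd->bhqk', Q, K)
# Explicit index labels guard against transpose mistakes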
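
To see what the einsum spec saves, the same contraction can be written with a hand-written transpose. A small NumPy sketch with assumed example shapes (NumPy's einsum accepts the same spec string):

```python
import numpy as np

rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 4, 16, 64))  # (batch, heads, seq_q, head_dim)
K = rng.standard_normal((2, 4, 32, 64))  # (batch, heads, seq_k, head_dim)

# Index labels state which axes are contracted (d) and which are kept.
scores_einsum = np.einsum('bhqd,bhkd->bhqk', Q, K)

# Equivalent matmul: the last two axes of K must be swapped by hand.
scores_matmul = Q @ K.transpose(0, 1, 3, 2)

print(scores_einsum.shape)
```

Picking the wrong pair of axes in the manual transpose is easy to miss when seq_k happens to equal head_dim, since the result can still have a valid shape; the einsum spec leaves no such ambiguity.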

Pattern 3 — Broadcasting caveats

import torch
a = torch.randn(32, 1, 128)
b = torch.randn(1, 64, 128)
c = a + b  # (32, 64, 128): size-1 dims expand automatically
# An unintended shape mismatch therefore becomes a silent bug, not an error
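
A common way this bites in practice is a prediction with a trailing singleton dimension subtracted from a flat target. The NumPy sketch below reproduces it; `checked_sub` is a hypothetical helper written for this note, not a library function.

```python
import numpy as np

pred = np.zeros((4, 1))   # model output with a trailing singleton dimension
target = np.ones(4)       # labels as a flat vector

diff = pred - target      # broadcasts to (4, 4) instead of raising
print(diff.shape)


def checked_sub(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Hypothetical guard: subtract only when the shapes match exactly."""
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {a.shape} vs {b.shape}")
    return a - b


fixed = checked_sub(pred.squeeze(-1), target)  # both operands are (4,)
```

An explicit shape assertion at the boundary between model output and loss is usually enough to catch this class of bug.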

Pattern 4 — Mixed Precision (Tensor Core)

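# model and x are assumed to be an nn.Module and an input tensor already on the GPU
with torch.autocast('cuda', dtype=torch.bfloat16):
    out = model(x)  # GEMMs run in bf16 with fp32 accumulation
# Roughly 2-8x throughput on A100/H100, depending on the workload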
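
Why the accumulator stays in fp32 can be illustrated on CPU. NumPy has no bf16 type, so this sketch substitutes fp16 as the low-precision format (an assumption made purely for illustration) and compares a low-precision accumulator against an fp32 one on identical inputs.

```python
import numpy as np

values = np.full(4096, 0.1, dtype=np.float16)  # low-precision inputs

acc16 = np.float16(0.0)
acc32 = np.float32(0.0)
for v in values:
    acc16 = np.float16(acc16 + v)    # low-precision accumulator
    acc32 = acc32 + np.float32(v)    # fp32 accumulator, same fp16 inputs

exact = float(values[0]) * len(values)  # reference computed in float64
err16 = abs(float(acc16) - exact)
err32 = abs(float(acc32) - exact)
print(err16, err32)
```

In this setup the low-precision accumulator stops growing once each addend is smaller than half the spacing between representable values near the running sum, while the fp32 accumulator tracks the reference closely. The same concern motivates keeping low-precision inputs but a wider accumulator inside a GEMM.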

Pattern 5 — Fused Kernel (FlashAttention)

from torch.nn.functional import scaled_dot_product_attention
out = scaled_dot_product_attention(Q, K, V, is_causal=True)
# Q@K^T → softmax → @V run as one fused kernel in on-chip SRAM (no HBM round trips for the score matrix)
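
The idea behind the fused kernel is that softmax can be computed block by block with a running maximum and a running denominator, so the full score matrix never has to be stored. The following is a NumPy sketch of that online-softmax recurrence, not the actual FlashAttention kernel: it is single-head, has no causal mask, and the function names and block size are assumptions.

```python
import numpy as np


def attention_reference(Q, K, V):
    """Materializes the full (n_q, n_k) score matrix."""
    s = Q @ K.T / np.sqrt(Q.shape[1])
    p = np.exp(s - s.max(axis=1, keepdims=True))
    return (p / p.sum(axis=1, keepdims=True)) @ V


def attention_tiled(Q, K, V, block=4):
    """Processes K/V in blocks, keeping only running statistics per query."""
    n_q, d = Q.shape
    out = np.zeros((n_q, V.shape[1]))  # unnormalized output accumulator
    m = np.full(n_q, -np.inf)          # running row max
    l = np.zeros(n_q)                  # running softmax denominator
    for start in range(0, K.shape[0], block):
        Kb = K[start:start + block]
        Vb = V[start:start + block]
        s = Q @ Kb.T / np.sqrt(d)              # scores for this block only
        m_new = np.maximum(m, s.max(axis=1))
        scale = np.exp(m - m_new)              # rescale old stats; 0 on the first block
        p = np.exp(s - m_new[:, None])
        l = l * scale + p.sum(axis=1)
        out = out * scale[:, None] + p @ Vb
        m = m_new
    return out / l[:, None]


rng = np.random.default_rng(0)
Q = rng.standard_normal((5, 8))
K = rng.standard_normal((10, 8))   # 10 keys: the last block is ragged
V = rng.standard_normal((10, 6))
match = np.allclose(attention_tiled(Q, K, V), attention_reference(Q, K, V))
```

Both functions produce the same result up to floating-point rounding; the tiled version only ever holds an (n_q, block) slice of scores, which is what lets a real kernel keep the working set in on-chip memory.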

Pattern 6 — Memory Layout (contiguous)

x = x.transpose(1, 2).contiguous()  # copy into memory so strides match the new shape
# Non-contiguous operands can force extra copies or slower kernels in matmul
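
What "contiguous" means can be inspected through strides. This sketch uses NumPy, where `np.ascontiguousarray` plays roughly the role of PyTorch's `.contiguous()`; the array shape is an arbitrary example.

```python
import numpy as np

x = np.zeros((8, 16, 32), dtype=np.float32)
xt = x.transpose(0, 2, 1)        # a view: same buffer, permuted strides
xc = np.ascontiguousarray(xt)    # a copy laid out in row-major order

# Strides are in bytes per step along each axis.
print(x.strides, xt.strides, xc.strides)
print(xt.flags['C_CONTIGUOUS'], xc.flags['C_CONTIGUOUS'])
```

The transposed view no longer walks memory with its last axis at unit stride, so a kernel that expects row-major input either copies it first or reads with strided access.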

Pattern 7 — torch.compile (kernel fusion)

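import torch
import torch.nn.functional as F
@torch.compile
def block(x):  # W1, W2: weight tensors assumed to be defined in the enclosing scope
    return F.gelu(x @ W1) @ W2
# Inductor fuses the element-wise ops around the GEMMs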

Pattern 8 — JAX/XLA

import jax
import jax.numpy as jnp

@jax.jit
def fwd(x, W):
    return jnp.einsum('bd,dk->bk', x, W)

Decision criteria

Situation	Approach
Standard dense layer	`nn.Linear` (cuBLAS GEMM)
Complex contraction	`einsum` (readability)
Attention	`scaled_dot_product_attention` (FlashAttn)
Small batch / inference	mixed precision + compile
Custom op	Triton or CUDA kernel
Distributed training	tensor parallel (Megatron-style)

Default: PyTorch 2.x + bf16 + torch.compile + FlashAttention.

🔗 Graph

Parents: Linear-Algebra, Deep-Learning
Variants: Tensor-Operations, Einsum, Sparse-Matrices
Applications: Attention-Mechanism, Convolution, Transformer
Adjacent: GPU-Architecture, CUDA-Programming, Memory-Hierarchy, FlashAttention, Mixed-Precision-Training

🤖 LLM usage

When to use:

Writing/debugging einsum notation (checking for shape mismatches).
Custom matmul variants → first-draft Triton code.
Deciding whether an op is memory-bound or compute-bound.
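
An einsum spec produced by an LLM can be sanity-checked against operand shapes before any tensor is allocated. The function below is a hypothetical standard-library helper written for this note; it assumes an explicit `->` output and does not handle ellipsis or size-1 broadcasting.

```python
def einsum_output_shape(spec: str, *shapes: tuple) -> tuple:
    """Infer the output shape of an explicit einsum spec, or raise on mismatch."""
    inputs, output = spec.replace(" ", "").split("->")
    terms = inputs.split(",")
    if len(terms) != len(shapes):
        raise ValueError(f"{len(terms)} operands in spec, {len(shapes)} shapes given")
    sizes = {}
    for term, shape in zip(terms, shapes):
        if len(term) != len(shape):
            raise ValueError(f"'{term}' has {len(term)} indices, shape is {shape}")
        for label, size in zip(term, shape):
            # The first occurrence records the size; later ones must agree.
            if sizes.setdefault(label, size) != size:
                raise ValueError(f"index '{label}': size {sizes[label]} vs {size}")
    unknown = [label for label in output if label not in sizes]
    if unknown:
        raise ValueError(f"output indices {unknown} do not appear in any input")
    return tuple(sizes[label] for label in output)


shape = einsum_output_shape('bhqd,bhkd->bhqk', (2, 4, 16, 64), (2, 4, 32, 64))
print(shape)
```

A mismatched contraction dimension, for example a head_dim of 64 on one operand and 32 on the other, raises a `ValueError` that names the offending index.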

When not to use:

Exact FLOPs/memory measurements (use measurement tools instead).
Up-to-date cuBLAS/CUTLASS tuning parameters.

❌ Anti-patterns

Matmul in a Python loop (for i in range: C[i] = ...): on the order of 1000x slower.
Repeated matmul on non-contiguous tensors.
Insisting on fp32 only (Tensor Cores go unused).
Broadcasting occurring where it was not intended.
Many small matmul calls (kernel launch overhead).
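
The first anti-pattern is easy to reproduce on CPU. This NumPy sketch compares a triple Python loop with a single `A @ B` call; the matrix size is an arbitrary small example and the measured ratio will vary by machine, so no specific speedup is claimed here.

```python
import time

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 64))
B = rng.standard_normal((64, 64))


def matmul_loops(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Triple Python loop: the anti-pattern."""
    m, k = A.shape
    n = B.shape[1]
    C = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            for p in range(k):
                C[i, j] += A[i, p] * B[p, j]
    return C


t0 = time.perf_counter()
C_loop = matmul_loops(A, B)
t1 = time.perf_counter()
C_vec = A @ B                      # one call into the BLAS-backed GEMM
t2 = time.perf_counter()

print(f"loops: {t1 - t0:.4f}s  A @ B: {t2 - t1:.6f}s")
```

Both paths compute the same matrix; the difference is that the loop pays interpreter overhead on every scalar multiply-add while the single call runs the whole product in optimized native code.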

🧪 Verification / Duplicates

Verified against PyTorch 2.5 / CUDA 12.x. Confidence: A.

🕓 Changelog

Date	Change
2026-05-08	Phase 1
2026-05-10	Manual cleanup
