id: wiki-2026-0508-matrix-operations-and-ai
title: Matrix Operations and AI
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: Matrix Ops, MatMul, GEMM, Tensor Ops
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: ai, ml, math, linear-algebra, gpu, performance
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language python, framework pytorch-numpy-jax

Matrix Operations and AI

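One-line summary

"Every model ultimately comes down to matrix multiplication." Transformers, CNNs, and RNNs are all graphs of GEMM (General Matrix Multiply) calls, and performance is largely determined by how well BLAS, cuBLAS, and Tensor Cores are utilized.

Core

Core operations

  • MatMul (GEMM): C = A @ B. FLOPs = 2·M·N·K for an (M, K) @ (K, N) product. The essence of every dense layer.
  • Element-wise: ReLU, add, multiply. Memory-bound.
  • Reduction: sum/mean/max. The core of Softmax and LayerNorm.
  • Broadcasting: automatic shape expansion (NumPy/PyTorch convention).
  • Einsum: einsum('bij,bjk->bik') expresses a batched matmul.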
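
The FLOP count and the einsum notation in the list above can be checked with a small NumPy sketch. The helper names `gemm_flops` and `gemm_intensity` are hypothetical, and the intensity figure assumes the idealized case where A, B, and C each cross memory exactly once.

```python
import numpy as np

def gemm_flops(m: int, n: int, k: int) -> int:
    """Multiply-add count for an (m, k) @ (k, n) product: 2*m*n*k."""
    return 2 * m * n * k

def gemm_intensity(m: int, n: int, k: int, bytes_per_elem: int = 4) -> float:
    """FLOPs per byte, assuming A, B and C each cross memory exactly once."""
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return gemm_flops(m, n, k) / bytes_moved

# Shapes from Pattern 1: (128, 256) @ (256, 512)
flops = gemm_flops(128, 512, 256)
intensity = gemm_intensity(128, 512, 256)

# A batched matmul written three equivalent ways.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3, 5))
B = rng.standard_normal((4, 5, 2))
via_einsum = np.einsum('bij,bjk->bik', A, B)
via_matmul = A @ B
via_loop = np.stack([A[i] @ B[i] for i in range(A.shape[0])])
```

Under these assumptions the Pattern 1 product does about 36.6 FLOPs per byte in fp32, whereas an fp32 element-wise add does one FLOP per 12 bytes (two reads, one write). This gap is why element-wise operations are listed as memory-bound.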

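Applications

  1. Attention: softmax(QK^T / √d) V. Two matmuls in the core expression (QK^T, then the product with V), plus the Q/K/V and output projections.
  2. Conv2d: lowered to a matmul via im2col, or computed with Winograd/FFT.
  3. Embedding lookup: mathematically a sparse matmul (one-hot @ W), implemented as a row gather.
  4. LayerNorm/RMSNorm: reduction + element-wise.
  5. Mixture of Experts: grouped matmul (with distributed routing).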
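
Two of the applications above are easy to verify numerically. The sketch below is a plain NumPy reference, not a framework implementation: the core attention expression contains two matrix products (the Q/K/V and output projections are separate GEMMs and are omitted here), and the one-hot product is equivalent to a row gather. All shapes and names are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the row max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d)  # matmul 1: (..., q, k)
    weights = softmax(scores, axis=-1)                # reduction + element-wise
    return weights @ V                                # matmul 2: (..., q, d_v)

rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 4, 8))
K = rng.standard_normal((2, 6, 8))
V = rng.standard_normal((2, 6, 8))
out = attention(Q, K, V)

# Embedding lookup: one-hot @ W is mathematically a row gather.
W = rng.standard_normal((10, 3))
ids = np.array([7, 0, 7, 2])
one_hot = np.eye(10)[ids]
via_matmul = one_hot @ W
via_gather = W[ids]
```

The gather form avoids materializing the one-hot matrix, which is presumably why frameworks implement embedding layers as index lookups rather than as dense products.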

💻 Patterns

Pattern 1 — Basic MatMul (PyTorch)

import torch
A = torch.randn(128, 256, device='cuda')
B = torch.randn(256, 512, device='cuda')
C = A @ B  # or torch.matmul(A, B)
# Batched: (B, M, K) @ (B, K, N) -> (B, M, N)

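Pattern 2 — Einsum (explicit)

Q = torch.randn(2, 8, 16, 64)  # (batch, heads, seq_q, head_dim), example shapes
K = torch.randn(2, 8, 32, 64)  # (batch, heads, seq_k, head_dim)
# Attention scores: (batch, heads, seq_q, seq_k)
scores = torch.einsum('bhqd,bhkd->bhqk', Q, K)
# Naming every axis guards against transpose mistakes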

Pattern 3 — Broadcasting pitfalls

a = torch.randn(32, 1, 128)
b = torch.randn(1, 64, 128)
c = a + b  # (32, 64, 128); size-1 axes expand without an error, so an unintended shape becomes a silent bug
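
A common way this bites in practice is subtracting a flat label vector from a column-shaped prediction. The NumPy sketch below reproduces that case and adds a strict guard; `checked_sub` is a hypothetical helper, not a library function.

```python
import numpy as np

pred = np.zeros((32, 1))    # model output with a trailing singleton axis
target = np.ones(32)        # labels stored as a flat vector

diff = pred - target        # broadcasts to (32, 32) instead of raising

def checked_sub(a, b):
    """Subtract only when shapes match exactly; refuse to broadcast."""
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {a.shape} vs {b.shape}")
    return a - b

fixed = checked_sub(pred.squeeze(-1), target)  # shape (32,)
```

A loss averaged over `diff` would still produce a plausible-looking scalar, which is what makes the bug silent. Asserting the expected shape at the boundary is a cheap defence.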

Pattern 4 — Mixed Precision (Tensor Core)

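# model and x are assumed to already live on the GPU
with torch.autocast('cuda', dtype=torch.bfloat16):
    out = model(x)  # GEMMs run in bf16 and accumulate in fp32
# Roughly 2-8x throughput on A100/H100; the actual gain depends on shapes and on the fp32/TF32 baseline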
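
Why the accumulator stays in fp32 can be shown on the CPU. NumPy has no bf16 type, so this sketch uses fp16 as a stand-in (an assumption for illustration only); bf16 has fewer mantissa bits than fp16, so the same effect appears even earlier.

```python
import numpy as np

values = np.ones(4096, dtype=np.float16)

# Accumulate in fp16: once the sum reaches 2048 the spacing between
# representable fp16 values is 2, so adding 1 rounds back to 2048.
acc_half = np.float16(0)
for v in values:
    acc_half = np.float16(acc_half + v)

# Keep the fp16 inputs but accumulate in fp32.
acc_wide = np.float32(0)
for v in values:
    acc_wide += np.float32(v)

print(float(acc_half), float(acc_wide))  # 2048.0 4096.0
```

The low-precision accumulator loses half of the sum, while the wide accumulator is exact. A long dot product inside a GEMM is the same kind of running sum.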

Pattern 5 — Fused Kernel (FlashAttention)

from torch.nn.functional import scaled_dot_product_attention
out = scaled_dot_product_attention(Q, K, V, is_causal=True)
# When a fused backend is selected, Q@K^T → softmax → @V runs in one pass in SRAM (no HBM round trips)
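
The numerical idea that lets a fused kernel avoid materializing the full score matrix is a running ("online") softmax over blocks of K and V. The NumPy sketch below illustrates only that arithmetic, under simplifying assumptions: a single head, no causal mask, and an arbitrary block size. It is not the actual kernel.

```python
import numpy as np

def attention_reference(Q, K, V):
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ V

def attention_blocked(Q, K, V, block=4):
    """Process K/V in blocks, keeping only running statistics per query row."""
    n_q, d = Q.shape
    out = np.zeros((n_q, V.shape[-1]))
    row_max = np.full((n_q, 1), -np.inf)
    row_sum = np.zeros((n_q, 1))
    for start in range(0, K.shape[0], block):
        k_blk = K[start:start + block]
        v_blk = V[start:start + block]
        s = Q @ k_blk.T / np.sqrt(d)                    # scores for this block only
        new_max = np.maximum(row_max, s.max(axis=-1, keepdims=True))
        scale = np.exp(row_max - new_max)               # rescale earlier partial results
        p = np.exp(s - new_max)
        row_sum = row_sum * scale + p.sum(axis=-1, keepdims=True)
        out = out * scale + p @ v_blk
        row_max = new_max
    return out / row_sum

rng = np.random.default_rng(0)
Q = rng.standard_normal((5, 8))
K = rng.standard_normal((10, 8))
V = rng.standard_normal((10, 8))
full = attention_reference(Q, K, V)
tiled = attention_blocked(Q, K, V, block=4)
```

Only a (queries × block) tile of scores exists at any time, so the working set can stay in fast on-chip memory while the result matches the unfused reference.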

Pattern 6 — Memory Layout (contiguous)

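x = x.transpose(1, 2).contiguous()  # copies the data into a row-major layout
# Non-contiguous inputs can force an implicit copy or a slower kernel; a plain transpose is often absorbed by the GEMM itself, so profile first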
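
What "contiguous" means can be seen by inspecting strides. The sketch uses NumPy, whose view and stride semantics are analogous to PyTorch's for this purpose; `np.ascontiguousarray` plays the role of `.contiguous()`.

```python
import numpy as np

x = np.arange(24, dtype=np.float32).reshape(2, 3, 4)
xt = x.transpose(0, 2, 1)        # a view: same buffer, permuted strides
xc = np.ascontiguousarray(xt)    # a copy laid out in row-major order

strides = (x.strides, xt.strides, xc.strides)
shares_buffer = np.shares_memory(x, xt)
is_copy = not np.shares_memory(x, xc)
```

The transpose itself is free because it only permutes strides. The cost is paid either when the copy is made or when a kernel has to walk memory with a non-unit stride.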

Pattern 7 — torch.compile (kernel fusion)

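import torch
import torch.nn.functional as F
W1, W2 = torch.randn(256, 1024), torch.randn(1024, 256)  # example weights
@torch.compile
def block(x): return F.gelu(x @ W1) @ W2
# Inductor fuses element-wise ops around the GEMMs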

Pattern 8 — JAX/XLA

import jax
import jax.numpy as jnp
@jax.jit
def fwd(x, W): return jnp.einsum('bd,dk->bk', x, W)

Decision criteria

| Situation | Approach |
| --- | --- |
| Standard dense layer | nn.Linear (cuBLAS GEMM) |
| Complex contraction | einsum (readability) |
| Attention | scaled_dot_product_attention (FlashAttn) |
| Small batch / inference | mixed precision + compile |
| Custom op | Triton or CUDA kernel |
| Distributed training | tensor parallel (megatron-style) |

Default: PyTorch 2.x + bf16 + torch.compile + FlashAttention.

🔗 Graph

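🤖 Using LLMs

When to use:

  • Writing or debugging einsum notation (checking for shape mismatches).
  • Drafting Triton code for custom matmul variants.
  • Reasoning about whether a workload is memory-bound or compute-bound.

When not to use:

  • Exact FLOPs or memory measurements (use a profiler instead).
  • The latest cuBLAS/CUTLASS tuning parameters.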

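Anti-patterns

  • Implementing matmul with Python loops (for i in range: C[i] = ...): typically orders of magnitude slower than a single library call.
  • Repeated matmuls on non-contiguous tensors.
  • Insisting on fp32 only (Tensor Cores go unused).
  • Broadcasting that happens where it was not intended.
  • Many small matmul calls (kernel launch overhead).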
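
The first and last anti-patterns have direct vectorized replacements. The NumPy sketch below checks equivalence only; no timings are claimed, and `matmul_loops` is a hypothetical reference written for comparison.

```python
import numpy as np

def matmul_loops(A, B):
    """Triple-loop reference: correct, but every multiply-add runs in the interpreter."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            for p in range(k):
                C[i, j] += A[i, p] * B[p, j]
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5))
B = rng.standard_normal((5, 7))
C_loop = matmul_loops(A, B)
C_vec = A @ B                      # one library call

# Many small products: stack them and issue one batched call instead.
As = rng.standard_normal((64, 4, 4))
Bs = rng.standard_normal((64, 4, 4))
one_by_one = np.stack([a @ b for a, b in zip(As, Bs)])
batched = As @ Bs
```

On a GPU the batched form matters even more, since each separate call pays its own kernel launch overhead.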

🧪 Verification / duplicates

  • Verified against PyTorch 2.5 / CUDA 12.x. Trust level A.

🕓 Changelog

| Date | Change |
| --- | --- |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup |