id: wiki-2026-0508-matrix-operations-and-ai
title: Matrix Operations and AI
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: Matrix Ops, MatMul, GEMM, Tensor Ops
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: ai, ml, math, linear-algebra, gpu, performance
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language = python; framework = pytorch-numpy-jax
Matrix Operations and AI
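One-line summary
"Every model is, in the end, matrix multiplication." Transformers, CNNs, and RNNs are all graphs of GEMM (General Matrix Multiply) calls, and performance is largely determined by how well BLAS, cuBLAS, and Tensor Cores are utilized.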
Core
Core operations
MatMul (GEMM): C = A @ B, with A of shape (M, K) and B of shape (K, N). FLOPs = 2·M·N·K. The core of every dense layer.
Element-wise: ReLU, add, multiply. Memory-bound.
Reduction: sum/mean/max. The core of softmax and LayerNorm.
Broadcasting: automatic shape expansion (NumPy/PyTorch convention).
Einsum: einsum('bij,bjk->bik') expresses a batched matmul.
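The five operations listed above can be sketched in a few lines of NumPy. This is a minimal illustration, not a benchmark: the shapes (M, K, N) and the variable names are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, N = 64, 128, 32  # illustrative sizes, not from the source

A = rng.standard_normal((M, K)).astype(np.float32)
B = rng.standard_normal((K, N)).astype(np.float32)

# MatMul (GEMM): each of the M*N outputs needs K multiplies and K adds.
C = A @ B
gemm_flops = 2 * M * N * K

# Element-wise: one cheap op per element, so memory traffic dominates.
relu = np.maximum(C, 0.0)

# Reduction: softmax is a max/sum reduction combined with element-wise ops.
shifted = C - C.max(axis=-1, keepdims=True)   # broadcasting: (M, N) - (M, 1)
exp = np.exp(shifted)
softmax = exp / exp.sum(axis=-1, keepdims=True)

# Broadcasting: a (N,) bias is virtually stretched across all M rows.
bias = rng.standard_normal(N).astype(np.float32)
biased = C + bias

# Einsum: batched matmul written with explicit index labels.
X = rng.standard_normal((4, M, K)).astype(np.float32)
Y = rng.standard_normal((4, K, N)).astype(np.float32)
batched = np.einsum("bij,bjk->bik", X, Y)
```

The index `j` appears in both einsum inputs but not in the output, which is what marks it as the contracted (summed) dimension.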
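Applications
Attention: softmax(QK^T / √d) V. The formula itself contains two matmuls (QK^T and the product with V); a full attention layer adds four more for the Q, K, V, and output projections.
Conv2d: converted to a matmul via im2col, or computed with Winograd/FFT.
Embedding lookup: equivalent to a sparse matmul (one-hot @ W), implemented in practice as a row gather.
LayerNorm/RMSNorm: reduction + element-wise.
Mixture of Experts: grouped matmul (distributed routing).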
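Two of the applications above are easy to check by hand. The sketch below, written in NumPy with illustrative sizes and variable names (all assumptions), computes single-head attention directly from the formula and shows that an embedding lookup gives the same result as multiplying a one-hot matrix by the weight table.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Attention: softmax(Q K^T / sqrt(d)) V ---
T, d = 5, 8  # sequence length and head dimension (illustrative)
Q = rng.standard_normal((T, d))
K = rng.standard_normal((T, d))
V = rng.standard_normal((T, d))

scores = Q @ K.T / np.sqrt(d)                    # matmul: (T, d) @ (d, T)
scores -= scores.max(axis=-1, keepdims=True)     # stabilise before exp
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
attn_out = weights @ V                           # matmul: (T, T) @ (T, d)

# --- Embedding lookup: row gather == one-hot @ W ---
vocab, dim = 10, 4
W = rng.standard_normal((vocab, dim))
token_ids = np.array([3, 1, 3, 7])

one_hot = np.eye(vocab)[token_ids]               # (4, vocab)
via_matmul = one_hot @ W                         # dense matmul over mostly zeros
via_gather = W[token_ids]                        # what frameworks actually do
```

The gather avoids multiplying by zeros, which is why embedding layers index rows instead of building the one-hot matrix.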
💻 Patterns
Pattern 1 — Basic MatMul (PyTorch)
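A minimal sketch of this pattern, assuming small illustrative shapes: the `@` operator, `torch.bmm` for batches, and a check that `nn.Linear` is a matmul plus a bias.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

a = torch.randn(64, 128)   # (M, K)
b = torch.randn(128, 32)   # (K, N)

c = a @ b                  # same as torch.matmul(a, b); result is (M, N)

# Batched matmul: the leading dimension is treated as a batch.
xb = torch.randn(8, 64, 128)
yb = torch.randn(8, 128, 32)
cb = torch.bmm(xb, yb)     # (8, 64, 32)

# A dense layer is a matmul plus a bias: y = x @ W^T + b.
layer = nn.Linear(in_features=128, out_features=32)
y = layer(a)
y_manual = a @ layer.weight.T + layer.bias
```

`nn.Linear` stores its weight as (out_features, in_features), hence the transpose in the manual version.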
Pattern 2 — Einsum (explicit)
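A short sketch of explicit einsum notation in PyTorch. The tensor shapes and the multi-head score example are illustrative assumptions; the first call is the batched matmul and is checked against `torch.bmm`.

```python
import torch

torch.manual_seed(0)

a = torch.randn(4, 6, 8)    # (b, i, j)
b = torch.randn(4, 8, 5)    # (b, j, k)

# 'j' appears in both inputs but not in the output, so it is summed over.
c = torch.einsum("bij,bjk->bik", a, b)
c_ref = torch.bmm(a, b)

# Multi-head attention scores: contract the per-head feature dimension d.
q = torch.randn(2, 3, 7, 16)   # (batch, heads, query_len, d)
k = torch.randn(2, 3, 9, 16)   # (batch, heads, key_len, d)
scores = torch.einsum("bhqd,bhkd->bhqk", q, k)
scores_ref = q @ k.transpose(-2, -1)
```

Naming every axis in the subscript string makes the intended contraction visible, where the equivalent `transpose` plus `@` version leaves it implicit.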
Pattern 3 — Broadcasting (caution)
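A common way broadcasting goes wrong is combining a (N, 1) array with a (N,) array. The sketch below uses NumPy (PyTorch follows the same broadcasting rules); `checked_sub` is a hypothetical helper written for this example, not a library function.

```python
import numpy as np

pred = np.array([[1.0], [2.0], [3.0]])   # shape (3, 1), e.g. a model output
target = np.array([1.0, 2.0, 3.0])       # shape (3,)

# Pitfall: (3, 1) - (3,) broadcasts to (3, 3), so this is NOT a per-sample error.
wrong = pred - target
wrong_mse = (wrong ** 2).mean()

# Fix: make the shapes match explicitly before combining.
right = pred.squeeze(-1) - target        # (3,) - (3,) -> (3,)
right_mse = (right ** 2).mean()


def checked_sub(a, b):
    """Subtract only when shapes match exactly; raise instead of broadcasting."""
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {a.shape} vs {b.shape}")
    return a - b
```

Here the predictions equal the targets, so the correct error is zero, yet the silently broadcast version reports a nonzero mean squared error.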
Pattern 4 — Mixed Precision (Tensor Core)
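A minimal sketch of mixed precision with `torch.autocast`, assuming bf16 and a single `nn.Linear` as a stand-in for a real model. On GPUs with Tensor Cores the bf16 matmul can be dispatched to them; on CPU the same code runs but only demonstrates the dtype behaviour.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(256, 128).to(device)
x = torch.randn(32, 256, device=device)

# Inside autocast, matmul-heavy ops run in bf16 while parameters stay in fp32.
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    y_bf16 = model(x)

y_fp32 = model(x)  # full-precision reference outside the autocast region

# bf16 keeps the fp32 exponent range but fewer mantissa bits, so expect a
# small numerical difference rather than an exact match.
max_abs_diff = (y_bf16.float() - y_fp32).abs().max().item()
```

Because bf16 shares the fp32 exponent range, it usually does not need the loss scaling that fp16 training relies on.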
Pattern 5 — Fused Kernel (FlashAttention)
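A sketch of the fused path using `torch.nn.functional.scaled_dot_product_attention`, compared against an unfused reference. Shapes are illustrative assumptions. Which backend runs is decided by PyTorch: FlashAttention-style kernels generally need a supported GPU and half-precision inputs, so on CPU or in fp32 this falls back to another implementation with the same result.

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)

B, H, T, D = 2, 4, 16, 32   # batch, heads, sequence length, head dim
q = torch.randn(B, H, T, D)
k = torch.randn(B, H, T, D)
v = torch.randn(B, H, T, D)

# Fused path: a single call; the backend is selected by PyTorch.
fused = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Unfused reference: materialises the full (T, T) score matrix in memory.
scores = q @ k.transpose(-2, -1) / math.sqrt(D)
causal_mask = torch.ones(T, T, dtype=torch.bool).tril()
scores = scores.masked_fill(~causal_mask, float("-inf"))
reference = torch.softmax(scores, dim=-1) @ v
```

The benefit of the fused kernel is that the (T, T) score matrix never has to be written out to memory in full, which matters most for long sequences.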
Pattern 6 — Memory Layout (contiguous)
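A small sketch of how a transpose produces a non-contiguous view and how `.contiguous()` materialises it once. Shapes are illustrative. Whether the copy actually speeds up a later matmul depends on the op and backend (BLAS libraries can often handle a plain transpose natively), so treat this as something to profile, not a rule.

```python
import torch

torch.manual_seed(0)

x = torch.randn(4, 8)
xt = x.t()                      # a view: same storage, swapped strides

# xt has shape (8, 4) with strides (1, 8), so it is not laid out row-major.
was_contiguous = xt.is_contiguous()

# .contiguous() copies the data into row-major order once; reuse the result
# instead of passing the strided view to matmul inside a loop.
xt_c = xt.contiguous()

w = torch.randn(4, 3)
out_view = xt @ w               # works, but may trigger an internal copy
out_copy = xt_c @ w             # same values from the contiguous tensor
```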
Pattern 7 — torch.compile (kernel fusion)
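A minimal sketch of `torch.compile` applied to a chain of element-wise ops, which is the kind of code a compiler backend can fuse into fewer kernels. The function `bias_gelu_scale` and its shapes are hypothetical, chosen only for illustration. Compilation needs a working backend toolchain, so the compiled call is kept inside `main()`.

```python
import torch
import torch.nn.functional as F


def bias_gelu_scale(x, bias, scale):
    """Three element-wise ops that eager mode runs as separate kernels."""
    return F.gelu(x + bias) * scale


def main():
    torch.manual_seed(0)
    x = torch.randn(1024, 1024)
    bias = torch.randn(1024)

    # torch.compile wraps the function; tracing and code generation happen
    # lazily on the first call, so that call is slower than later ones.
    compiled = torch.compile(bias_gelu_scale)

    eager_out = bias_gelu_scale(x, bias, 0.5)
    compiled_out = compiled(x, bias, 0.5)

    # The two paths should agree up to floating-point rounding.
    print(torch.allclose(eager_out, compiled_out, atol=1e-5))


if __name__ == "__main__":
    main()
```

Fusion helps most for memory-bound chains like this one, since the intermediate results no longer make a round trip through memory between ops.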
Pattern 8 — JAX/XLA
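A short sketch of the JAX/XLA route: `jax.jit` hands a whole function to XLA for compilation, and `jax.vmap` batches a per-example function without a Python loop. The layer `mlp_layer` and the shapes are assumptions made for this example.

```python
import jax
import jax.numpy as jnp


@jax.jit
def mlp_layer(x, w, b):
    """Dense layer + ReLU; jit lets XLA compile the function as one unit."""
    return jnp.maximum(x @ w + b, 0.0)


key = jax.random.PRNGKey(0)
kx, kw = jax.random.split(key)
x = jax.random.normal(kx, (32, 128))
w = jax.random.normal(kw, (128, 64))
b = jnp.zeros((64,))

y = mlp_layer(x, w, b)          # first call traces and compiles
y_again = mlp_layer(x, w, b)    # same shapes/dtypes reuse the compiled code


def per_example(v):
    """Affine map for a single (128,) input vector."""
    return v @ w + b


# vmap maps per_example over the leading axis of x.
batched = jax.vmap(per_example)(x)
```

Calling the jitted function with a new input shape or dtype triggers a fresh trace and compile, so stable shapes keep the cache effective.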
Decision criteria

| Situation | Approach |
| --- | --- |
| Standard dense layer | nn.Linear (cuBLAS GEMM) |
| Complex contraction | einsum (readability) |
| Attention | scaled_dot_product_attention (FlashAttention) |
| Small batch / inference | mixed precision + compile |
| Custom op | Triton or CUDA kernel |
| Distributed training | tensor parallel (Megatron-style) |

Default: PyTorch 2.x + bf16 + torch.compile + FlashAttention.
🔗 Graph
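🤖 LLM usage
When to use:
Writing and debugging einsum notation (checking for shape mismatches).
Drafting Triton code for custom matmul variants.
Reasoning about whether a workload is memory-bound or compute-bound.
When not to use:
Exact FLOPs or memory measurements (use profiling tools instead).
Up-to-date cuBLAS/CUTLASS tuning parameters.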
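❌ Anti-patterns
Matmul written as a Python loop (for i in range: C[i] = ...): often orders of magnitude slower than a single GEMM call.
Repeated matmul on a non-contiguous tensor.
Insisting on fp32 only (Tensor Cores go unused).
Broadcasting that happens where it was not intended.
Many small matmul calls (kernel launch overhead).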
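Two of the anti-patterns above have direct fixes that can be sketched in NumPy: replace the Python loop with one GEMM call, and stack many small matmuls into a single batched call. Sizes and names are illustrative assumptions, and the example checks correctness only; actual speedups should be measured on the target hardware.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((16, 32))
B = rng.standard_normal((32, 8))


def matmul_python_loop(a, b):
    """Anti-pattern: every scalar multiply-add goes through the interpreter."""
    m, k = a.shape
    _, n = b.shape
    out = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            for p in range(k):
                out[i, j] += a[i, p] * b[p, j]
    return out


slow = matmul_python_loop(A, B)
fast = A @ B                                  # one call into the BLAS GEMM

# Many small matmuls: stack the inputs and issue one batched call instead.
xs = [rng.standard_normal((4, 32)) for _ in range(100)]
one_by_one = [x @ B for x in xs]              # 100 separate calls
stacked = np.stack(xs) @ B                    # one (100, 4, 32) @ (32, 8) call
```

On a GPU the batched form also replaces 100 kernel launches with one, which is where the launch overhead named above comes from.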
🧪 Verification / duplicates
Verified against PyTorch 2.5 / CUDA 12.x. Trust level A.
🕓 Changelog

| Date | Change |
| --- | --- |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup |