Memory Hierarchy

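One-line summary

"The faster the memory, the smaller and more expensive it is." Register → L1 → L2 → L3 → DRAM → SSD → network. As a rule of thumb, each level is roughly 10x slower and 10x larger than the one above it, and the bulk of practical optimization comes down to writing cache-friendly code.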

Key points

Hierarchy (CPU, as of 2026)

Level	Size	Latency	Throughput
Register	< 1 KB	<1ns	several TB/s
L1 cache	32-64 KB/core	~1ns	several TB/s
L2 cache	256KB-1MB/core	~3-5ns	hundreds of GB/s
L3 cache	32-128MB	~10-30ns	hundreds of GB/s
DRAM	64GB-1TB	~80-100ns	50-100 GB/s
NVMe SSD	TB	~10-100µs	7-14 GB/s
Network	unbounded	ms	Gb/s
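
To see why the upper levels dominate, the latencies in the table can be combined into an expected access time. The sketch below is illustrative only: the hit rates are made-up assumptions rather than measurements, `average_access_time_ns` is a hypothetical helper name, and each latency is treated as the full cost of an access served from that level.

```python
def average_access_time_ns(levels):
    """Expected latency of one access.

    levels: list of (latency_ns, hit_rate) from fastest to slowest.
    The last level should have hit_rate 1.0 so every access is served.
    """
    expected = 0.0
    reach = 1.0  # probability that an access misses every earlier level
    for latency_ns, hit_rate in levels:
        expected += reach * hit_rate * latency_ns
        reach *= 1.0 - hit_rate
    return expected


# Latencies are rough midpoints from the table; hit rates are assumptions.
levels = [
    (1.0, 0.95),   # L1
    (4.0, 0.80),   # L2
    (20.0, 0.75),  # L3
    (90.0, 1.0),   # DRAM
]
amat = average_access_time_ns(levels)
print(f"expected access time: {amat:.3f} ns")
```

Under these assumed hit rates, only 0.25% of accesses reach DRAM, yet they contribute about 0.225 ns of the total, more than the L2 or L3 terms.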

GPU hierarchy

Level	Location	Notes
Register	per thread	fastest
Shared memory / SMEM	shared within a block	~tens of KB
L1 / L2 cache	per SM / chip-wide	hardware-managed
HBM (global)	GPU package	A100 80GB, H100 80GB, B100 192GB
Host RAM	CPU side	reached over PCIe/NVLink
NVLink/InfiniBand	between GPUs	distributed training

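Principles

Locality: temporal (the same data is reused), spatial (neighboring addresses are accessed).
Cache line: 64 B (typical x86 CPU), 128 B (GPU memory transaction).
Coalescing: when adjacent threads access adjacent addresses, the accesses are served by a single transaction.
Bandwidth-bound vs compute-bound: distinguished with the roofline model.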

💻 Patterns

Pattern 1 - Cache-friendly loop (row-major)

for (int i = 0; i < N; i++)        // outer: rows
    for (int j = 0; j < N; j++)    // inner: contiguous memory
        sum += A[i*N + j];
// swapping the i and j loops makes cache misses surge
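
The effect of swapping the loops can be estimated without running any hardware counters. The sketch below counts how often consecutive accesses land on a different 64 B cache line; it is a crude proxy for misses (it ignores cache capacity and prefetching), and it assumes 4-byte elements and a line-aligned array. `line_switches` is a hypothetical helper name.

```python
LINE_BYTES = 64  # cache line size (typical x86)
ELEM_BYTES = 4   # assumption: 4-byte float elements


def line_switches(n, row_major):
    """Count accesses that touch a different cache line than the previous one."""
    switches = 0
    prev_line = None
    for outer in range(n):
        for inner in range(n):
            # Row-major order walks j fastest; the swapped order walks i fastest.
            i, j = (outer, inner) if row_major else (inner, outer)
            line = ((i * n + j) * ELEM_BYTES) // LINE_BYTES
            if line != prev_line:
                switches += 1
                prev_line = line
    return switches


row = line_switches(64, row_major=True)
col = line_switches(64, row_major=False)
print(row, col)
```

For a 64 x 64 array, the row-major order changes line once per 16 elements, while the swapped order changes line on every access.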

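Pattern 2 - Tiling (matmul)

for (int ii=0; ii<N; ii+=TILE)
  for (int jj=0; jj<N; jj+=TILE)
    for (int kk=0; kk<N; kk+=TILE)
      // keep TILE x TILE blocks in L1 (C += A*B, row-major, N % TILE == 0 assumed)
      for (int i=ii; i<ii+TILE; i++)
        for (int j=jj; j<jj+TILE; j++)
          for (int k=kk; k<kk+TILE; k++)
            C[i*N + j] += A[i*N + k] * B[k*N + j];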
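
A blocked loop nest is easy to get subtly wrong at the edges, so it helps to check it against the plain triple loop. The following is a minimal sketch under stated assumptions: square matrices stored as flat row-major lists, and hypothetical names `matmul_tiled` and `matmul_naive`. Unlike the C fragment, it clamps each tile with `min` so that `n` does not have to be a multiple of the tile size. It demonstrates correctness of the blocking only; pure Python will not show the cache benefit.

```python
def matmul_naive(a, b, n):
    """Reference n x n matrix product on flat row-major lists."""
    c = [0.0] * (n * n)
    for i in range(n):
        for j in range(n):
            c[i * n + j] = sum(a[i * n + k] * b[k * n + j] for k in range(n))
    return c


def matmul_tiled(a, b, n, tile):
    """Same product, visiting the matrices one tile x tile block at a time."""
    c = [0.0] * (n * n)
    for ii in range(0, n, tile):
        for jj in range(0, n, tile):
            for kk in range(0, n, tile):
                # Work confined to one block of A, B and C.
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        a_ik = a[i * n + k]
                        for j in range(jj, min(jj + tile, n)):
                            c[i * n + j] += a_ik * b[k * n + j]
    return c


n = 5
a = [float(x % 4) for x in range(n * n)]
b = [float((3 * x) % 7) for x in range(n * n)]
same = matmul_tiled(a, b, n, tile=2) == matmul_naive(a, b, n)
print(same)
```

The inputs are small integers stored as floats, so both orders of summation give exactly the same result and the comparison can use `==`.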

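Pattern 3 - CUDA Shared Memory

__shared__ float tile[32][32];            // one tile per thread block
tile[ty][tx] = A[row*N + (k*32 + tx)];    // each thread loads one element
__syncthreads();                          // wait until the whole tile is loaded
// each element is read from HBM once, then reused 32 times from shared memory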

Pattern 4 - Coalesced Access

// Good: thread i reads a[i] (adjacent addresses)
// Bad: thread i reads a[i*stride] (scattered addresses)
int idx = blockIdx.x * blockDim.x + threadIdx.x;
out[idx] = in[idx];  // coalesced
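
The cost of a strided pattern can be approximated by counting how many 128 B segments one warp touches. This is a simplified model, not a description of any specific GPU's memory pipeline: it assumes 32 threads per warp, 4-byte elements, and a base address aligned to 128 B. `transactions_per_warp` is a hypothetical helper name.

```python
def transactions_per_warp(stride, warp_size=32, elem_bytes=4, segment_bytes=128):
    """Distinct 128-byte segments touched when thread t reads a[t * stride]."""
    segments = {
        (t * stride * elem_bytes) // segment_bytes
        for t in range(warp_size)
    }
    return len(segments)


for stride in (1, 2, 8, 32):
    print(stride, transactions_per_warp(stride))
```

With stride 1 the whole warp fits in one segment; with stride 32 every thread lands in its own segment, so the same 128 useful bytes cost 32 transactions in this model.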

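Pattern 5 - Prefetching

__builtin_prefetch(&a[i+16], 0, 1);
// prefetch the next line ahead of use (0 = read, 1 = low temporal locality);
// pays off when the address is known early, e.g. lookahead in pointer chasing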

Pattern 6 - Roofline measurement

# arithmetic_intensity = FLOPs / bytes moved
# ridge = peak_flops / peak_bw
# ai < ridge → memory-bound, ai > ridge → compute-bound
ai = total_flops / bytes_transferred
attainable = min(peak_flops, peak_bw * ai)
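
The two formulas can be wrapped into a small helper that also reports which side of the ridge a kernel falls on. This is a sketch under stated assumptions: `roofline` is a hypothetical function name, and the peak numbers in the usage lines are placeholders, not the specs of any real device (take those from vendor documentation).

```python
def roofline(total_flops, bytes_transferred, peak_flops, peak_bw):
    """Return (arithmetic intensity, ridge point, attainable FLOP/s, regime)."""
    ai = total_flops / bytes_transferred       # FLOPs per byte moved
    ridge = peak_flops / peak_bw               # intensity where the two roofs meet
    attainable = min(peak_flops, peak_bw * ai)
    regime = "memory-bound" if ai < ridge else "compute-bound"
    return ai, ridge, attainable, regime


# Placeholder machine: 1000 FLOP/s peak compute, 100 B/s peak bandwidth.
low = roofline(total_flops=200, bytes_transferred=100, peak_flops=1000, peak_bw=100)
high = roofline(total_flops=2000, bytes_transferred=100, peak_flops=1000, peak_bw=100)
print(low)
print(high)
```

A kernel with intensity below the ridge cannot exceed `peak_bw * ai` no matter how well its arithmetic is tuned, which is why reducing bytes moved comes first for such kernels.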

Pattern 7 - FlashAttention (hierarchy-aware)

# process Q, K, V tiles in on-chip SRAM to avoid round trips to HBM
# softmax is also computed inside SRAM, using an online algorithm
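
The online softmax mentioned above can be illustrated for a single query on the CPU. The sketch below is not the FlashAttention kernel; it only shows the rescaling trick that lets softmax-weighted sums be accumulated block by block without holding all scores at once. The function names and the block size are illustrative assumptions.

```python
import math


def online_softmax_weighted_sum(scores, values, block=4):
    """Compute sum_i softmax(scores)[i] * values[i], one block at a time."""
    running_max = float("-inf")
    denom = 0.0  # running sum of exp(score - running_max)
    acc = 0.0    # running sum of exp(score - running_max) * value
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        new_max = max(running_max, max(s_blk))
        # Rescale the partial sums so they are relative to the new maximum.
        scale = math.exp(running_max - new_max)
        denom *= scale
        acc *= scale
        for s, v in zip(s_blk, v_blk):
            w = math.exp(s - new_max)
            denom += w
            acc += w * v
        running_max = new_max
    return acc / denom


def reference(scores, values):
    """Two-pass softmax that needs every score at once."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


scores = [0.5, 2.0, -1.0, 3.5, 0.0, 1.25, 4.0]
values = [1.0, -2.0, 0.5, 3.0, 4.0, -1.0, 2.0]
diff = abs(online_softmax_weighted_sum(scores, values, block=3) - reference(scores, values))
print(diff)
```

Only the running maximum, the denominator, and the accumulator persist between blocks, which is what allows the real kernel to keep its working set in SRAM.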

Decision criteria

Symptom	Cause / response
CPU at 50% utilization but slow	Memory bandwidth saturated → cache blocking (tiling)
High cache miss rate	Tiling, struct of arrays (SoA) layout
GPU achieved BW < 50% peak	Check coalescing
HBM bound	Kernel fusion (FlashAttention style)
Disk swap	Working set > RAM → reduce batch size
GPU OOM	Activation checkpointing, offload

Default: measure first (perf, ncu), then start with tiling or fusing the hot loop.

🔗 Graph

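🤖 Using LLMs

When to use:

Roofline analysis and bottleneck hypotheses.
First drafts of tiling transformations (CUDA/CPU).
Clues for debugging cache misses.

When not to use:

Exact hardware specs (vendor documentation is required).
Asserting optimization gains without measurement.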

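❌ Anti-patterns

Optimizing without measuring (you do not know whether the code is LLC-bound).
Column-major traversal of row-major data.
Pointer chasing (traversing a linked list is often around 10x slower than traversing an array).
Many small jobs on the GPU (kernel launch overhead dominates).
False sharing (different cores writing to the same cache line).
Ignoring hugepages / NUMA pinning.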

🧪 Verification / duplicates

Verified. Based on 2026-era H100/B100 and modern x86. Confidence: A.

🕓 Changelog

Date	Change
2026-05-08	Phase 1
2026-05-10	Manual cleanup
