---
id: wiki-2026-0508-memory-hierarchy
title: Memory Hierarchy
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Cache Hierarchy, Memory Pyramid, Storage Hierarchy]
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: [systems, performance, hardware, gpu, cache, memory]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: { language: c-cpp-cuda, framework: systems }
---

# Memory Hierarchy

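## One-liner
> **"The faster the memory, the smaller and more expensive it is."** Register → L1 → L2 → L3 → DRAM → SSD → network. As a rule of thumb, each step down is roughly 10x slower and 10x larger (the DRAM → SSD jump is closer to 100-1000x in latency), and most optimization work comes down to writing cache-friendly code.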

## Core
### Hierarchy (CPU, as of 2026)
| Level | Size | Latency | Throughput |
|---|---|---|---|
| Register | <1 KB | <1ns | several TB/s |
| L1 cache | 32-64 KB/core | ~1ns | several TB/s |
| L2 cache | 256KB-1MB/core | ~3-5ns | hundreds of GB/s |
| L3 cache | 32-128MB | ~10-30ns | hundreds of GB/s |
| DRAM | 64GB-1TB | ~80-100ns | 50-100 GB/s |
| NVMe SSD | TB | ~10-100µs | 7-14 GB/s |
| Network | ∞ | ms | Gb/s |
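
To see how the latencies in the table combine, a rough average memory access time (AMAT) can be computed. The sketch below is illustrative only: the function name `amat_ns` and the hit rates in the usage note are assumptions, and the latencies are taken from the middle of the ranges in the table, not from measurements.

```python
def amat_ns(levels, memory_latency_ns):
    """Average memory access time for a chain of caches.

    levels: list of (hit_rate, latency_ns), ordered from L1 outward.
    A level is only consulted by accesses that missed every earlier level.
    """
    total = 0.0
    reach = 1.0  # fraction of accesses that get as far as this level
    for hit_rate, latency_ns in levels:
        total += reach * latency_ns   # every access reaching this level pays its latency
        reach *= 1.0 - hit_rate       # only the misses continue outward
    return total + reach * memory_latency_ns


# Hypothetical hit rates: 95% in L1, 80% in L2, 50% in L3, DRAM at 90 ns.
example = amat_ns([(0.95, 1.0), (0.80, 4.0), (0.50, 20.0)], 90.0)
```

With these assumed hit rates the average stays under 2 ns even though DRAM costs 90 ns, which is why a small drop in L1 hit rate is so visible in practice.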

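### GPU hierarchy
| Level | Location | Notes |
|---|---|---|
| Register | Per thread | Fastest |
| Shared memory / SMEM | Shared within a block | Tens of KB per block, software-managed |
| L1 / L2 cache | Per SM (L1) / device-wide (L2) | Hardware-managed |
| HBM (global) | On the GPU package | A100 80GB, H100 80GB, B100 192GB |
| Host RAM | CPU side | Reached over PCIe/NVLink |
| NVLink/InfiniBand | Between GPUs | Distributed training |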

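### Principles
- **Locality**: temporal (the same data is reused soon) and spatial (neighboring addresses are accessed together).
- **Cache line**: 64B on typical CPUs; 128B memory transactions on GPUs.
- **Coalescing**: when adjacent threads access adjacent addresses, the accesses are merged into a single transaction.
- **Bandwidth-bound vs compute-bound**: distinguished with the roofline model.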

## 💻 Patterns

### Pattern 1 — Cache-friendly loop (row-major)
```c
for (int i = 0; i < N; i++)        // outer: rows
    for (int j = 0; j < N; j++)    // inner: walks contiguous memory
        sum += A[i*N + j];
// Swapping the i and j loops makes each access jump N elements, and cache misses spike
```
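
The effect of loop order can be reproduced with a toy cache model. The sketch below is an assumption-laden illustration, not a model of any real CPU: it simulates a direct-mapped cache with 64B lines and 512 slots (32 KB), whereas real caches are set-associative, and the helper names are made up for this example.

```python
def count_misses(addresses, line_bytes=64, num_lines=512):
    """Count misses in a toy direct-mapped cache for a sequence of byte addresses."""
    cache = [None] * num_lines      # cache[slot] = line number currently held
    misses = 0
    for addr in addresses:
        line = addr // line_bytes
        slot = line % num_lines
        if cache[slot] != line:     # cold or conflict miss: fetch the line
            cache[slot] = line
            misses += 1
    return misses


def row_major_order(n, elem_bytes=4):
    """Byte addresses touched when j is the inner loop (contiguous)."""
    return [(i * n + j) * elem_bytes for i in range(n) for j in range(n)]


def column_major_order(n, elem_bytes=4):
    """Byte addresses touched when i is the inner loop (stride of n elements)."""
    return [(i * n + j) * elem_bytes for j in range(n) for i in range(n)]


row_misses = count_misses(row_major_order(256))
col_misses = count_misses(column_major_order(256))
```

For a 256 x 256 matrix of 4-byte elements, this model reports 4096 misses for the row-major walk (one per cache line) and 65536 for the column-major walk (every access misses). The power-of-two stride makes the toy cache look worse than a set-associative one would, but the direction of the effect is the same.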

### Pattern 2 — Tiling (matmul)
```c
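// C += A * B for row-major N x N matrices; TILE is chosen so the working tiles stay in L1
for (int ii = 0; ii < N; ii += TILE)
  for (int jj = 0; jj < N; jj += TILE)
    for (int kk = 0; kk < N; kk += TILE)
      for (int i = ii; i < ii + TILE && i < N; i++)
        for (int k = kk; k < kk + TILE && k < N; k++)
          for (int j = jj; j < jj + TILE && j < N; j++)
            C[i*N + j] += A[i*N + k] * B[k*N + j];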
```
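
A common starting point for choosing the tile size is to require that the tiles in flight fit in L1 together. The sketch below encodes that heuristic under stated assumptions: three square tiles (one each for the two inputs and the output), no allowance for associativity or other data sharing the cache, and hypothetical helper names. Treat the result as a first guess to benchmark, not a tuned value.

```python
import math


def max_square_tile(cache_bytes, elem_bytes=4, num_tiles=3):
    """Largest T such that num_tiles tiles of T x T elements fit in cache_bytes."""
    return math.isqrt(cache_bytes // (num_tiles * elem_bytes))


def round_down_pow2(t):
    """Largest power of two that is <= t (t must be >= 1)."""
    return 1 << (t.bit_length() - 1)


# 32 KB L1, 4-byte floats, three tiles resident at once.
t_max = max_square_tile(32 * 1024)
t_pow2 = round_down_pow2(t_max)
```

For a 32 KB L1 and 4-byte floats this gives an upper bound of 52, or 32 after rounding down to a power of two.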

### Pattern 3 — CUDA Shared Memory
```cuda
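__shared__ float tile[32][32];            // one 32x32 tile per thread block, in on-chip SMEM
tile[ty][tx] = A[row*N + (k*32 + tx)];    // each thread loads one element of the k-th tile
__syncthreads();                          // wait until the whole tile is loaded before reading it
// Each element is read from HBM once and then reused 32 times from shared memory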
```
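
The "read once, reuse 32 times" claim can be checked with simple counting. The sketch below compares global-memory element loads for a naive matmul kernel against a shared-memory tiled one. It is a counting model only: it assumes square matrices whose size is a multiple of the tile, ignores any help from the hardware L1/L2 caches, and `global_loads` is a hypothetical helper name.

```python
def global_loads(n, tile=None):
    """Element loads from global memory for an n x n matmul.

    tile=None models a naive kernel: each of the n*n outputs reads
    n elements of A and n elements of B directly from global memory.
    With a tile size, each block of tile x tile outputs stages
    n / tile tiles of A and of B through shared memory instead.
    """
    if tile is None:
        return 2 * n ** 3
    if n % tile != 0:
        raise ValueError("n must be a multiple of tile in this simple model")
    blocks = (n // tile) ** 2
    loads_per_block = 2 * (n // tile) * tile * tile
    return blocks * loads_per_block


naive = global_loads(1024)
tiled = global_loads(1024, tile=32)
reduction = naive // tiled
```

In this model the reduction factor equals the tile width, so a 32 x 32 tile cuts global loads by 32x.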

### Pattern 4 — Coalesced Access
```cuda
// Good: thread i reads a[i] (contiguous addresses across the warp)
// Bad: thread i reads a[i*stride] (scattered addresses)
int idx = blockIdx.x * blockDim.x + threadIdx.x;
out[idx] = in[idx];  // coalesced
```
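
How much a stride costs can be estimated by counting the 128B segments a warp touches. This is a simplified model under stated assumptions: a 32-lane warp, 4-byte elements, a base address aligned to 128B, and one transaction per distinct segment. Real GPUs work at finer granularity, so treat the numbers as relative, and the function name is made up for this sketch.

```python
def transactions_per_warp(stride_elems, elem_bytes=4, warp_size=32, segment_bytes=128):
    """Distinct aligned segments touched when lane i reads element i * stride_elems."""
    segments = {
        (lane * stride_elems * elem_bytes) // segment_bytes
        for lane in range(warp_size)
    }
    return len(segments)


unit_stride = transactions_per_warp(1)    # a[i]
strided = transactions_per_warp(32)       # a[i * 32]
```

With unit stride the whole warp is served by one segment; with a stride of 32 floats every lane lands in its own segment, so the same 128 useful bytes cost 32 transactions.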

### Pattern 5 — Prefetching
```c
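__builtin_prefetch(&a[i+16], 0, 1);  // args: address, 0 = read, 1 = low temporal locality
// Fetches data 16 elements ahead of use. Helps most when the address is irregular but
// computable early (e.g. a[idx[i+16]]); in pure pointer chasing the next address is
// not known in advance, so the benefit is limited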
```

### Pattern 6 — Roofline measurement
```python
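# arithmetic intensity (ai) = FLOPs / bytes moved to and from memory
# ridge = peak_flops / peak_bw; ai < ridge → memory-bound, ai > ridge → compute-bound
ai = total_flops / bytes_transferred
attainable_flops = min(peak_flops, peak_bw * ai)  # performance ceiling at this ai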
```
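
A worked example makes the ridge point concrete. The sketch below wraps the formula in a function and applies it to two kernels. The machine numbers (10 TFLOP/s, 1 TB/s) are hypothetical round values chosen for easy arithmetic, not the specs of any real part, and the byte counts assume each operand is moved exactly once.

```python
def roofline(flops, bytes_moved, peak_flops, peak_bw):
    """Return (arithmetic intensity, attainable FLOP/s, bound) for one kernel."""
    ai = flops / bytes_moved
    ridge = peak_flops / peak_bw
    attainable = min(peak_flops, peak_bw * ai)
    bound = "memory-bound" if ai < ridge else "compute-bound"
    return ai, attainable, bound


PEAK_FLOPS = 10e12   # hypothetical: 10 TFLOP/s
PEAK_BW = 1e12       # hypothetical: 1 TB/s, so the ridge is at 10 FLOP/byte

n = 1024
# SAXPY on n floats: 2 FLOPs per element; read x, read y, write y = 12 bytes per element.
saxpy = roofline(2 * n, 12 * n, PEAK_FLOPS, PEAK_BW)
# n x n float matmul: 2*n^3 FLOPs; ideal traffic is three matrices of 4*n*n bytes.
matmul = roofline(2 * n ** 3, 3 * 4 * n * n, PEAK_FLOPS, PEAK_BW)
```

SAXPY lands at about 0.17 FLOP/byte, far below the ridge, so no amount of compute tuning helps it; the matmul at about 171 FLOP/byte is compute-bound, provided tiling actually achieves the ideal traffic.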

### Pattern 7 — FlashAttention (hierarchy-aware)
```python
# Process Q, K, V tile by tile in on-chip SRAM, avoiding round trips to HBM
# Softmax is computed inside SRAM as well, using an online (streaming) algorithm
```
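
The online softmax idea can be shown on a single query row. The sketch below is a scalar, pure-Python illustration of the streaming rescaling trick, not the FlashAttention kernel itself: it assumes the scores for one query are already available as a list, uses scalar values instead of vectors, and the function names and block size are made up for this example.

```python
import math


def attention_reference(scores, values):
    """Two-pass softmax-weighted sum: needs every score before producing output."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def attention_online(scores, values, block=4):
    """Same result computed one block at a time with O(1) running state."""
    m = float("-inf")   # running maximum of the scores seen so far
    denom = 0.0         # running sum of exp(score - m)
    acc = 0.0           # running sum of exp(score - m) * value
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        m_new = max(m, max(s_blk))
        scale = math.exp(m - m_new)   # rescale old partial sums; 0.0 on the first block
        denom *= scale
        acc *= scale
        for s, v in zip(s_blk, v_blk):
            w = math.exp(s - m_new)
            denom += w
            acc += w * v
        m = m_new
    return acc / denom


scores = [0.5, 2.0, -1.0, 3.5, 0.0, 1.25, 4.0, -2.5, 0.75, 2.25]
values = [1.0, -2.0, 0.5, 3.0, 4.0, -1.0, 2.0, 0.0, 1.5, -0.5]
```

Because only `m`, `denom`, and `acc` persist between blocks, the full row of scores never has to be written out and read back, which is the property that lets the real kernel keep its working set in SRAM.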

## Decision criteria
| Symptom | Cause / response |
|---|---|
| CPU utilization ~50% but slow | Memory bandwidth saturated → cache blocking (tiling) |
| High cache-miss rate | Tiling, struct of arrays |
| GPU achieved BW < 50% of peak | Check coalescing |
| HBM bound | Kernel fusion (FlashAttention style) |
| Disk swap | Working set > RAM → reduce the batch size |
| GPU OOM | Activation checkpointing, offload |

**Default**: measure first (perf, ncu), then start with tiling/fusion of the hot loop.

## 🔗 Graph
- Parent: [[Performance-Optimization]]
- Applications: [[Flash Attention]], [[Kernel-Fusion]]
- Adjacent: [[HBM]], [[NVLink]]

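## 🤖 LLM usage
**When to use**:
- Roofline analysis / bottleneck hypotheses.
- First drafts of tiling code transformations (CUDA/CPU).
- Clues for debugging cache misses.

**When not to use**:
- Exact hardware specs (vendor documentation is required).
- Asserting optimization gains without measurement.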

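## ❌ Anti-patterns
- Optimizing without measuring (you do not know whether the code is LLC-bound).
- Column-major traversal of row-major data.
- Pointer chasing (a linked-list traversal can be roughly 10x slower than an array scan).
- Many small jobs on the GPU (kernel launch overhead dominates).
- False sharing (different cores writing to the same cache line).
- Ignoring huge pages / NUMA pinning.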
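
False sharing is easy to predict from the data layout alone. The sketch below checks whether two per-thread counters in an array fall on the same cache line. It assumes a 64B line and an array that starts on a line boundary, and the helper names are hypothetical.

```python
def cache_line_of(index, elem_bytes, line_bytes=64):
    """Cache line number holding element `index` of a line-aligned array."""
    return (index * elem_bytes) // line_bytes


def shares_line(i, j, elem_bytes, line_bytes=64):
    """True if elements i and j live on the same cache line."""
    return cache_line_of(i, elem_bytes, line_bytes) == cache_line_of(j, elem_bytes, line_bytes)


packed = shares_line(0, 1, elem_bytes=8)    # adjacent 8-byte counters
padded = shares_line(0, 1, elem_bytes=64)   # each counter padded to a full line
```

Packed 8-byte counters for threads 0 and 1 share a line, so writes from two cores keep invalidating each other; padding each counter to 64 bytes puts them on separate lines at the cost of extra memory.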

## 🧪 Verification / duplicates
- Verified. Based on 2026-era H100/B100 and modern x86. Trust level A.

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup |