id: wiki-2026-0508-hardware
title: Hardware
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases:
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: hardware, performance, systems
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language: c, framework: systems
Hardware
In one line
"The ceiling of software is the reality of hardware." The performance of any modern stack (cloud LLM, browser, game) cannot be explained without understanding the CPU pipeline, cache hierarchy, memory bandwidth, storage tiers, NIC, and GPU. As of 2026, the baseline is Apple M4 Max, Nvidia H200/B200, AMD MI300X, and NVMe Gen5.
Key concepts
Latency numbers (after Jeff Dean, 2026 update)
L1 cache: ~1 ns
L2 cache: ~3-4 ns
L3 cache: ~10-15 ns
DRAM: ~80-100 ns
NVMe Gen5 random read: ~10-20 µs
SATA SSD random read: ~100 µs
Same-DC RTT: ~0.5 ms
Cross-region RTT: 50-150 ms
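To make the figures above usable for a latency budget, here is a minimal sketch that adds up the cost of a request from operation counts. The enum, the table, and `latency_budget_ns` are hypothetical names introduced for illustration, and the per-operation values are midpoints picked from the ranges above, not measurements. It assumes the operations run strictly one after another, so the result is a pessimistic upper bound.

```c
#include <assert.h>
#include <stddef.h>

/* Operation kinds taken from the latency list above. */
enum op_kind {
    OP_L1, OP_L2, OP_L3, OP_DRAM,
    OP_NVME_GEN5, OP_SATA_SSD, OP_SAME_DC_RTT,
    OP_KIND_COUNT
};

/* Assumed representative latencies in nanoseconds (midpoints of the
 * ranges above; illustrative only). */
static const double OP_LATENCY_NS[OP_KIND_COUNT] = {
    [OP_L1]          = 1.0,
    [OP_L2]          = 3.5,
    [OP_L3]          = 12.5,
    [OP_DRAM]        = 90.0,
    [OP_NVME_GEN5]   = 15000.0,
    [OP_SATA_SSD]    = 100000.0,
    [OP_SAME_DC_RTT] = 500000.0,
};

struct op_count {
    enum op_kind kind;
    unsigned long count;
};

/* Sum of count * latency over all steps, assuming no overlap between
 * operations (serialized execution). */
double latency_budget_ns(const struct op_count *ops, size_t n) {
    assert(ops != NULL || n == 0);
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
        assert(ops[i].kind < OP_KIND_COUNT);
        total += (double)ops[i].count * OP_LATENCY_NS[ops[i].kind];
    }
    return total;
}
```

For example, 1000 DRAM misses, 2 NVMe Gen5 reads, and 1 same-DC round trip come to 90 µs + 30 µs + 500 µs = 620 µs under these assumptions, which shows the network hop dominating the budget.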
CPU
Pipeline: fetch, decode, execute, memory access, writeback; modern cores are superscalar and out-of-order (OoO).
Branch predictor: a mispredict costs roughly 15-20 cycles.
SIMD: AVX-512, NEON, SVE2.
NUMA: on multi-socket systems, prefer node-local memory.
Memory hierarchy
Cache line: a 64-byte unit; this is the granularity at which false sharing has to be avoided.
TLB: a cache of page translations; misses are expensive.
HBM (GPU): H100 80 GB @ 3.35 TB/s, B200 192 GB @ 8 TB/s.
Storage / IO
NVMe Gen5: ~14 GB/s sequential, millions of IOPS.
io_uring: batched submission of I/O requests, which amortizes syscall overhead.
RDMA: kernel-bypass networking.
Applications
Latency-sensitive trading / gaming.
ML training (HBM, NVLink).
Database (page cache, WAL, SSD wear).
Browser rendering (GPU compositing).
💻 Patterns
Cache-line aware (avoiding false sharing)
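A minimal sketch of this pattern, assuming the 64-byte line size noted earlier: each thread gets its own counter, and every counter is aligned to its own cache line so that writes from different threads never touch the same line. `CACHE_LINE_BYTES`, `MAX_THREADS`, and the function names are hypothetical, chosen for illustration.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64 /* assumed line size; verify on the target CPU */
#define MAX_THREADS 16      /* hypothetical upper bound for this sketch */

/* One counter per thread. The alignment forces sizeof to a full line,
 * so adjacent array elements never share a cache line. */
struct padded_counter {
    alignas(CACHE_LINE_BYTES) uint64_t value;
};

_Static_assert(sizeof(struct padded_counter) == CACHE_LINE_BYTES,
               "each counter must occupy exactly one cache line");

struct counter_set {
    struct padded_counter slot[MAX_THREADS];
};

/* Write path: only the thread that owns slot `tid` calls this,
 * so no atomic operation is needed here. */
static inline void counter_add(struct counter_set *c, unsigned tid,
                               uint64_t delta) {
    assert(tid < MAX_THREADS);
    c->slot[tid].value += delta;
}

/* Read path: sum all slots. Call this after the writers have finished
 * (or switch the slots to relaxed atomics if reading concurrently). */
static inline uint64_t counter_sum(const struct counter_set *c) {
    uint64_t total = 0;
    for (unsigned i = 0; i < MAX_THREADS; i++) {
        total += c->slot[i].value;
    }
    return total;
}
```

The trade-off is memory: 16 counters take 1 KiB instead of 128 bytes, which is usually a fair price for removing cross-core line bouncing on the write path.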
SIMD (AVX2 dot product)
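A hedged sketch of a dot product over 8 floats per iteration, assuming GCC or Clang on x86. The 256-bit float intrinsics used here are AVX-level operations; the function is compiled for and gated on AVX2 to match the heading. `__attribute__((target))` and `__builtin_cpu_supports` are compiler extensions, and the function names are hypothetical. A scalar fallback covers other CPUs and architectures.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* Compiled with AVX2 enabled for this function only, so the rest of the
 * translation unit still runs on older CPUs. */
__attribute__((target("avx2")))
static float dot_avx2(const float *a, const float *b, size_t n) {
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i); /* unaligned loads */
        __m256 vb = _mm256_loadu_ps(b + i);
        acc = _mm256_add_ps(acc, _mm256_mul_ps(va, vb));
    }
    /* Horizontal reduction of the 8 partial sums. */
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    float sum = 0.0f;
    for (int k = 0; k < 8; k++) {
        sum += lanes[k];
    }
    /* Scalar tail for the remaining n % 8 elements. */
    for (; i < n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}
#endif

static float dot_scalar(const float *a, const float *b, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

/* Dispatch at run time: vector path if the CPU reports AVX2. */
float dot(const float *a, const float *b, size_t n) {
    assert(a != NULL && b != NULL);
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx2")) {
        return dot_avx2(a, b, n);
    }
#endif
    return dot_scalar(a, b, n);
}
```

Note that the vector path sums in a different order than the scalar loop, so results can differ in the last bits for general floating-point inputs.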
Prefetch hint
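A minimal sketch of a software prefetch hint on an indirect (gather) access, assuming GCC or Clang, where `__builtin_prefetch` is available. The hardware prefetcher cannot guess `table[idx[i]]`, but the index stream is known ahead of time, so the code can request a future element early. `PREFETCH_DISTANCE` and `gather_sum` are hypothetical names, and the distance of 8 is a placeholder that should be tuned by measurement.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PREFETCH_DISTANCE 8 /* hypothetical lookahead; tune with perf */

/* Sum table[idx[i]] over an index stream that is known in advance. */
uint64_t gather_sum(const uint64_t *table, const uint32_t *idx, size_t n) {
    assert(table != NULL && idx != NULL);
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* Arguments: address, rw (0 = read), locality hint (0..3).
             * This is only a hint; it never changes program results. */
            __builtin_prefetch(&table[idx[i + PREFETCH_DISTANCE]], 0, 1);
        }
        sum += table[idx[i]];
    }
    return sum;
}
```

A prefetch only pays off when the table is much larger than the cache and the loop is latency bound; on small or sequential data it tends to be neutral or slightly harmful.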
Linux perf (hardware counter)
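The `perf` tool reads hardware counters through the Linux `perf_event_open` system call. The sketch below, which is Linux-only, uses that call directly to count user-space CPU cycles around a function. `count_cycles` and `spin_work` are hypothetical names for illustration. Opening the counter can fail (for example under a restrictive `perf_event_paranoid` setting or inside a VM without a PMU), so the function reports failure instead of assuming a counter exists.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Run fn(arg) and count user-space CPU cycles on the calling thread.
 * Returns 0 and stores the count in *cycles, or -1 if the counter could
 * not be opened or read. fn is run in either case. */
int count_cycles(void (*fn)(void *), void *arg, uint64_t *cycles) {
    assert(fn != NULL && cycles != NULL);

    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof attr;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;       /* start stopped; enabled explicitly below */
    attr.exclude_kernel = 1; /* user space only */
    attr.exclude_hv = 1;

    /* pid = 0 (this thread), cpu = -1 (any CPU), no group, no flags. */
    int fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL);
    if (fd < 0) {
        fn(arg);
        return -1;
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    fn(arg);
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t value = 0;
    ssize_t got = read(fd, &value, sizeof value);
    close(fd);
    if (got != (ssize_t)sizeof value) {
        return -1;
    }
    *cycles = value;
    return 0;
}

/* Hypothetical workload: increments a counter one million times. */
static void spin_work(void *arg) {
    volatile uint64_t *n = arg;
    for (int i = 0; i < 1000000; i++) {
        (*n)++;
    }
}
```

Swapping `attr.config` for `PERF_COUNT_HW_CACHE_MISSES` or `PERF_COUNT_HW_BRANCH_MISSES` measures the cache and branch effects discussed above with the same scaffolding.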
io_uring (high-IOPS read)
NUMA pin
GPU memcpy bandwidth (CUDA)
Decision criteria
Situation | Approach
Memory-bound hot loop | SIMD + cache blocking + prefetch
Counter shared by multiple threads | per-thread counters + cache-line padding
Random small IO | NVMe + io_uring
Sequential large IO | mmap or O_DIRECT
LLM inference | GPU (HBM bandwidth is the bottleneck)
Multi-socket | NUMA pin + local alloc
Default: measure first (perf, FlameGraph); do not guess.
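Cache blocking, named above as part of the approach for memory-bound hot loops, can be sketched with a tiled matrix transpose. A naive transpose strides through one of the two matrices and misses the cache on almost every access. Working tile by tile keeps both the source and destination tiles resident while they are being used. `TILE` and `transpose_blocked` are hypothetical names, and the tile edge of 32 is an assumption to be tuned by measurement.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 32 /* hypothetical tile edge: 32 x 32 floats = 4 KiB per tile */

/* dst (cols x rows) becomes the transpose of src (rows x cols).
 * Both matrices are row-major and must not overlap. */
void transpose_blocked(const float *src, float *dst,
                       size_t rows, size_t cols) {
    assert(src != NULL && dst != NULL && src != dst);
    for (size_t ii = 0; ii < rows; ii += TILE) {
        for (size_t jj = 0; jj < cols; jj += TILE) {
            /* Clamp the tile at the matrix edges. */
            size_t i_end = ii + TILE < rows ? ii + TILE : rows;
            size_t j_end = jj + TILE < cols ? jj + TILE : cols;
            for (size_t i = ii; i < i_end; i++) {
                for (size_t j = jj; j < j_end; j++) {
                    dst[j * rows + i] = src[i * cols + j];
                }
            }
        }
    }
}
```

In line with the measure-first default, the tile size is best chosen by comparing cache-miss counts for a few candidates on the target machine.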
🔗 Graph
🤖 LLM usage
When: latency budget analysis, hardware-software co-design, capacity planning.
When not: high-level CRUD apps, where framework defaults are sufficient.
❌ Anti-patterns
False sharing: multiple threads write to the same cache line.
Pointer chasing in a hot loop: a chain of dependent cache misses.
Ignoring NUMA: on multi-socket systems, cross-node access becomes the bottleneck.
Sync syscall in the hot path: amortize it with io_uring or batching.
Confusing bandwidth with latency: even with HBM at 8 TB/s, latency is still several hundred ns.
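The bandwidth versus latency point can be made concrete with Little's law: data in flight = bandwidth × latency. The sketch below uses hypothetical function names and takes bandwidth in bytes per nanosecond (1 TB/s = 1000 B/ns) to keep the arithmetic in exact integers. The 500 ns latency and 64-byte request size in the example are assumed stand-ins, not vendor figures.

```c
#include <assert.h>
#include <stdint.h>

/* Little's law: bytes that must be in flight to sustain a bandwidth
 * at a given latency. Bandwidth is in bytes per nanosecond. */
uint64_t bytes_in_flight(uint64_t bandwidth_bytes_per_ns,
                         uint64_t latency_ns) {
    return bandwidth_bytes_per_ns * latency_ns;
}

/* Number of concurrent requests of request_bytes each needed to keep
 * that much data in flight (rounded up). */
uint64_t requests_in_flight(uint64_t bandwidth_bytes_per_ns,
                            uint64_t latency_ns,
                            uint64_t request_bytes) {
    assert(request_bytes > 0);
    uint64_t bytes = bytes_in_flight(bandwidth_bytes_per_ns, latency_ns);
    return (bytes + request_bytes - 1) / request_bytes;
}
```

At 8 TB/s (8000 B/ns) and an assumed 500 ns latency, about 4 MB must be in flight, or 62,500 concurrent 64-byte requests. A single dependent chain of loads gets one request per latency period and nowhere near peak bandwidth, which is why high bandwidth does not imply low latency.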
🧪 Verification / duplicates
Verified (Hennessy & Patterson Computer Architecture 7ed 2024, Intel SDM, Nvidia H200/B200 whitepaper, Linux kernel docs).
Trust level: A.
🕓 Changelog
Date | Change
2026-05-08 | Phase 1
2026-05-10 | Manual cleanup: latency numbers and the 2026 CPU/GPU/NVMe baseline tidied up