---
id: wiki-2026-0508-hardware
title: Hardware
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Hardware Basics, 하드웨어]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [hardware, performance, systems]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: c
  framework: systems
---

# Hardware

## One-liner

> **"The ceiling of any software is the reality of its hardware."** The performance of a modern stack (cloud LLM, browser, game) cannot be explained without understanding the CPU pipeline, cache hierarchy, memory bandwidth, storage tier, NIC, and GPU. As of 2026 the baseline is Apple M4 Max, Nvidia H200/B200, AMD MI300X, and NVMe Gen5.

## Core

### Latency numbers (Jeff Dean, 2026 update)

- L1 cache: ~1 ns
- L2 cache: ~3-4 ns
- L3 cache: ~10-15 ns
- DRAM: ~80-100 ns
- NVMe Gen5 random read: ~10-20 µs
- SATA SSD random read: ~100 µs
- Same-DC RTT: ~0.5 ms
- Cross-region RTT: 50-150 ms

### CPU

- **Pipeline**: fetch, decode, execute, memory, writeback; modern cores are superscalar and out-of-order (OoO).
- **Branch predictor**: each mispredict costs roughly 15-20 cycles.
- **SIMD**: AVX-512, NEON, SVE2.
- **NUMA**: on multi-socket systems, prefer node-local memory.

### Memory hierarchy

- **Cache line**: typically 64 bytes (Apple silicon reports 128); this is the granularity at which false sharing must be avoided.
- **TLB**: a cache of page translations; a miss is expensive because it triggers a page-table walk.
- **HBM (GPU)**: H100 80 GB @ 3.35 TB/s, B200 192 GB @ 8 TB/s.

### Storage / IO

- NVMe Gen5: ~14 GB/s sequential, millions of IOPS.
- io_uring: batched submission of I/O requests, which amortizes syscall cost.
- RDMA: kernel-bypass networking.

### Applications

1. Latency-sensitive trading / gaming.
2. ML training (HBM, NVLink).
3. Database (page cache, WAL, SSD wear).
4. Browser rendering (GPU compositing).
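### Latency budget (sketch)

The latency numbers listed earlier are easiest to use once converted into a budget: how many dependent accesses of each tier fit into a given time slice. The following is a minimal sketch of that arithmetic, not a benchmark. The names `tier`, `serial_accesses`, and `print_budget` are hypothetical, and wherever the note gives a range the midpoint is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* One row of the latency table; values in nanoseconds. */
struct tier {
    const char *name;
    double latency_ns;
};

/* Figures from the latency list in this note. Where the note gives a
   range, the midpoint is used (an assumption made for this sketch). */
static const struct tier tiers[] = {
    {"L1 cache",                   1.0},
    {"L2 cache",                   3.5},
    {"L3 cache",                  12.5},
    {"DRAM",                      90.0},
    {"NVMe Gen5 random read",  15000.0},
    {"SATA SSD random read",  100000.0},
    {"Same-DC RTT",           500000.0},
};

/* Number of strictly dependent (serialized) accesses that fit in a budget.
   Independent accesses can overlap, so real code may do better than this. */
long serial_accesses(double budget_ns, double latency_ns) {
    assert(budget_ns >= 0.0 && latency_ns > 0.0);
    return (long)(budget_ns / latency_ns);
}

/* Prints, for each tier, its cost relative to L1 and how many
   serialized accesses fit into budget_ns. */
void print_budget(double budget_ns) {
    const size_t n = sizeof tiers / sizeof tiers[0];
    for (size_t i = 0; i < n; i++) {
        printf("%-22s %10.1f ns %9.1fx L1 %10ld per budget\n",
               tiers[i].name,
               tiers[i].latency_ns,
               tiers[i].latency_ns / tiers[0].latency_ns,
               serial_accesses(budget_ns, tiers[i].latency_ns));
    }
}
```

For a 1 ms budget, `serial_accesses(1e6, 90.0)` returns 11111: about eleven thousand dependent DRAM misses. The same budget covers only 66 NVMe random reads at the assumed 15 µs, or two same-DC round trips. Calling `print_budget(1e6)` from a `main` prints the full table.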
## 💻 Patterns

### Cache-line aware (avoiding false sharing)

```c
#include <stdalign.h>  // alignas (C11)

struct counters {
    alignas(64) _Atomic long a;  // each counter sits on its own cache line
    alignas(64) _Atomic long b;
};
```

### SIMD (AVX2 dot product)

```c
#include <immintrin.h>  // build with -mavx2 -mfma

float dot(const float* a, const float* b, int n) {
    __m256 acc = _mm256_setzero_ps();
    int i = 0;
    for (; i + 8 <= n; i += 8) {              // 8 floats per iteration
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        acc = _mm256_fmadd_ps(va, vb, acc);   // acc += va * vb (FMA)
    }
    float buf[8];
    _mm256_storeu_ps(buf, acc);
    float s = 0;
    for (int k = 0; k < 8; k++) s += buf[k];  // horizontal sum of the 8 lanes
    for (; i < n; i++) s += a[i] * b[i];      // scalar tail when n % 8 != 0
    return s;
}
```

### Prefetch hint

```c
for (int i = 0; i < n; i++) {
    if (i + 16 < n)                        // stay inside the array
        __builtin_prefetch(&arr[i + 16]);  // hint only; the distance of 16 needs tuning
    process(arr[i]);
}
```

### Linux perf (hardware counter)

```bash
perf stat -e cycles,instructions,cache-misses,branch-misses ./bench
perf record -g ./bench && perf report
```

### io_uring (high-IOPS read)

```c
// Requires <liburing.h>; link with -luring. Error checks omitted for brevity.
struct io_uring ring;
io_uring_queue_init(256, &ring, 0);        // 256-entry submission queue
struct io_uring_sqe* sqe = io_uring_get_sqe(&ring);
io_uring_prep_read(sqe, fd, buf, len, offset);
io_uring_submit(&ring);                    // one syscall can submit many SQEs
struct io_uring_cqe* cqe;
io_uring_wait_cqe(&ring, &cqe);            // cqe->res: bytes read, or -errno
io_uring_cqe_seen(&ring, cqe);             // mark the completion as consumed
```

### NUMA pin

```bash
numactl --cpunodebind=0 --membind=0 ./server
```

### GPU memcpy bandwidth (CUDA)

```cpp
// h_x should be pinned (cudaMallocHost) for the copy to be truly asynchronous.
cudaMemcpyAsync(d_x, h_x, n*sizeof(float), cudaMemcpyHostToDevice, s);
// Host-to-device goes over PCIe: H100 PCIe Gen5 ~50 GB/s. GPU-to-GPU NVLink: ~900 GB/s.
```

## Decision criteria

| Situation | Approach |
|---|---|
| Memory-bound hot loop | SIMD + cache blocking + prefetch |
| Multi-threaded counters | per-thread counters + cache-line padding |
| Random small IO | NVMe + io_uring |
| Sequential large IO | mmap or O_DIRECT |
| LLM inference | GPU (HBM bandwidth is the bottleneck) |
| Multi-socket | NUMA pin + local allocation |

**Default**: measure first (perf, FlameGraph); do not guess.

## 🔗 Graph

- Variants: [[GPU]] · [[Memory Hierarchy]]

## 🤖 LLM usage

**When**: latency budget analysis, hardware-software co-design, capacity planning.
**When not**: high-level CRUD apps, where framework defaults are enough.

## ❌ Anti-patterns

- **False sharing**: multiple threads writing to the same cache line.
- **Pointer chasing in hot loop**: a chain of dependent cache misses.
- **Ignoring NUMA**: on multi-socket systems, cross-node memory access becomes the bottleneck.
- **Sync syscall in hot path**: amortize with io_uring or batching.
- **Confusing bandwidth with latency**: even at 8 TB/s, HBM latency is still several hundred ns.

## 🧪 Verification / Duplicates

- Verified (Hennessy & Patterson *Computer Architecture* 7ed 2024, Intel SDM, Nvidia H200/B200 whitepaper, Linux kernel docs).
- Trust level A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: latency numbers and CPU/GPU/NVMe 2026 baseline tidied |