2nd/10_Wiki/Topics/Architecture/Hardware.md
Antigravity Agent f8b21af4be Wiki cleanup: error-doc removal, dedup merge, link normalization
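Large-scale cleanup of 10_Wiki/Topics:
- Removed 227 error-capture / unfinished stub documents
- Merged 43 cross-folder duplicate clusters (63 files → redirect)
- Link-name normalization: ~2,400 broken-link fixes, direct redirect links, and concept mappings
- Created 6 new category MOCs
- Removed 10,058 unresolved related-keyword links from Graph sections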

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 23:52:15 +09:00

id: wiki-2026-0508-hardware
title: Hardware
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: Hardware Basics, 하드웨어
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: hardware, performance, systems
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language = c, framework = systems

Hardware

One-line summary

"The ceiling of any software is the reality of its hardware." The performance of a modern stack (cloud LLM, browser, game) cannot be explained without understanding the CPU pipeline, cache hierarchy, memory bandwidth, storage tiers, NIC, and GPU. As of 2026, the baseline is Apple M4 Max, Nvidia H200/B200, AMD MI300X, and NVMe Gen5.

Key points

Latency numbers (Jeff Dean, 2026 update)

  • L1 cache: ~1 ns
  • L2 cache: ~3-4 ns
  • L3 cache: ~10-15 ns
  • DRAM: ~80-100 ns
  • NVMe Gen5 random read: ~10-20 µs
  • SSD SATA random read: ~100 µs
  • Same-DC RTT: ~0.5 ms
  • Cross-region RTT: 50-150 ms
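
The figures above can be combined into a rough serial latency budget. The sketch below is illustrative only: the constants are midpoints of the listed ranges rather than measurements, and `request_budget_ns` is a hypothetical helper name, not part of any library.

```c
/* Rough per-operation costs in nanoseconds, taken as midpoints of the
   ranges listed above (illustrative assumptions, not measurements). */
enum {
    DRAM_NS        = 90,      /* ~80-100 ns  */
    NVME_GEN5_NS   = 15000,   /* ~10-20 us   */
    SAME_DC_RTT_NS = 500000   /* ~0.5 ms     */
};

/* Hypothetical helper: latency of one request whose steps run serially. */
long request_budget_ns(long dram_misses, long nvme_reads, long dc_round_trips) {
    return dram_misses * DRAM_NS
         + nvme_reads * NVME_GEN5_NS
         + dc_round_trips * SAME_DC_RTT_NS;
}
```

For example, `request_budget_ns(1000, 2, 1)` gives 620000 ns (0.62 ms): a single same-DC round trip outweighs a thousand DRAM misses plus two NVMe reads.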

CPU

  • Pipeline: fetch, decode, execute, memory, writeback; modern cores are superscalar and out-of-order (OoO).
  • Branch predictor: a mispredict costs roughly 15-20 cycles.
  • SIMD: AVX-512, NEON, SVE2.
  • NUMA: on multi-socket systems, prefer node-local memory.
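
One common way to sidestep the mispredict penalty on unpredictable data is to turn the branch into arithmetic. A minimal sketch, assuming a simple threshold count (function names are illustrative); an optimizing compiler may already emit branch-free code for the first version, so measure before relying on this.

```c
#include <stddef.h>
#include <stdint.h>

/* Branchy version: the `if` is hard to predict when the data is random. */
size_t count_ge_branchy(const int32_t* v, size_t n, int32_t t) {
    size_t c = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] >= t) c++;
    }
    return c;
}

/* Branchless version: the comparison yields 0 or 1 and is added
   unconditionally, so there is no data-dependent jump to mispredict. */
size_t count_ge_branchless(const int32_t* v, size_t n, int32_t t) {
    size_t c = 0;
    for (size_t i = 0; i < n; i++) {
        c += (size_t)(v[i] >= t);
    }
    return c;
}
```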

Memory hierarchy

  • Cache line: typically 64 bytes; the unit at which false sharing occurs and must be avoided.
  • TLB: cache of page translations; a miss is expensive because it triggers a page-table walk.
  • HBM (GPU): H100 80GB @ 3.35 TB/s, B200 192GB @ 8 TB/s.
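
Cache-line granularity is why traversal order matters. The sketch below sums the same row-major matrix in two orders; the function names are illustrative, and the "16 ints per line" remark assumes 4-byte `int` and 64-byte lines.

```c
#include <stddef.h>

/* Row-major walk: consecutive addresses, so every 64-byte line that is
   fetched supplies 16 consecutive ints before the next line is needed. */
long sum_row_major(const int* m, size_t rows, size_t cols) {
    long s = 0;
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            s += m[r * cols + c];
    return s;
}

/* Column-major walk over the same row-major array: the stride is `cols`
   ints, so for large `cols` nearly every access touches a different line. */
long sum_col_major(const int* m, size_t rows, size_t cols) {
    long s = 0;
    for (size_t c = 0; c < cols; c++)
        for (size_t r = 0; r < rows; r++)
            s += m[r * cols + c];
    return s;
}
```

Both functions return the same sum; only the memory access pattern differs, which is where the runtime gap on large matrices comes from.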

Storage / IO

  • NVMe Gen5: ~14 GB/s sequential, millions of IOPS.
  • io_uring: batched, asynchronous submission of I/O requests, amortizing syscall cost.
  • RDMA: kernel-bypass networking.

Applications

  1. Latency-sensitive trading / gaming.
  2. ML training (HBM, NVLink).
  3. Database (page cache, WAL, SSD wear).
  4. Browser rendering (GPU compositing).

💻 Patterns

Cache-line aware (avoiding false sharing)

#include <stdalign.h>
struct counters {
    alignas(64) _Atomic long a;  // each counter sits on its own cache line
    alignas(64) _Atomic long b;
};
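
The layout of such a struct can be checked at compile time. A minimal sketch, assuming C11 and 64-byte cache lines; `counters_padded` and `counter_gap` are illustrative names that mirror the struct above.

```c
#include <assert.h>    /* static_assert (C11) */
#include <stdalign.h>  /* alignas */
#include <stddef.h>    /* offsetof, size_t */

struct counters_padded {
    alignas(64) _Atomic long a;
    alignas(64) _Atomic long b;
};

/* Fail the build if the two counters could ever share a 64-byte line. */
static_assert(offsetof(struct counters_padded, b)
              - offsetof(struct counters_padded, a) >= 64,
              "a and b must not share a cache line");
static_assert(sizeof(struct counters_padded) == 128,
              "two full cache lines expected");

/* Distance in bytes between the two counters. */
size_t counter_gap(void) {
    return offsetof(struct counters_padded, b)
         - offsetof(struct counters_padded, a);
}
```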

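SIMD (AVX2 + FMA dot product)

#include <immintrin.h>  // AVX2 + FMA intrinsics; build with -mavx2 -mfma
float dot(const float* a, const float* b, int n) {
    __m256 acc = _mm256_setzero_ps();
    int i = 0;
    for (; i + 8 <= n; i += 8) {              // 8 floats per iteration
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        acc = _mm256_fmadd_ps(va, vb, acc);   // acc += va * vb
    }
    float buf[8];
    _mm256_storeu_ps(buf, acc);
    float s = 0;
    for (int k = 0; k < 8; k++) s += buf[k];  // horizontal sum of the 8 lanes
    for (; i < n; i++) s += a[i] * b[i];      // scalar tail when n % 8 != 0
    return s;
}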

Prefetch hint

for (int i = 0; i < n; i++) {
    if (i + 16 < n)                        // stay inside the array
        __builtin_prefetch(&arr[i + 16]);  // hint: fetch 16 elements ahead
    process(arr[i]);
}
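
Software prefetch tends to pay off most where the hardware prefetcher cannot guess the next address, such as indexed (gather) access. A minimal sketch using the GCC/Clang `__builtin_prefetch` builtin; the distance of 16 and the name `gather_sum` are assumptions to be tuned and renamed per workload.

```c
#include <stddef.h>

#define PREFETCH_DISTANCE 16  /* assumption: tune by measurement */

/* Sum data[idx[i]] for i in [0, n). The index array is read sequentially,
   but the data accesses are effectively random, so the address needed
   PREFETCH_DISTANCE iterations from now is requested early. */
long gather_sum(const long* data, const size_t* idx, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* rw = 0 (read), locality = 1 (low temporal reuse) */
            __builtin_prefetch(&data[idx[i + PREFETCH_DISTANCE]], 0, 1);
        }
        s += data[idx[i]];
    }
    return s;
}
```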

Linux perf (hardware counter)

perf stat -e cycles,instructions,cache-misses,branch-misses ./bench
perf record -g ./bench && perf report

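io_uring (high-IOPS read)

#include <liburing.h>                                // link with -luring
struct io_uring ring;
io_uring_queue_init(256, &ring, 0);                  // 256-entry submission queue
struct io_uring_sqe* sqe = io_uring_get_sqe(&ring);  // NULL if the queue is full
io_uring_prep_read(sqe, fd, buf, len, offset);
io_uring_submit(&ring);                              // one syscall for all queued SQEs
struct io_uring_cqe* cqe;
io_uring_wait_cqe(&ring, &cqe);                      // cqe->res: bytes read or -errno
io_uring_cqe_seen(&ring, cqe);                       // mark the completion as consumed
io_uring_queue_exit(&ring);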

NUMA pin

numactl --cpunodebind=0 --membind=0 ./server

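GPU memcpy bandwidth (CUDA)

// For the copy to be truly asynchronous, h_x should be pinned host memory (cudaMallocHost).
cudaMemcpyAsync(d_x, h_x, n*sizeof(float), cudaMemcpyHostToDevice, s);
// H100 over PCIe Gen5 x16: ~50 GB/s host-to-device; NVLink (GPU-to-GPU): ~900 GB/s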

Decision criteria

| Situation | Approach |
| --- | --- |
| Memory-bound hot loop | SIMD + cache blocking + prefetch |
| Multi-threaded counters | per-thread counters + cache-line padding |
| Random small IO | NVMe + io_uring |
| Sequential large IO | mmap or O_DIRECT |
| LLM inference | GPU (HBM bandwidth is the bottleneck) |
| Multi-socket | NUMA pin + local allocation |

Default: measure first (perf, FlameGraph); do not guess.
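
Cache blocking is named in the criteria but not shown in the patterns. A minimal sketch of a blocked (tiled) square-matrix transpose follows; the tile size of 32 is an assumption (32 x 32 floats = 4 KiB per tile) and should be tuned by measurement.

```c
#include <stddef.h>

#define TILE 32  /* assumption: 32x32 floats = 4 KiB per tile */

/* Transpose an n x n row-major matrix tile by tile, so the source rows and
   destination rows touched by one tile stay cache-resident while in use. */
void transpose_blocked(const float* src, float* dst, size_t n) {
    for (size_t ii = 0; ii < n; ii += TILE) {
        for (size_t jj = 0; jj < n; jj += TILE) {
            size_t i_end = ii + TILE < n ? ii + TILE : n;  /* clamp edge tiles */
            size_t j_end = jj + TILE < n ? jj + TILE : n;
            for (size_t i = ii; i < i_end; i++)
                for (size_t j = jj; j < j_end; j++)
                    dst[j * n + i] = src[i * n + j];
        }
    }
}
```

A naive transpose writes `dst` with a stride of `n` floats on every store; tiling bounds the working set so each fetched line is fully used before it is evicted.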

🔗 Graph

🤖 LLM usage

When: latency budget analysis, hardware-software co-design, capacity planning. When not: high-level CRUD apps, where framework defaults are sufficient.

Anti-patterns

  • False sharing: multiple threads write to the same cache line.
  • Pointer chasing in a hot loop: a chain of dependent cache misses.
  • Ignoring NUMA: on multi-socket systems, cross-node access becomes the bottleneck.
  • Sync syscall in the hot path: amortize with io_uring or batching.
  • Confusing bandwidth with latency: even with 8 TB/s of HBM bandwidth, latency is still several hundred ns.
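
The last point can be made concrete with Little's law: sustained bandwidth requires bandwidth x latency bytes in flight. A minimal sketch; `bytes_in_flight` is an illustrative helper, and the 500 ns used in the example is an assumed value within "several hundred ns", not a measured figure.

```c
/* Little's law: bytes that must be outstanding at any moment to sustain
   a given bandwidth when each request takes `latency_s` to complete. */
double bytes_in_flight(double bandwidth_bytes_per_s, double latency_s) {
    return bandwidth_bytes_per_s * latency_s;
}
```

With 8 TB/s and an assumed 500 ns, `bytes_in_flight(8e12, 500e-9)` comes to about 4 MB of outstanding requests, which is why high bandwidth is only reachable with massive memory-level parallelism and does nothing for a single dependent load.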

🧪 Verification / Duplicates

  • Verified (Hennessy & Patterson Computer Architecture 7ed 2024, Intel SDM, Nvidia H200/B200 whitepaper, Linux kernel docs).
  • Trust level: A.

🕓 Changelog

| Date | Change |
| --- | --- |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: latency numbers and CPU/GPU/NVMe 2026 baseline tidied |