Antigravity Agent f8b21af4be Wiki cleanup: error-doc removal, dedup merge, link normalization

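Major cleanup of 10_Wiki/Topics:
- Removed 227 error-capture and unfinished stub documents
- Merged 43 cross-folder duplicate clusters (63 files → redirect)
- Normalized link names: fixed broken links, pointed links directly at redirect targets, mapped concepts (~2,400 changes)
- Created 6 new category MOCs
- Removed 10,058 unresolved related-keyword links from Graph sections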

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-20 23:52:15 +09:00

id, title, category, status, canonical_id, aliases, duplicate_of, source_trust_level, confidence_score, verification_status, tags, raw_sources, last_reinforced, github_commit, tech_stack

title

Hardware

One-liner

"The ceiling of software is the reality of hardware." The performance of any modern stack (cloud LLM, browser, game) cannot be explained without understanding the CPU pipeline, cache hierarchy, memory bandwidth, storage tiers, NIC, and GPU. As of 2026, the baseline is Apple M4 Max, Nvidia H200/B200, AMD MI300X, and NVMe Gen5.

Key points

Latency numbers (after Jeff Dean, 2026 update)

L1 cache: ~1 ns
L2 cache: ~3-4 ns
L3 cache: ~10-15 ns
DRAM: ~80-100 ns
NVMe Gen5 random read: ~10-20 µs
SSD SATA random read: ~100 µs
Same-DC RTT: ~0.5 ms
Cross-region RTT: 50-150 ms
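
To make these numbers concrete, a time budget can be converted into a count of dependent (one-after-another) operations. The following is a minimal sketch: the helper name `serial_ops_in_budget` and the constants are illustrative assumptions, and the latencies are the approximate upper-end figures from the list above.

```c
#include <stdint.h>

/* Approximate latencies from the list above, in nanoseconds. */
enum {
    DRAM_NS        = 100,     /* upper end of ~80-100 ns */
    NVME_GEN5_NS   = 20000,   /* upper end of ~10-20 us  */
    SAME_DC_RTT_NS = 500000   /* ~0.5 ms                 */
};

/* How many dependent operations of a given latency fit into a budget.
   Hypothetical helper for illustration; it ignores overlap and queueing. */
uint64_t serial_ops_in_budget(uint64_t budget_ns, uint64_t op_latency_ns) {
    return op_latency_ns == 0 ? 0 : budget_ns / op_latency_ns;
}
```

Under these assumptions, a 1 ms budget (1,000,000 ns) covers about 10,000 serial DRAM misses, 50 serial NVMe Gen5 random reads, or 2 same-DC round trips.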

CPU

Pipeline: fetch, decode, execute, memory, writeback; modern cores are superscalar and out-of-order (OoO).
Branch predictor: a mispredict costs roughly 15-20 cycles.
SIMD: AVX-512, NEON, SVE2.
NUMA: on multi-socket systems, prefer local memory.
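
One way to sidestep the mispredict penalty is to remove the data-dependent branch altogether. The sketch below contrasts a branchy and a branchless count; both function names are hypothetical, and an optimizing compiler may already emit the same code for both, so any gain should be confirmed by measurement.

```c
#include <stddef.h>
#include <stdint.h>

/* Branchy version: one conditional jump per element. On unpredictable
   data the branch predictor guesses wrong often. */
size_t count_ge_branchy(const int32_t *v, size_t n, int32_t threshold) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] >= threshold) count++;
    }
    return count;
}

/* Branchless version: the comparison result (0 or 1) is added directly,
   so there is no data-dependent jump to mispredict. */
size_t count_ge_branchless(const int32_t *v, size_t n, int32_t threshold) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        count += (size_t)(v[i] >= threshold);
    }
    return count;
}
```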

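Memory hierarchy

Cache line: 64 bytes on x86-64 and most Arm cores (some chips, such as Apple silicon, use 128 bytes); this is also the granularity at which false sharing must be avoided.
TLB: cache of page translations; a miss is expensive.
HBM (GPU): H100 80GB @ 3.35 TB/s, B200 192GB @ 8 TB/s.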
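
The HBM figures translate directly into a lower bound on how long one full pass over device memory takes. This is a back-of-the-envelope sketch; `stream_time_ms` is a hypothetical helper that assumes peak bandwidth is reached and ignores latency and caching.

```c
/* Time in milliseconds to stream `bytes` once at `bytes_per_sec`.
   Hypothetical helper: a lower bound that assumes peak bandwidth. */
double stream_time_ms(double bytes, double bytes_per_sec) {
    return bytes / bytes_per_sec * 1e3;
}
```

With the numbers above, `stream_time_ms(80e9, 3.35e12)` gives about 23.9 ms for an H100 and `stream_time_ms(192e9, 8e12)` gives about 24 ms for a B200, so a workload that must read all of memory on every step is capped at roughly 40 steps per second on either part.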

Storage / IO

NVMe Gen5: ~14 GB/s sequential, millions of IOPS.
io_uring: batched submission of I/O requests through shared rings, which amortizes syscall overhead.
RDMA: kernel-bypass networking.

Applications

Latency-sensitive trading / gaming.
ML training (HBM, NVLink).
Database (page cache, WAL, SSD wear).
Browser rendering (GPU compositing).

💻 Patterns

Cache-line padding (false sharing avoidance)

#include <stdalign.h>
struct counters {
    alignas(64) _Atomic long a;  // each counter on its own cache line
    alignas(64) _Atomic long b;
};
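
The same padding idea extends to per-thread counters, where each thread owns one slot and a reader sums the slots afterwards. The sketch below is one possible layout, not a definitive implementation: `CACHE_LINE`, `MAX_THREADS`, `counter_add`, and `counter_total` are assumed names and values, and the code expects each slot to be written only by its owning thread.

```c
#include <stdalign.h>
#include <stddef.h>

enum { CACHE_LINE = 64, MAX_THREADS = 8 };  /* assumed values for illustration */

/* One slot per thread. alignas pads each slot to a full cache line,
   so writers on different threads never touch the same line. */
struct padded_counter {
    alignas(CACHE_LINE) long value;
};

static struct padded_counter slots[MAX_THREADS];

/* Must be called only by the thread that owns slot `tid`. */
void counter_add(size_t tid, long delta) {
    slots[tid].value += delta;
}

/* Must be called after the writers have finished (for example, after
   joining the threads); reading during writes would be a data race. */
long counter_total(void) {
    long total = 0;
    for (size_t t = 0; t < MAX_THREADS; t++) total += slots[t].value;
    return total;
}
```

Compared with a single shared atomic, this trades memory (one cache line per thread) for the absence of cross-core line transfers on the write path.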

SIMD (AVX2 dot product)

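#include <immintrin.h>
// Build with -mavx2 -mfma: _mm256_fmadd_ps needs FMA support as well as AVX.
float dot(const float* a, const float* b, int n) {
    __m256 acc = _mm256_setzero_ps();
    int i = 0;
    for (; i + 8 <= n; i += 8) {                 // 8 floats per iteration
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        acc = _mm256_fmadd_ps(va, vb, acc);      // acc += va * vb
    }
    float buf[8]; _mm256_storeu_ps(buf, acc);    // spill lanes for a horizontal sum
    float s = 0; for (int k = 0; k < 8; k++) s += buf[k];
    for (; i < n; i++) s += a[i] * b[i];         // scalar tail when n % 8 != 0
    return s;
}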

Prefetch hint

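for (int i = 0; i < n; i++) {
    // Hint the element 16 ahead; the distance is workload-dependent and needs tuning.
    if (i + 16 < n) __builtin_prefetch(&arr[i + 16]);  // never form an address past the array
    process(arr[i]);
}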

Linux perf (hardware counter)

perf stat -e cycles,instructions,cache-misses,branch-misses ./bench
perf record -g ./bench && perf report

io_uring (high-IOPS read)

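#include <liburing.h>
struct io_uring ring;
io_uring_queue_init(256, &ring, 0);              // 256-entry submission queue
struct io_uring_sqe* sqe = io_uring_get_sqe(&ring);
io_uring_prep_read(sqe, fd, buf, len, offset);
io_uring_submit(&ring);                          // one syscall can submit many SQEs
struct io_uring_cqe* cqe;
io_uring_wait_cqe(&ring, &cqe);                  // cqe->res: bytes read, or -errno
io_uring_cqe_seen(&ring, cqe);                   // mark the completion as consumed
io_uring_queue_exit(&ring);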

NUMA pin

numactl --cpunodebind=0 --membind=0 ./server

GPU memcpy bandwidth (CUDA)

// h_x should be pinned (cudaMallocHost) for the copy to be truly asynchronous on stream s.
cudaMemcpyAsync(d_x, h_x, n*sizeof(float), cudaMemcpyHostToDevice, s);
// H100 over PCIe Gen5: ~50 GB/s, NVLink: ~900 GB/s

Decision criteria

Situation	Approach
Memory-bound hot loop	SIMD + cache blocking + prefetch
Multi-threaded counter	per-thread + cache-line padding
Random small IO	NVMe + io_uring
Sequential large IO	mmap or O_DIRECT
LLM inference	GPU (HBM bandwidth is the bottleneck)
Multi-socket	NUMA pin + local alloc

Default: measure first (perf, FlameGraph); do not guess.
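
Cache blocking, listed in the table for memory-bound hot loops, can be sketched with a tiled matrix transpose. This is a minimal illustration under assumptions: `transpose_blocked` is a hypothetical name and `TILE = 16` is an untuned value that should be chosen by measurement on the target machine.

```c
#include <stddef.h>

enum { TILE = 16 };  /* 16 x 16 floats = 1 KiB per tile; assumed, untuned */

/* Transpose a row-major rows x cols matrix into dst (cols x rows).
   Working tile by tile keeps both the source rows and the destination
   rows of the current tile resident in cache while they are reused. */
void transpose_blocked(const float *src, float *dst, size_t rows, size_t cols) {
    for (size_t ib = 0; ib < rows; ib += TILE) {
        for (size_t jb = 0; jb < cols; jb += TILE) {
            size_t imax = ib + TILE < rows ? ib + TILE : rows;  /* clamp edge tiles */
            size_t jmax = jb + TILE < cols ? jb + TILE : cols;
            for (size_t i = ib; i < imax; i++) {
                for (size_t j = jb; j < jmax; j++) {
                    dst[j * rows + i] = src[i * cols + j];
                }
            }
        }
    }
}
```

A naive transpose walks one of the two matrices with a large stride and touches a new cache line on nearly every access; the tiled version bounds the working set per tile instead.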

🔗 Graph

Variants: GPU · Memory Hierarchy

🤖 LLM usage

When: latency budget analysis, hardware-software co-design, capacity planning. When not: high-level CRUD apps, where framework defaults are sufficient.

❌ Anti-patterns

False sharing: multiple threads writing to the same cache line.
Pointer chasing in hot loop: a chain of dependent cache misses.
Ignoring NUMA: on multi-socket systems, cross-node memory access becomes the bottleneck.
Sync syscall in hot path: amortize with io_uring or batching.
Confusing bandwidth with latency: even at 8 TB/s, HBM access latency is still several hundred ns.
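
The pointer-chasing anti-pattern is easiest to see next to its contiguous alternative. The following sketch uses hypothetical names (`struct node`, `sum_list`, `sum_array`); both functions compute the same sum and differ only in memory access pattern.

```c
#include <stddef.h>

struct node {
    long value;
    struct node *next;
};

/* Pointer chasing: the address of the next element is known only after
   the current load completes, so cache misses are serialized. */
long sum_list(const struct node *head) {
    long total = 0;
    for (const struct node *p = head; p != NULL; p = p->next) total += p->value;
    return total;
}

/* Contiguous layout: addresses are predictable, so the hardware
   prefetcher and the out-of-order core can overlap many loads. */
long sum_array(const long *values, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++) total += values[i];
    return total;
}
```

How large the gap is depends on allocation pattern and working-set size, so it is worth confirming with `perf stat -e cache-misses` on real data.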

🧪 Verification / Duplicates

Verified (Hennessy & Patterson Computer Architecture 7ed 2024, Intel SDM, Nvidia H200/B200 whitepaper, Linux kernel docs).
Confidence: A.

🕓 Changelog

Date	Change
2026-05-08	Phase 1
2026-05-10	Manual cleanup: tidied latency numbers and the CPU/GPU/NVMe 2026 baseline
