---
id: wiki-2026-0508-hash-functions-and-maps
title: Hash Functions and Maps
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Hash Tables, Hash Maps, Dictionaries, HashMap]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [data-structures, algorithms, hashing, hash-tables, performance]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Rust
  framework: std::collections + ahash + FxHash
---

# Hash Functions and Maps

## In one line

> **"A data structure that maps each key to a bucket index, giving average O(1) lookup and insert."** The idea originates with Hans Peter Luhn at IBM in 1953. Rust's modern HashMap (SipHash), Google SwissTable / Abseil flat_hash_map, and Python dict (open addressing + perturbation) are all derivatives. Cryptographic hashes (SHA-256/3, BLAKE3) are distinct from non-crypto hashes (xxHash, ahash, FxHash).

## Core concepts

### Hash function properties

- **Determinism**: same input → same output.
- **Uniformity**: outputs are uniformly distributed over the output range.
- **Avalanche**: a single-bit change in the input flips about 50% of the output bits.
- **Speed** (non-crypto): ahash/xxHash process input at GB/s.
- **Collision resistance** (crypto): finding x≠y with h(x)=h(y) is computationally infeasible.

### Hash table strategies

- **Separate chaining**: each bucket holds a linked list or tree (the Java HashMap default; since Java 8 a bucket converts from list to tree at 8 entries).
- **Open addressing**: on collision, probe an alternative slot.
  - Linear probing: +1, +2, ... (cache-friendly but prone to clustering).
  - Quadratic probing: +1², +2², ...
  - Double hashing: +h_2(k), +2h_2(k), ...
- **Robin Hood hashing**: minimizes displacement from the home slot (used historically by Rust's `std` HashMap, before it switched to hashbrown).
- **SwissTable** (2017+, Google): SIMD-scanned metadata + open addressing; among the fastest general-purpose designs today.

### Load factor & resizing

- Load factor α = n/m (n stored entries, m slots). Recommended: α < 0.75 for open addressing, α < 1 for chaining.
- Resize: double the table (m → 2m) and rehash all keys.
- Insert is amortized O(1).

### Applications

1. **Symbol table** (compiler).
2. **Cache** (LRU, LFU).
3. **Set membership** (HashSet).
4. **Counting** (frequency).
5. **Dedup**.
6. **Database index** (hash join, hash partition).

## 💻 Patterns

### Rust: modern HashMap

```rust
use std::collections::HashMap;

fn main() {
    // Default hasher: SipHash-1-3 (DoS-resistant but slower).
    let mut map: HashMap<String, i32> = HashMap::new();
    map.insert("alice".to_string(), 30);
    map.insert("bob".to_string(), 25);

    // Ergonomic entry API: insert a default if missing, then update in place.
    *map.entry("alice".to_string()).or_insert(0) += 1;

    if let Some(age) = map.get("alice") {
        println!("Alice: {}", age);
    }

    // Iteration (order is unspecified).
    for (k, v) in &map {
        println!("{} = {}", k, v);
    }
}
```

### Rust: ahash (fast non-crypto, DoS resistant)

```rust
// Cargo.toml: ahash = "0.8"
use ahash::AHashMap;

fn main() {
    let mut map: AHashMap<&str, i32> = AHashMap::new();
    map.insert("hello", 1);
    // Roughly 5x faster than SipHash, and faster still when AES-NI is available.
    // A reasonable production default, depending on the workload.
}
```

### Rust: FxHash (ultra-fast for known keys)

```rust
// Cargo.toml: rustc-hash = "1.1"
use rustc_hash::FxHashMap;

fn main() {
    let mut map: FxHashMap<i32, &str> = FxHashMap::default();
    map.insert(42, "answer");
    // Used inside rustc. NOT DoS-resistant: for untrusted input use SipHash or aHash.
}
```

### Custom Hash (Rust trait)

```rust
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

#[derive(PartialEq, Eq)]
struct Point { x: i32, y: i32 }

impl Hash for Point {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // Feed each field into the hasher. Write this by hand only when you
        // need more care than #[derive(Hash)] provides.
        self.x.hash(state);
        self.y.hash(state);
    }
}

fn main() {
    // Point can be used as a key because it implements Hash + Eq.
    let mut visits: HashMap<Point, u32> = HashMap::new();
    *visits.entry(Point { x: 1, y: 2 }).or_insert(0) += 1;
    println!("{}", visits[&Point { x: 1, y: 2 }]);
}
```

### C++: std::unordered_map vs absl::flat_hash_map

```cpp
#include <iostream>
#include <string>

#include "absl/container/flat_hash_map.h"

int main() {
    // std::unordered_map: chaining, slower because of pointer chasing.
    // absl::flat_hash_map: SwissTable, often ~2-3x faster.
    absl::flat_hash_map<std::string, int> map;
    map["alice"] = 30;
    map["bob"] = 25;

    if (auto it = map.find("alice"); it != map.end()) {
        std::cout << it->second << "\n";
    }
}
```

### Open Addressing (simple linear probing)

```python
class LinearProbingHashMap:
    """Open-addressing map with linear probing (no deletion support).

    None marks an empty slot, so None cannot be used as a key.
    """

    def __init__(self, capacity=16):
        self.capacity = capacity
        self.size = 0
        self.keys = [None] * capacity
        self.values = [None] * capacity

    def _probe(self, key):
        # Walk forward from the home slot until the key or an empty slot is found.
        idx = hash(key) % self.capacity
        while self.keys[idx] is not None and self.keys[idx] != key:
            idx = (idx + 1) % self.capacity
        return idx

    def put(self, key, value):
        # Grow once the load factor reaches 0.75 so probe chains stay short.
        if self.size >= self.capacity * 0.75:
            self._resize()
        idx = self._probe(key)
        if self.keys[idx] is None:
            self.size += 1
        self.keys[idx] = key
        self.values[idx] = value

    def get(self, key):
        idx = self._probe(key)
        return self.values[idx] if self.keys[idx] is not None else None

    def _resize(self):
        # Double the capacity and reinsert every key (full rehash).
        old_keys, old_values = self.keys, self.values
        self.capacity *= 2
        self.keys = [None] * self.capacity
        self.values = [None] * self.capacity
        self.size = 0
        for k, v in zip(old_keys, old_values):
            if k is not None:
                self.put(k, v)
```

### Cryptographic hash (SHA-256, BLAKE3)

```rust
// Cargo.toml (versions are an assumption): sha2 = "0.10", blake3 = "1"
use sha2::{Digest, Sha256};

fn main() {
    // SHA-256: widely supported but slow (~600 MB/s, hardware dependent).
    let mut hasher = Sha256::new();
    hasher.update(b"hello world");
    let result = hasher.finalize();
    println!("{:x}", result);

    // BLAKE3: a very fast modern crypto hash (~6 GB/s with SIMD).
    let hash = blake3::hash(b"hello world");
    println!("{}", hash);
}
```

### Bloom Filter (hash-based set, false positives OK)

```python
import mmh3  # MurmurHash3
from bitarray import bitarray


class BloomFilter:
    def __init__(self, size, num_hashes):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bitarray(size)
        self.bits.setall(0)

    def add(self, item):
        # Use the loop index as the seed to get num_hashes different hashes.
        for i in range(self.num_hashes):
            idx = mmh3.hash(item, i) % self.size
            self.bits[idx] = 1

    def contains(self, item):
        return all(
            self.bits[mmh3.hash(item, i) % self.size]
            for i in range(self.num_hashes)
        )


bf = BloomFilter(size=10000, num_hashes=7)
bf.add("alice")
print(bf.contains("alice"))  # True (no false negatives)
print(bf.contains("bob"))    # False (definitely absent), or True on a false positive
```

## Decision criteria

| Situation | Hash function / Map |
|---|---|
| Rust trusted input, max speed | FxHash |
| Rust untrusted input | std HashMap (SipHash) or aHash |
| C++ general | absl::flat_hash_map (SwissTable) |
| Python | dict (built-in, optimized) |
| Distributed cache key | xxHash3 / FNV-1a |
| Cryptographic | BLAKE3 (speed) / SHA-3 (NIST) |
| Bloom filter | MurmurHash3 |
| String interning | weak hash + linear probe |
| Ordered iteration | BTreeMap (not hash) |

**Defaults**: Rust `std::collections::HashMap`, C++ `absl::flat_hash_map`, Python `dict`. For performance-critical paths, switch to ahash/FxHash.

## 🔗 Graph

- Variants: [[Bloom-Filter]] · [[HyperLogLog]] · [[Consistent-Hashing]]
- Adjacent: [[SHA-256]] · [[xxHash]]

## 🤖 LLM usage

**When**: advice on choosing a hash function, review of a hash table implementation, root-cause analysis of collisions.

**When not**: implementing a cryptographic hash yourself (use an audited library), or hand-writing a hash function for production.

## ❌ Anti-patterns

- **Hashing strings by length only**: catastrophic collisions.
- **FxHash on untrusted input**: opens the door to HashDoS attacks; use SipHash/aHash.
- **New uses of MD5/SHA-1**: both are broken; use BLAKE3/SHA-256.
- **Non-uniform modular reduction of the hash**: use a power-of-2 size + bitmask, or fastrange.
- **Open addressing at a high load factor**: α > 0.9 is catastrophic; resize.
- **Default hash on complex keys**: the distribution may be poor; write a custom impl.

## 🧪 Verification / Duplicates

- Verified (Knuth TAOCP Vol 3, "Designing a fast, efficient, cache-friendly hash table", Abseil docs).
- Trust level A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: Rust/C++/Python implementations, SwissTable, ahash, BLAKE3 |
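
The index-reduction advice in the anti-pattern list (power-of-2 size + bitmask, or fastrange) can be sketched in Rust as below. This is a minimal illustration, not any library's implementation: the function names `index_pow2` and `index_fastrange` are hypothetical, and the fastrange variant shown is the common multiply-and-shift form, assumed here for a 64-bit hash.

```rust
/// Reduce a 64-bit hash to an index in [0, n), where n is a power of two.
/// Keeps only the low bits of the hash.
fn index_pow2(hash: u64, n: u64) -> u64 {
    debug_assert!(n.is_power_of_two());
    hash & (n - 1)
}

/// Fastrange-style reduction (hypothetical helper): maps a 64-bit hash to
/// [0, n) for any n > 0 by taking the high 64 bits of the 128-bit product.
fn index_fastrange(hash: u64, n: u64) -> u64 {
    ((hash as u128 * n as u128) >> 64) as u64
}

fn main() {
    // Power-of-two table: a single AND replaces the modulo.
    println!("{}", index_pow2(1024 + 5, 1024));
    // Arbitrary table size: a multiply and shift replace the modulo.
    println!("{}", index_fastrange(1u64 << 63, 1000));
}
```

The bitmask looks only at the low bits of the hash, while the multiply-and-shift form depends mostly on the high bits, so each assumes the hash function mixes that end of the word well.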