---
id: cs-reservoir-sampling
title: Reservoir Sampling - random sample from a stream
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [cs, sampling, stream, vibe-coding]
tech_stack: { language: "TS", applicable_to: ["Backend", "CS"] }
applied_in: []
aliases: [reservoir sampling, stream sampling, Algorithm R, weighted reservoir, log sampling]
---

# Reservoir Sampling

> Draw K random samples from a stream of unknown size. **Algorithm R**. Used for log sampling, A/B tests, and representative subsets of large datasets.

## 📖 Core concepts

- Total size is unknown, or the data does not fit in memory.
- 1 pass.
- Every element is kept with equal probability.
- Memory: O(K).

## 💻 Code patterns

### Algorithm R (k=1)

```ts
function reservoirSampleOne<T>(stream: Iterable<T>): T | null {
  let sample: T | null = null;
  let count = 0;
  for (const item of stream) {
    count++;
    // Replace the current sample with probability 1/count.
    if (Math.random() < 1 / count) {
      sample = item;
    }
  }
  return sample; // null only when the stream is empty
}
```

→ Each item ends up as the sample with probability 1/n. The final result is uniform.

### k > 1

```ts
function reservoirSample<T>(stream: Iterable<T>, k: number): T[] {
  const reservoir: T[] = [];
  let count = 0;
  for (const item of stream) {
    count++;
    if (reservoir.length < k) {
      // Fill phase: keep the first k items.
      reservoir.push(item);
    } else {
      // Pick a slot in [0, count); the item survives only if the slot is inside the reservoir.
      const j = Math.floor(Math.random() * count);
      if (j < k) reservoir[j] = item;
    }
  }
  return reservoir;
}
```

→ A uniform random sample of K items.

### Proof (intuition)

```
n items, k=1 sample.
P(item i is the final sample) = ?

= P(Math.random() < 1/i at step i)            (item i is picked)
  × P(Math.random() >= 1/j for every j > i)   (the sample is not replaced)
= 1/i × i/(i+1) × ... × (n-1)/n
= 1/n ✓
```

→ Each item has probability 1/n. Uniform.

### Use case

```
- Log sampling: 10k QPS of logs → 100 samples / sec.
- Large dataset: 1B rows → 10k random rows.
- Streaming analytics: top-K estimation.
- Distributed system: each node keeps its own sample.
```

### Weighted reservoir (Algorithm A-Res)

```ts
// Each item has a weight w (must be positive). Sampling probability ∝ w.
function weightedReservoir<T>(stream: Iterable<[T, number]>, k: number): T[] {
  if (k <= 0) return [];
  const heap: [number, T][] = []; // kept sorted ascending by key, so heap[0] is the minimum
  for (const [item, weight] of stream) {
    const key = Math.pow(Math.random(), 1 / weight);
    if (heap.length < k) {
      heap.push([key, item]);
      heap.sort((a, b) => a[0] - b[0]); // simple; real code would use a priority queue
    } else if (key > heap[0][0]) {
      heap[0] = [key, item]; // evict the smallest key
      heap.sort((a, b) => a[0] - b[0]);
    }
  }
  return heap.map(([_, item]) => item);
}
```

→ Items with larger weights are sampled more often.

### Distributed reservoir

```
Each of N nodes keeps its own reservoir of K items.
The aggregator takes the N×K candidates and picks the final K.
Each node should also report how many items it saw: the merge must weight
candidates by per-node stream size, otherwise small streams are over-represented.
→ The MapReduce sampling pattern.
```

### Datadog log sample

```python
# Illustrative pseudocode, not the actual Datadog API.
# Each trace can have a different sample rate.
@trace
def handle(req):
    # Datadog samples only the high-volume traces
    pass
```

→ The APM maintains its own reservoir internally.

### Approximate quantile

```
Reservoir + sort = quantile estimate.
- 1M items → 10k samples → estimate the median.
- Exact: no (only a full sort guarantees it).
- Approximate: OK (rank error within about 1% for a 10k sample).
```

→ T-Digest / KLL are more accurate.

### Postgres TABLESAMPLE

```sql
SELECT * FROM big_table TABLESAMPLE BERNOULLI (1);  -- 1% (row-level)
SELECT * FROM big_table TABLESAMPLE SYSTEM (1);     -- 1% (block-level)
```

→ Sampling done by the database.

### A/B test sampling

```ts
// `murmur` is assumed to be a 32-bit unsigned hash function (not defined here).
function shouldSample(userId: string, rate: number): boolean {
  const hash = murmur(userId);
  return hash / 0x100000000 < rate; // maps the hash to [0, 1)
}

if (shouldSample(user.id, 0.05)) {
  // 5% of users
}
```

→ Hash-based and deterministic. The same user always gets the same result.

### Pitfall: Math.random() bias

```
JS Math.random() is not cryptographically secure.
- Fine for statistics.
- Poor for very large N (10^15+): a double carries at most 53 random bits.
→ General use = OK. Crypto = crypto.getRandomValues.
```

### Streaming top-K (different problem)

```
A reservoir keeps a random sample. Top-K keeps the largest items.
Top-K = min-heap of size K.
```

```ts
function topK<T>(stream: Iterable<[T, number]>, k: number): T[] {
  if (k <= 0) return [];
  // A sorted array stands in for a min-heap of size k; heap[0] is the smallest score.
  const heap: [number, T][] = [];
  for (const [item, score] of stream) {
    if (heap.length < k) {
      heap.push([score, item]);
      heap.sort((a, b) => a[0] - b[0]);
    } else if (score > heap[0][0]) {
      heap[0] = [score, item]; // evict the current minimum
      heap.sort((a, b) => a[0] - b[0]);
    }
  }
  return heap.map(x => x[1]);
}
```

### Probabilistic data structures (related)

```
- Bloom filter: membership.
- HyperLogLog: cardinality (count distinct).
- Count-Min Sketch: frequency.
- T-Digest / KLL: quantile.
- Reservoir: random sample.
→ The "cost of exactness vs approximation" trade-off.
```

→ [[CS_Probabilistic_Data_Structures]].

### ClickHouse sample

```sql
SELECT count() FROM events SAMPLE 0.1;
-- → 10% sample. Approximate count = result × 10.

SELECT * FROM events SAMPLE 1000;
-- → reads roughly 1000 rows.
```

### Approximate aggregation

```sql
-- Very large table
SELECT sum(amount) FROM orders SAMPLE 0.01;
-- → sum over a 1% sample; multiply by 100.
-- Fast, but approximate.
```

### Stratified sampling

```
Proportions per user segment:
- VIP users: sample 80%
- Free users: sample 1%
→ One reservoir per segment.
```

```ts
// Fragment: assumes `stream` yields items that carry a `segment` field.
const reservoirs = new Map();
for (const item of stream) {
  const segment = item.segment;
  if (!reservoirs.has(segment)) reservoirs.set(segment, []);
  // ... reservoir per segment
}
```

### Time-based sample

```
- First N events of every minute = sample.
- Random N of every hour.
- Continuous reservoir + decay (older = lower weight).
```

### When to use?

```
✓ Stream / large dataset.
✓ Memory limit.
✓ 1-pass.
✓ Approximate is OK.

✗ Exact results required.
✗ Bias is critical.
✗ Small dataset (just shuffle).
```

### Implementation references

```
- Apache Spark sample().
- ClickHouse SAMPLE clause.
- Postgres TABLESAMPLE.
- Datadog / Grafana log sampling.
- Most APMs (NewRelic, Sentry): trace sampling.
```

### Real-world

```
1B log lines → keep 100k:
- Reservoir K=100k.
- Memory: 100k × log size = small.
- 1 pass.
- Further analysis as post-processing.
```

### LLM dataset sampling

```python
# 1B web pages → 100M for training.
# Quality filter + reservoir.
# `pages`, `score`, `threshold`, and `reservoir_add` are placeholders, not a real API.
for page in pages:
    quality = score(page)
    if quality > threshold:
        reservoir_add(page, weight=quality)
```

→ Common Crawl-style datasets.

## 🤔 Decision guide

| Task | Recommendation |
|---|---|
| Stream sample | Reservoir |
| Weighted | A-Res |
| Top-K | Min-heap |
| Quantile | T-Digest / KLL |
| Cardinality | HyperLogLog |
| Membership | Bloom filter |
| Exact random subset | Full load + shuffle |
| DB sample | TABLESAMPLE |

## ❌ Anti-patterns

- **Load everything into memory + shuffle**: OOM on large datasets.
- **Take the first N as the sample**: biased (head only).
- **Math.random() for crypto**: not acceptable.
- **Plain reservoir on weighted data**: ignoring the weights introduces bias.
- **Distributed sampling that does not preserve segments**: stratification is lost.
- **Treating the sample as exact**: it is only approximate.

## 🤖 Hints for LLM use

- Algorithm R (k=1, k>1) is the standard.
- Weighted = A-Res.
- 1-pass + O(K) memory.
- A member of the probabilistic family (Bloom, HLL, ...).

## 🔗 Related documents

- [[CS_Probabilistic_Data_Structures]]
- [[CS_Bloom_Filter]]
- [[CS_Time_Series_Algorithms]]
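
As a sketch of the aggregator step in the distributed reservoir pattern: the merge below assumes each node reports both its reservoir and the number of items it saw. `NodeSample`, `mergeReservoirs`, and the injectable `rng` parameter are hypothetical names introduced for illustration only, not part of any library. The approach draws without replacement, choosing the source node in proportion to how many stream items it still represents, which is one way to keep the merged sample uniform.

```typescript
// Hypothetical shape: what each node reports to the aggregator.
interface NodeSample<T> {
  reservoir: T[]; // assumed to be a uniform sample of size min(k, seen)
  seen: number;   // total items the node observed in its stream
}

// Merge two per-node reservoirs into one sample of size min(k, a.seen + b.seen).
// `rng` defaults to Math.random and can be replaced for deterministic tests.
function mergeReservoirs<T>(
  a: NodeSample<T>,
  b: NodeSample<T>,
  k: number,
  rng: () => number = Math.random,
): NodeSample<T> {
  const poolA = [...a.reservoir];
  const poolB = [...b.reservoir];
  let remainingA = a.seen;
  let remainingB = b.seen;
  const merged: T[] = [];

  while (merged.length < k && remainingA + remainingB > 0) {
    // Choose the source in proportion to the stream items it still represents.
    const fromA = rng() * (remainingA + remainingB) < remainingA;
    const pool = fromA ? poolA : poolB;
    // Remove a uniformly random item from the chosen pool.
    const idx = Math.floor(rng() * pool.length);
    merged.push(pool.splice(idx, 1)[0]);
    if (fromA) remainingA--;
    else remainingB--;
  }
  return { reservoir: merged, seen: a.seen + b.seen };
}
```

More than two nodes can be handled by folding: merge the first two results, then merge that result with the third, and so on, since the returned `seen` carries the combined stream size.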