---
id: cs-reservoir-sampling
title: Reservoir Sampling (random sample from a stream)
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [cs, sampling, stream, vibe-coding]
tech_stack: { language: "TS", applicable_to: ["Backend", "CS"] }
applied_in: []
aliases: [reservoir sampling, stream sampling, Algorithm R, weighted reservoir, log sampling]
---
# Reservoir Sampling
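> Draw K random samples from a stream of unknown size. **Algorithm R**. Used for log sampling, A/B tests, and representative subsets of large datasets.
## 📖 Core concepts
- The total size is unknown, or the data does not fit in memory.
- Single pass.
- Every element has the same probability of being selected.
- Memory: O(K).
## 💻 Code patterns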
### Algorithm R (k=1)
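```ts
// Pick one item uniformly at random from a stream of unknown length.
function sampleOne<T>(stream: Iterable<T>): T | null {
  let sample: T | null = null;
  let count = 0;
  for (const item of stream) {
    count++;
    // Replace the current pick with probability 1/count.
    if (Math.random() < 1 / count) {
      sample = item;
    }
  }
  return sample; // stays null when the stream is empty
}
```
→ Each item ends up as the final pick with probability 1/n, so the result is uniform.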
### k > 1
```ts
function reservoirSample<T>(stream: Iterable<T>, k: number): T[] {
  const reservoir: T[] = [];
  let count = 0;
  for (const item of stream) {
    count++;
    if (reservoir.length < k) {
      // Fill phase: the first k items go straight in.
      reservoir.push(item);
    } else {
      // Item number `count` replaces a random slot with probability k/count.
      const j = Math.floor(Math.random() * count);
      if (j < k) reservoir[j] = item;
    }
  }
  return reservoir;
}
```
→ A uniform random sample of K items.
### Proof (intuition)
```
n items, k=1 sample.
P(item i is the final sample) = ?
= P(Math.random() < 1/i)                      (item i gets picked)
× P(Math.random() >= 1/j) for every j > i     (the pick is not replaced)
= 1/i × i/(i+1) × ... × (n-1)/n
= 1/n ✓
```
→ Every item has probability 1/n. Uniform.
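The same telescoping argument carries over to the k > 1 version. A sketch, assuming n ≥ k and writing i for the 1-based arrival position of an item:

```latex
% Case i > k: the item enters with probability k/i. At each later step j
% it is evicted with probability (k/j)(1/k) = 1/j.
P(\text{item } i \text{ in final reservoir})
  = \frac{k}{i} \prod_{j=i+1}^{n} \left(1 - \frac{1}{j}\right)
  = \frac{k}{i} \cdot \frac{i}{i+1} \cdots \frac{n-1}{n}
  = \frac{k}{n}

% Case i \le k: the item starts in the reservoir and only has to survive
% steps k+1, ..., n.
P(\text{item } i \text{ in final reservoir})
  = \prod_{j=k+1}^{n} \frac{j-1}{j}
  = \frac{k}{n}
```

Both cases give k/n, so every item is equally likely to be in the final sample.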
### Use case
```
- Log sampling: 10k QPS of logs → 100 samples/sec.
- Large dataset: 1B rows → 10k random rows.
- Streaming analytics: estimating top-K.
- Distributed systems: each node keeps its own sample.
```
### Weighted reservoir (Algorithm A-Res)
```ts
// Each item carries a weight w. Selection probability ∝ w.
function weightedReservoir<T>(stream: Iterable<[T, number]>, k: number): T[] {
  const heap: [number, T][] = []; // kept sorted ascending by key, so heap[0] is the minimum
  for (const [item, weight] of stream) {
    const key = Math.pow(Math.random(), 1 / weight);
    if (heap.length < k) {
      heap.push([key, item]);
      heap.sort((a, b) => a[0] - b[0]); // simple; real code would use a priority queue
    } else if (k > 0 && key > heap[0][0]) {
      heap[0] = [key, item];
      heap.sort((a, b) => a[0] - b[0]);
    }
  }
  return heap.map(([_, item]) => item);
}
```
→ Items with larger weights are selected more often.
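With very small weights, `u^(1/w)` can underflow to 0 and make keys tie. A possible variant, sketched here under the assumption that only the ordering of keys matters, compares `log(u) / w` instead. The name `weightedReservoirLog`, the injectable `rand` parameter, and the rule of skipping non-positive weights are illustrative choices, not part of A-Res itself.

```typescript
// Sketch of A-Res with log-domain keys. `rand` is an injectable source of
// uniform numbers in [0, 1); it defaults to Math.random.
function weightedReservoirLog<T>(
  stream: Iterable<[T, number]>,
  k: number,
  rand: () => number = Math.random,
): T[] {
  // Entries are kept sorted ascending by key, so kept[0] holds the smallest key.
  const kept: { key: number; item: T }[] = [];
  for (const [item, weight] of stream) {
    if (!(weight > 0)) continue; // assumption: skip zero, negative, or NaN weights
    // log(u) / w sorts items in the same order as u^(1/w), without underflow to 0.
    const key = Math.log(rand()) / weight;
    if (kept.length < k) {
      kept.push({ key, item });
    } else if (k > 0 && key > kept[0].key) {
      kept[0] = { key, item }; // evict the current smallest key
    } else {
      continue; // nothing changed, no need to re-sort
    }
    kept.sort((a, b) => a.key - b.key);
  }
  return kept.map((entry) => entry.item);
}
```

The re-sort keeps the sketch short. For a large k, a binary heap would avoid the repeated O(k log k) sort.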
### Distributed reservoir
```
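Each of the N nodes keeps its own reservoir of K items.
The aggregator gathers the N×K candidates and picks the final K.
(If nodes saw different stream sizes, weight each node by its item count.)
→ The sampling pattern used in MapReduce.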
```
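The aggregation step could look roughly like the sketch below. It assumes each node reports its reservoir together with the total number of items it saw, and that each reservoir holds min(k, seen) items. `NodeSample`, `mergeReservoirs`, and the `rand` parameter are hypothetical names for illustration.

```typescript
// One node's contribution: its local reservoir plus how many items it saw in total.
interface NodeSample<T> {
  reservoir: T[];
  seen: number;
}

// Merge per-node reservoirs into one sample of size k.
// Each output slot picks a node with probability proportional to the number of
// that node's items not yet drawn, then takes a random remaining element from
// that node's reservoir.
function mergeReservoirs<T>(
  nodes: NodeSample<T>[],
  k: number,
  rand: () => number = Math.random,
): T[] {
  const pools = nodes.map((n) => ({ items: [...n.reservoir], remaining: n.seen }));
  const merged: T[] = [];
  while (merged.length < k) {
    const candidates = pools.filter((p) => p.items.length > 0 && p.remaining > 0);
    const total = candidates.reduce((sum, p) => sum + p.remaining, 0);
    if (total === 0) break; // fewer than k items exist overall
    let r = rand() * total;
    let chosen = candidates[candidates.length - 1]; // fallback for rounding at the edge
    for (const p of candidates) {
      if (r < p.remaining) {
        chosen = p;
        break;
      }
      r -= p.remaining;
    }
    const idx = Math.floor(rand() * chosen.items.length);
    merged.push(chosen.items.splice(idx, 1)[0]);
    chosen.remaining--;
  }
  return merged;
}
```

Picking K uniformly from the N×K candidates without the `seen` counts would over-represent nodes that saw little traffic.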
### Datadog log sample
```python
# Each trace can have a different sample rate.
from ddtrace import tracer

@tracer.wrap()
def handle(req):
    # Datadog samples only the high-volume traces.
    pass
```
→ The APM keeps its own reservoir.
### Approximate quantile
```
Reservoir + sort = quantile estimate.
- 1M items → 10k sample → estimate the median.
- Exact: no (only a full sort guarantees that).
- Approximate: OK (within about 1% rank error).
```
→ T-Digest / KLL are more accurate.
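The "reservoir + sort" step can be sketched as follows. The input is assumed to be a numeric sample, for example the output of `reservoirSample` above. `estimateQuantile` is a hypothetical helper name, and the nearest-rank rule is one of several common quantile definitions.

```typescript
// Estimate the q-quantile (0 <= q <= 1) of a sample using the nearest-rank rule.
function estimateQuantile(sample: number[], q: number): number {
  if (sample.length === 0) throw new Error("empty sample");
  const sorted = [...sample].sort((a, b) => a - b); // copy so the caller's array is untouched
  const rank = Math.ceil(q * sorted.length) - 1;
  // Clamp so that q = 0 maps to the minimum and q = 1 to the maximum.
  const index = Math.min(sorted.length - 1, Math.max(0, rank));
  return sorted[index];
}
```

The estimate is only as good as the sample. A larger reservoir narrows the rank error at the cost of memory.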
### Postgres TABLESAMPLE
```sql
SELECT * FROM big_table TABLESAMPLE BERNOULLI (1); -- 1%
SELECT * FROM big_table TABLESAMPLE SYSTEM (1); -- 1% (block-level)
```
→ Sampling built into the database.
### A/B test sampling
```ts
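// murmur() is assumed to return an unsigned 32-bit hash of the string.
function shouldSample(userId: string, rate: number): boolean {
  const hash = murmur(userId);
  return hash / 0x100000000 < rate; // map the hash to [0, 1)
}
if (shouldSample(user.id, 0.05)) {
  // 5% of users
}
```
→ Hash-based and deterministic: the same user always gets the same result.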
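For a self-contained version, the sketch below swaps in 32-bit FNV-1a as a stand-in for `murmur()`. Any well-mixed 32-bit hash would serve. `fnv1a32`, `inSample`, and the optional `salt` parameter (to decorrelate separate experiments) are illustrative names, not from the original.

```typescript
// FNV-1a, 32-bit: a small non-cryptographic string hash.
// It hashes UTF-16 code units, which matches byte-wise FNV-1a only for ASCII input.
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // multiply by the FNV prime, wrapping at 32 bits
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

// Deterministic sampling decision: the same (salt, userId) pair always lands
// in the same bucket.
function inSample(userId: string, rate: number, salt = ""): boolean {
  // Dividing by 2^32 maps the hash to [0, 1).
  return fnv1a32(salt + userId) / 0x100000000 < rate;
}
```

Using a different `salt` per experiment keeps the same users from always landing in every 5% bucket.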
### Pitfall: Math.random() bias
```
JS Math.random() is not cryptographically secure.
- Fine for statistics.
- Weak for very large N (10^15+), where the 53-bit precision of a double runs out.
→ Fine for general use.
For crypto, use crypto.getRandomValues.
```
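One way to keep the choice of random source open is to pass it in as a parameter. The sketch below is a variant of the k > 1 algorithm under that assumption. `Rand`, `seededRand`, and `reservoirSampleWith` are hypothetical names, and the generator is a toy linear congruential generator meant only for reproducible tests. A source backed by `crypto.getRandomValues` could be passed in the same way.

```typescript
type Rand = () => number; // returns a uniform number in [0, 1)

// Toy linear congruential generator for reproducible tests only.
// Not suitable for production sampling or for anything security related.
function seededRand(seed: number): Rand {
  let state = seed >>> 0;
  return () => {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0; // wraps at 32 bits
    return state / 0x100000000;
  };
}

// Algorithm R with an injectable random source.
function reservoirSampleWith<T>(
  stream: Iterable<T>,
  k: number,
  rand: Rand = Math.random,
): T[] {
  const reservoir: T[] = [];
  let count = 0;
  for (const item of stream) {
    count++;
    if (reservoir.length < k) {
      reservoir.push(item);
    } else {
      const j = Math.floor(rand() * count);
      if (j < k) reservoir[j] = item;
    }
  }
  return reservoir;
}
```

With a fixed seed, two runs over the same input return the same sample, which makes the sampling path testable.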
### Streaming top-K (different problem)
```
Reservoir sampling picks a random sample.
Top-K picks the K largest.
Top-K = min-heap of size K.
```
```ts
// MinHeap is assumed to be a binary min-heap ordered by the score at index 0.
function topK<T>(stream: Iterable<[T, number]>, k: number): T[] {
  const heap = new MinHeap<[number, T]>();
  for (const [item, score] of stream) {
    if (heap.size() < k) heap.push([score, item]);
    else if (k > 0 && score > heap.top()![0]) {
      // The new score beats the smallest kept score, so swap it in.
      heap.pop();
      heap.push([score, item]);
    }
  }
  return heap.toArray().map(x => x[1]);
}
```
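The `MinHeap` used by `topK` above is not defined in this note. A minimal sketch of what it could look like follows. It assumes an array-backed binary heap that takes a comparator in its constructor, so the call would become `new MinHeap<[number, T]>((a, b) => a[0] - b[0])`. `topKWithHeap` repeats the function with that change.

```typescript
// Minimal array-backed binary min-heap. The comparator decides the order.
class MinHeap<E> {
  private data: E[] = [];
  private compare: (a: E, b: E) => number;

  constructor(compare: (a: E, b: E) => number) {
    this.compare = compare;
  }

  size(): number { return this.data.length; }
  top(): E | undefined { return this.data[0]; }
  toArray(): E[] { return [...this.data]; } // heap order, not sorted

  push(value: E): void {
    this.data.push(value);
    let i = this.data.length - 1;
    while (i > 0) { // sift up
      const parent = (i - 1) >> 1;
      if (this.compare(this.data[i], this.data[parent]) >= 0) break;
      [this.data[i], this.data[parent]] = [this.data[parent], this.data[i]];
      i = parent;
    }
  }

  pop(): E | undefined {
    const root = this.data[0];
    const last = this.data.pop();
    if (this.data.length > 0 && last !== undefined) {
      this.data[0] = last;
      let i = 0;
      for (;;) { // sift down
        const left = 2 * i + 1;
        const right = left + 1;
        let smallest = i;
        if (left < this.data.length && this.compare(this.data[left], this.data[smallest]) < 0) smallest = left;
        if (right < this.data.length && this.compare(this.data[right], this.data[smallest]) < 0) smallest = right;
        if (smallest === i) break;
        [this.data[i], this.data[smallest]] = [this.data[smallest], this.data[i]];
        i = smallest;
      }
    }
    return root;
  }
}

function topKWithHeap<T>(stream: Iterable<[T, number]>, k: number): T[] {
  const heap = new MinHeap<[number, T]>((a, b) => a[0] - b[0]);
  for (const [item, score] of stream) {
    if (heap.size() < k) heap.push([score, item]);
    else if (k > 0 && score > heap.top()![0]) {
      heap.pop();
      heap.push([score, item]);
    }
  }
  return heap.toArray().map((x) => x[1]);
}
```

The result comes back in heap order, so callers that need a ranked list would sort the K items afterwards.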
### Probabilistic data structures (related)
```
- Bloom filter: membership.
- HyperLogLog: cardinality (count distinct).
- Count-Min Sketch: frequency.
- T-Digest / KLL: quantile.
- Reservoir: random sample.
→ The trade-off between the cost of exactness and approximation.
```
→ [[CS_Probabilistic_Data_Structures]].
### ClickHouse sample
```sql
SELECT count() FROM events SAMPLE 0.1;
-- → 10% sample. Multiply the count by 10 for an approximate total.
SELECT * FROM events SAMPLE 1000;
-- → Pulls roughly 1000 rows.
```
### Approximate aggregation
```sql
-- Very large table
SELECT sum(amount) FROM orders SAMPLE 0.01;
-- → Sum over a 1% sample; multiply by 100 to estimate the total.
-- Fast but approximate.
```
### Stratified sampling
```
Proportions by user segment:
- VIP users: sample 80%
- Free users: sample 1%
→ One reservoir per segment.
```
```ts
const reservoirs = new Map<string, T[]>();
for (const item of stream) {
  const segment = item.segment;
  if (!reservoirs.has(segment)) reservoirs.set(segment, []);
  // ... reservoir per segment
}
```
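The per-segment idea can be filled out along these lines. Since a reservoir holds a fixed number of items, this sketch gives each segment a capacity rather than a percentage. `StratifiedReservoir`, `sizeFor`, and the capacities in the usage example are hypothetical.

```typescript
// Keeps one Algorithm R reservoir per segment, each with its own capacity.
class StratifiedReservoir<T> {
  private reservoirs = new Map<string, { items: T[]; seen: number }>();
  private sizeFor: (segment: string) => number;
  private rand: () => number;

  constructor(sizeFor: (segment: string) => number, rand: () => number = Math.random) {
    this.sizeFor = sizeFor;
    this.rand = rand;
  }

  add(segment: string, item: T): void {
    let state = this.reservoirs.get(segment);
    if (!state) {
      state = { items: [], seen: 0 };
      this.reservoirs.set(segment, state);
    }
    state.seen++;
    const k = this.sizeFor(segment);
    if (state.items.length < k) {
      state.items.push(item); // fill phase for this segment
    } else {
      // Replace a random slot with probability k / seen, counted per segment.
      const j = Math.floor(this.rand() * state.seen);
      if (j < k) state.items[j] = item;
    }
  }

  // Returns a copy of the current sample for a segment (empty if never seen).
  sample(segment: string): T[] {
    return [...(this.reservoirs.get(segment)?.items ?? [])];
  }
}

// Usage: a larger reservoir for VIP traffic, a smaller one for everything else.
const strat = new StratifiedReservoir<number>((seg) => (seg === "vip" ? 3 : 1));
for (let i = 0; i < 10; i++) {
  strat.add("vip", i);
  strat.add("free", 100 + i);
}
```

Each segment is sampled uniformly within itself, so a small segment cannot be crowded out by a large one.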
### Time-based sample
```
- The first N events of each minute as the sample.
- N random events per hour.
- Continuous reservoir with decay (older items get lower weight).
```
### When to use it?
```
✓ Stream / large dataset.
✓ Memory limit.
✓ 1-pass.
✓ Approximate is OK.
✗ Exact results required.
✗ Bias is critical.
✗ Small dataset (just shuffle).
```
### Implementation references
```
- Apache Spark's sample().
- ClickHouse SAMPLE clause.
- Postgres TABLESAMPLE.
- Datadog / Grafana log sampling.
- Most APM (NewRelic, Sentry) trace sampling.
```
### Real-world
```
1B log line → keep 100k:
- Reservoir K=100k.
- Memory: 100k × log line size, which is small.
- 1 pass.
- Further analysis runs as post-processing.
```
### LLM dataset sampling
```python
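# 1B web pages → 100M for training.
# Quality filter + reservoir.
# score(), threshold, and reservoir_add() are placeholders for project-specific code.
for page in pages:
    quality = score(page)
    if quality > threshold:
        reservoir_add(page, weight=quality)
```
→ A Common Crawl style dataset.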
## 🤔 Decision criteria
| Task | Recommendation |
|---|---|
| Stream sample | Reservoir |
| Weighted | A-Res |
| Top-K | Min-heap |
| Quantile | T-Digest / KLL |
| Cardinality | HyperLogLog |
| Membership | Bloom filter |
| Exact random subset | Full load + shuffle |
| DB sample | TABLESAMPLE |
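## ❌ Anti-patterns
- **Load everything into memory and shuffle**: OOM on large datasets.
- **Taking the first N as the sample**: biased (head only).
- **Math.random() for crypto**: not acceptable.
- **Using a plain reservoir on weighted data**: biased.
- **Distributed sampling that does not preserve segments**: stratification is lost.
- **Assuming exactness after sampling**: results are approximate only.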
## 🤖 LLM usage hints
- Algorithm R (k=1, k>1) is the standard.
- Weighted = A-Res.
- 1-pass + memory O(K).
- A member of the probabilistic family (Bloom, HLL, ...).
## 🔗 Related documents
- [[CS_Probabilistic_Data_Structures]]
- [[CS_Bloom_Filter]]
- [[CS_Time_Series_Algorithms]]