| id | title | category | status | source_trust_level | verification_status | created_at | updated_at | tags | tech_stack | applied_in | aliases |
|---|---|---|---|---|---|---|---|---|---|---|---|
| cs-reservoir-sampling | Reservoir Sampling: a random sample from a stream | Coding | draft | B | conceptual | 2026-05-09 | 2026-05-09 | | | | |
Reservoir Sampling
Draw K random samples from a stream whose size is unknown. The standard method is Algorithm R. Typical uses: log sampling, A/B tests, and representative subsets of large datasets.
📖 Key concepts
- The total size is unknown, or the data does not fit in memory.
- Single pass.
- Every element is selected with equal probability.
- Memory: O(K).
💻 Code patterns
Algorithm R (k=1)
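function sampleOne<T>(stream: Iterable<T>): T | null {
  let sample: T | null = null;
  let count = 0;
  for (const item of stream) {
    count++;
    // Replace the current sample with probability 1/count.
    if (Math.random() < 1 / count) {
      sample = item;
    }
  }
  return sample;
}
→ Each item ends up as the sample with probability 1/n, so the final result is uniform.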
k > 1
function reservoirSample<T>(stream: Iterable<T>, k: number): T[] {
  const reservoir: T[] = [];
  let count = 0;
  for (const item of stream) {
    count++;
    if (reservoir.length < k) {
      // Fill phase: the first k items always enter.
      reservoir.push(item);
    } else {
      // j is uniform in [0, count), so the item enters with probability k/count.
      const j = Math.floor(Math.random() * count);
      if (j < k) reservoir[j] = item;
    }
  }
  return reservoir;
}
→ A uniform random sample of K items.
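As a quick sanity check of the uniformity claim, the sketch below runs the sampler many times over a small stream and counts how often each item is kept. `inclusionFrequencies` and the trial count are illustrative choices, not part of the original note; `reservoirSample` is repeated only so the block runs on its own.

```typescript
// Algorithm R for k > 1, repeated here so the block is self-contained.
function reservoirSample<T>(stream: Iterable<T>, k: number): T[] {
  const reservoir: T[] = [];
  let count = 0;
  for (const item of stream) {
    count++;
    if (reservoir.length < k) {
      reservoir.push(item);
    } else {
      const j = Math.floor(Math.random() * count);
      if (j < k) reservoir[j] = item;
    }
  }
  return reservoir;
}

// Hypothetical helper: for the stream 0..n-1, return the fraction of trials
// in which each item appears in the sample.
function inclusionFrequencies(n: number, k: number, trials: number): number[] {
  const hits: number[] = new Array<number>(n).fill(0);
  const stream = Array.from({ length: n }, (_, i) => i);
  for (let t = 0; t < trials; t++) {
    for (const item of reservoirSample(stream, k)) {
      hits[item] += 1;
    }
  }
  return hits.map((h) => h / trials);
}

// Each entry should land near k/n = 0.4; exact values differ from run to run.
const freqs = inclusionFrequencies(5, 2, 50000);
console.log(freqs.map((f) => f.toFixed(3)));
```

If any position (for example the head of the stream) showed a clearly higher frequency, that would point to a bug in the replacement step.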
Proof (intuition)
n items, k=1 sample.
P(item i is the final sample) = ?
= P(Math.random() < 1/i at step i) (item i becomes the sample)
× P(Math.random() >= 1/j for every j > i) (the sample is not replaced afterwards)
= 1/i × i/(i+1) × ... × (n-1)/n
= 1/n ✓
→ Each item has probability 1/n. Uniform.
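The same telescoping argument extends to k > 1. The derivation below is a sketch that assumes the k > 1 variant of Algorithm R given in this note: item j > k enters with probability k/j and overwrites a uniformly chosen slot, so a specific resident is evicted at step j with probability (k/j)(1/k) = 1/j.

```latex
\begin{aligned}
i > k:\quad
P(\text{item } i \text{ in final reservoir})
  &= \frac{k}{i} \prod_{j=i+1}^{n} \left(1 - \frac{1}{j}\right)
   = \frac{k}{i} \cdot \frac{i}{i+1} \cdots \frac{n-1}{n}
   = \frac{k}{n} \\[4pt]
i \le k:\quad
P(\text{item } i \text{ in final reservoir})
  &= \prod_{j=k+1}^{n} \left(1 - \frac{1}{j}\right)
   = \frac{k}{k+1} \cdots \frac{n-1}{n}
   = \frac{k}{n}
\end{aligned}
```

Both cases give k/n, so every item is equally likely to be in the final reservoir.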
Use cases
- Log sampling: 10k QPS of logs → 100 samples / sec.
- Large datasets: 1B rows → 10k random rows.
- Streaming analytics: estimating top-K.
- Distributed systems: each node samples locally.
Weighted reservoir (Algorithm A-Res)
// Each item has a weight w. Selection probability ∝ w.
function weightedReservoir<T>(stream: Iterable<[T, number]>, k: number): T[] {
  const heap: [number, T][] = []; // kept sorted ascending by key, so heap[0] is the minimum
  for (const [item, weight] of stream) {
    // A-Res key: u^(1/w) with u uniform in [0, 1); keep the k largest keys.
    const key = Math.pow(Math.random(), 1 / weight);
    if (heap.length < k) {
      heap.push([key, item]);
      heap.sort((a, b) => a[0] - b[0]); // simple; a real implementation would use a priority queue
    } else if (key > heap[0][0]) {
      heap[0] = [key, item];
      heap.sort((a, b) => a[0] - b[0]);
    }
  }
  return heap.map(([_, item]) => item);
}
→ Items with larger weights are selected more often.
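The sort-based version re-sorts on every insertion. As a rough sketch of the priority-queue variant mentioned in the code comment, the block below keeps the keys in a binary min-heap. The names `Entry`, `siftUp`, `siftDown`, and `weightedReservoirHeap` are illustrative, and skipping non-positive weights is an assumption of this sketch rather than part of A-Res itself.

```typescript
type Entry<T> = { key: number; item: T };

// Restore the min-heap property by moving index i up.
function siftUp<T>(heap: Entry<T>[], i: number): void {
  while (i > 0) {
    const parent = (i - 1) >> 1;
    if (heap[parent].key <= heap[i].key) break;
    [heap[parent], heap[i]] = [heap[i], heap[parent]];
    i = parent;
  }
}

// Restore the min-heap property by moving index i down.
function siftDown<T>(heap: Entry<T>[], i: number): void {
  const n = heap.length;
  for (;;) {
    const left = 2 * i + 1;
    const right = left + 1;
    let smallest = i;
    if (left < n && heap[left].key < heap[smallest].key) smallest = left;
    if (right < n && heap[right].key < heap[smallest].key) smallest = right;
    if (smallest === i) return;
    [heap[smallest], heap[i]] = [heap[i], heap[smallest]];
    i = smallest;
  }
}

// A-Res: keep the k items with the largest keys u^(1/w), u uniform in [0, 1).
function weightedReservoirHeap<T>(stream: Iterable<[T, number]>, k: number): T[] {
  const heap: Entry<T>[] = [];
  if (k <= 0) return [];
  for (const [item, weight] of stream) {
    if (!(weight > 0)) continue; // assumption: ignore zero, negative, or NaN weights
    const key = Math.pow(Math.random(), 1 / weight);
    if (heap.length < k) {
      heap.push({ key, item });
      siftUp(heap, heap.length - 1);
    } else if (key > heap[0].key) {
      heap[0] = { key, item }; // replace the current minimum key
      siftDown(heap, 0);
    }
  }
  return heap.map((e) => e.item);
}

// Usage: "b" carries 9x the weight of "a", so with k = 1 it should win about 90% of the time.
const weightedItems: [string, number][] = [["a", 1], ["b", 9]];
console.log(weightedReservoirHeap(weightedItems, 1));
```

Each update costs O(log k) instead of O(k log k), which matters once k reaches the thousands.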
Distributed reservoir
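Each of the N nodes keeps its own reservoir of K items.
The aggregator collects the N×K candidates and reduces them to the final K. It has to weight each node by how many items that node saw; otherwise nodes with small streams are over-represented.
→ The sampling pattern used in MapReduce-style jobs.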
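One way the aggregation step could look, as a sketch: each node reports its reservoir together with the number of items it saw, and the aggregator merges two summaries at a time, choosing the source of each output slot in proportion to the remaining stream sizes. `NodeSample` and `mergeReservoirs` are hypothetical names introduced for this example.

```typescript
// A node's summary: its local uniform reservoir plus how many items it saw in total.
interface NodeSample<T> {
  reservoir: T[];
  seen: number;
}

// Merge two uniform reservoirs into a uniform sample of size k over both streams.
// Assumes each reservoir holds min(k, seen) items drawn uniformly from its stream.
function mergeReservoirs<T>(a: NodeSample<T>, b: NodeSample<T>, k: number): NodeSample<T> {
  const poolA = [...a.reservoir];
  const poolB = [...b.reservoir];
  let remainingA = a.seen;
  let remainingB = b.seen;
  const merged: T[] = [];
  while (merged.length < k && remainingA + remainingB > 0) {
    // Choose the source stream with probability proportional to its remaining size.
    const fromA = Math.random() * (remainingA + remainingB) < remainingA;
    const pool = fromA ? poolA : poolB;
    if (pool.length === 0) break; // input violated the assumption above
    // Take a random remaining candidate from that node's reservoir, without replacement.
    const idx = Math.floor(Math.random() * pool.length);
    merged.push(pool[idx]);
    pool[idx] = pool[pool.length - 1];
    pool.pop();
    if (fromA) remainingA--;
    else remainingB--;
  }
  return { reservoir: merged, seen: a.seen + b.seen };
}

// Usage: node A saw 900 items, node B saw 100; about 90% of the merged sample should come from A.
const nodeA: NodeSample<string> = { reservoir: new Array<string>(10).fill("A"), seen: 900 };
const nodeB: NodeSample<string> = { reservoir: new Array<string>(10).fill("B"), seen: 100 };
console.log(mergeReservoirs(nodeA, nodeB, 10).reservoir);
```

For N nodes the same function can be applied pairwise (for example with `reduce`), since the result carries the combined `seen` count.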
Datadog log sample
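# Each trace can have a different sample rate.
@trace  # placeholder decorator for illustration, not a specific Datadog API
def handle(req):
    # Datadog samples only the high-volume traces
    pass
→ The APM keeps its own reservoir.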
Approximate quantile
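Reservoir + sort = quantile estimate.
- 1M items → 10k sample → estimate the median.
- Not exact (only a full sort guarantees the exact value).
- Fine as an approximation (roughly within 1% in rank for a 10k sample).
→ T-Digest / KLL are more accurate.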
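A minimal sketch of the reservoir-plus-sort idea, assuming a numeric stream. `approxQuantile` is a hypothetical helper, and picking the index by `floor(q * length)` is one simple convention among several; `reservoirSample` is repeated so the block is self-contained.

```typescript
// Algorithm R, repeated here so the block runs on its own.
function reservoirSample<T>(stream: Iterable<T>, k: number): T[] {
  const reservoir: T[] = [];
  let count = 0;
  for (const item of stream) {
    count++;
    if (reservoir.length < k) {
      reservoir.push(item);
    } else {
      const j = Math.floor(Math.random() * count);
      if (j < k) reservoir[j] = item;
    }
  }
  return reservoir;
}

// Estimate the q-quantile (q in [0, 1]) from a uniform sample of the stream.
function approxQuantile(stream: Iterable<number>, q: number, sampleSize: number): number {
  const sample = reservoirSample(stream, sampleSize);
  if (sample.length === 0) return NaN;
  sample.sort((a, b) => a - b);
  const idx = Math.min(sample.length - 1, Math.floor(q * sample.length));
  return sample[idx];
}

// Usage: the true median of 0..999999 is about 500000; the estimate varies per run.
const values = Array.from({ length: 1000000 }, (_, i) => i);
console.log(approxQuantile(values, 0.5, 10000));
```

The error shrinks with the square root of the sample size, so going from 10k to 1M samples buys about one extra digit of rank accuracy.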
Postgres TABLESAMPLE
SELECT * FROM big_table TABLESAMPLE BERNOULLI (1); -- 1%
SELECT * FROM big_table TABLESAMPLE SYSTEM (1); -- 1% (block-level)
→ Sampling built into the database.
A/B test sampling
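function shouldSample(userId: string, rate: number): boolean {
  // murmur is assumed to return an unsigned 32-bit integer hash.
  const hash = murmur(userId);
  // Divide by 2^32 so the bucket lies in [0, 1).
  return hash / 0x100000000 < rate;
}
if (shouldSample(user.id, 0.05)) {
  // 5% of users
}
→ Hash-based and deterministic: the same user always gets the same result.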
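The snippet above depends on an external `murmur` function. As a self-contained sketch of the same idea, the block below swaps in a small FNV-1a style 32-bit hash (a different hash from MurmurHash, chosen here only because it fits in a few lines). `fnv1a32` and `shouldSampleUser` are illustrative names.

```typescript
// FNV-1a style 32-bit hash over UTF-16 code units; returns an unsigned 32-bit integer.
function fnv1a32(s: string): number {
  let h = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193); // FNV prime, 32-bit wrapping multiply
  }
  return h >>> 0;
}

// Deterministic bucketing: map the hash to [0, 1) and compare against the rate.
function shouldSampleUser(userId: string, rate: number): boolean {
  return fnv1a32(userId) / 0x100000000 < rate;
}

// Usage: the decision for a given user id never changes between calls.
console.log(shouldSampleUser("user-42", 0.05) === shouldSampleUser("user-42", 0.05));
```

Because the decision depends only on the id, a user stays in the same group across sessions and servers without any shared state.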
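Pitfall: Math.random() bias
JS Math.random() is not cryptographically secure.
- Fine for statistical use.
- Poor for very large N (10^15+): a double carries at most 53 bits (2^53 ≈ 9 × 10^15), so 1/count approaches the generator's resolution.
→ Fine for general use.
For cryptographic needs, use crypto.getRandomValues.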
Streaming top-K (different problem)
A reservoir gives a random sample.
Top-K gives the K largest items.
Top-K = min-heap of size K.
function topK<T>(stream: Iterable<[T, number]>, k: number): T[] {
  const heap = new MinHeap<[number, T]>();
  for (const [item, score] of stream) {
    if (heap.size() < k) heap.push([score, item]);
    else if (score > heap.top()![0]) {
      // The new score beats the smallest kept score: evict it.
      heap.pop();
      heap.push([score, item]);
    }
  }
  return heap.toArray().map(x => x[1]);
}
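`topK` relies on a `MinHeap` class that the note does not define. The sketch below supplies one possible minimal implementation with the four methods the function calls (`size`, `top`, `push`, `pop`, `toArray`), plus a copy of `topK` with a guard for k <= 0 added. The class layout is an assumption for illustration, not a specific library's API.

```typescript
// Minimal binary min-heap ordered by the first element of each [score, item] pair.
class MinHeap<E extends [number, unknown]> {
  private data: E[] = [];

  size(): number { return this.data.length; }
  top(): E | undefined { return this.data[0]; }
  toArray(): E[] { return [...this.data]; }

  push(entry: E): void {
    this.data.push(entry);
    let i = this.data.length - 1;
    while (i > 0) {
      const parent = (i - 1) >> 1;
      if (this.data[parent][0] <= this.data[i][0]) break;
      [this.data[parent], this.data[i]] = [this.data[i], this.data[parent]];
      i = parent;
    }
  }

  pop(): E | undefined {
    const top = this.data[0];
    const last = this.data.pop();
    if (this.data.length > 0 && last !== undefined) {
      // Move the last entry to the root and sift it down.
      this.data[0] = last;
      let i = 0;
      for (;;) {
        const left = 2 * i + 1;
        const right = left + 1;
        let smallest = i;
        if (left < this.data.length && this.data[left][0] < this.data[smallest][0]) smallest = left;
        if (right < this.data.length && this.data[right][0] < this.data[smallest][0]) smallest = right;
        if (smallest === i) break;
        [this.data[smallest], this.data[i]] = [this.data[i], this.data[smallest]];
        i = smallest;
      }
    }
    return top;
  }
}

function topK<T>(stream: Iterable<[T, number]>, k: number): T[] {
  if (k <= 0) return [];
  const heap = new MinHeap<[number, T]>();
  for (const [item, score] of stream) {
    if (heap.size() < k) heap.push([score, item]);
    else if (score > heap.top()![0]) {
      heap.pop();
      heap.push([score, item]);
    }
  }
  return heap.toArray().map((x) => x[1]);
}

// Usage: the result holds the two highest-scoring items, in heap order (not sorted).
const scored: [string, number][] = [["a", 5], ["b", 1], ["c", 9], ["d", 7]];
console.log(topK(scored, 2));
```

The heap root is always the weakest of the current top K, so each incoming item needs only one comparison before an O(log K) update.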
Probabilistic data structures (related)
- Bloom filter: membership.
- HyperLogLog: cardinality (count distinct).
- Count-Min Sketch: frequency.
- T-Digest / KLL: quantile.
- Reservoir: random sample.
→ The trade-off is the cost of exactness vs. an approximation.
→ See CS_Probabilistic_Data_Structures.
ClickHouse sample
SELECT count() FROM events SAMPLE 0.1;
-- → 10% sample. Approximate total = count × 10.
SELECT * FROM events SAMPLE 1000;
-- → Returns roughly 1000 rows.
Approximate aggregation
-- Very large table
SELECT sum(amount) FROM orders SAMPLE 0.01;
-- → Sum over a 1% sample; multiply by 100 to estimate the total.
-- Fast, but approximate.
Stratified sampling
Sample each user segment at its own rate:
- VIP users: sample 80%
- Free users: sample 1%
→ One reservoir per segment.
const reservoirs = new Map<string, T[]>();
for (const item of stream) {
  const segment = item.segment;
  if (!reservoirs.has(segment)) reservoirs.set(segment, []);
  // ... reservoir per segment
}
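A fuller sketch of the per-segment idea, under one simplifying assumption: each segment gets a fixed reservoir size rather than a percentage, since a percentage would require knowing the segment's total size in advance. `Segmented`, `stratifiedReservoir`, and the `sizeFor` callback are hypothetical names for this example.

```typescript
interface Segmented {
  segment: string;
}

// Run an independent Algorithm R reservoir for every segment seen in the stream.
function stratifiedReservoir<T extends Segmented>(
  stream: Iterable<T>,
  sizeFor: (segment: string) => number, // reservoir size per segment (assumed given)
): Map<string, T[]> {
  const reservoirs = new Map<string, T[]>();
  const counts = new Map<string, number>();
  for (const item of stream) {
    const k = sizeFor(item.segment);
    const count = (counts.get(item.segment) ?? 0) + 1;
    counts.set(item.segment, count);

    let reservoir = reservoirs.get(item.segment);
    if (reservoir === undefined) {
      reservoir = [];
      reservoirs.set(item.segment, reservoir);
    }
    if (reservoir.length < k) {
      reservoir.push(item);
    } else {
      // Per-segment count, so each segment is sampled uniformly on its own.
      const j = Math.floor(Math.random() * count);
      if (j < k) reservoir[j] = item;
    }
  }
  return reservoirs;
}

// Usage: 100 VIP and 1000 free users; keep 80 VIP and 10 free.
const users = [
  ...Array.from({ length: 100 }, (_, i) => ({ id: i, segment: "vip" })),
  ...Array.from({ length: 1000 }, (_, i) => ({ id: 100 + i, segment: "free" })),
];
const bySegment = stratifiedReservoir(users, (s) => (s === "vip" ? 80 : 10));
console.log(bySegment.get("vip")?.length, bySegment.get("free")?.length);
```

Keeping a separate count per segment is the important detail: using one global counter would make rare segments far less likely to be replaced than common ones.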
Time-based sample
- The first N events of each minute as the sample.
- A random N per hour.
- A continuous reservoir with decay (older items get a lower weight).
When to use?
✓ Streams / large datasets.
✓ Limited memory.
✓ Single pass.
✓ An approximation is acceptable.
✗ Exact results are required.
✗ Bias is critical.
✗ Small datasets (just shuffle).
Implementation references
- Apache Spark's sample().
- ClickHouse SAMPLE clause.
- Postgres TABLESAMPLE.
- Datadog / Grafana log sampling.
- Most APM (NewRelic, Sentry) trace sampling.
Real-world
1B log lines → keep 100k:
- Reservoir with K=100k.
- Memory: 100k × log line size, which is small.
- Single pass.
- Further analysis runs on the sample as post-processing.
LLM dataset sampling
# 1B web pages → 100M for training.
# Quality filter + reservoir.
# pages, score, threshold, and reservoir_add are placeholders.
for page in pages:
    quality = score(page)
    if quality > threshold:
        reservoir_add(page, weight=quality)
→ A Common Crawl-style dataset.
🤔 Decision criteria
| Task | Recommendation |
|---|---|
| Stream sample | Reservoir |
| Weighted | A-Res |
| Top-K | Min-heap |
| Quantile | T-Digest / KLL |
| Cardinality | HyperLogLog |
| Membership | Bloom filter |
| Exact random subset | Full load + shuffle |
| DB sample | TABLESAMPLE |
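❌ Anti-patterns
- Loading everything into memory and shuffling: OOM on large datasets.
- Taking the first N as the sample: biased (head only).
- Math.random() for crypto: not acceptable.
- Using an unweighted reservoir when items carry weights: biased.
- Distributed sampling that does not preserve segments: stratification is lost.
- Treating sampled results as exact: they are approximate only.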
🤖 Hints for LLM use
- Algorithm R (k=1, k>1) is the standard.
- Weighted = A-Res.
- Single pass + O(K) memory.
- A member of the probabilistic family (Bloom, HLL, ...).