---
id: cs-reservoir-sampling
title: "Reservoir Sampling: random samples from a stream"
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [cs, sampling, stream, vibe-coding]
tech_stack: { language: "TS", applicable_to: ["Backend", "CS"] }
applied_in: []
aliases: [reservoir sampling, stream sampling, Algorithm R, weighted reservoir, log sampling]
---

# Reservoir Sampling

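> Draw K uniform random samples from a stream of unknown size. The standard method is **Algorithm R**. Typical uses: log sampling, A/B tests, and representative subsets of large datasets.

## 📖 Core concepts
- The total size is unknown, or the data does not fit in memory.
- Single pass over the data.
- Every element ends up in the sample with the same probability (K/n after n elements).
- Memory: O(K).

## 💻 Code patterns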

### Algorithm R (k=1)
```ts
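function sampleOne<T>(stream: Iterable<T>): T | null {
  let sample: T | null = null;
  let count = 0;

  for (const item of stream) {
    count++;
    // Replace the current sample with probability 1/count.
    if (Math.random() < 1 / count) {
      sample = item;
    }
  }

  return sample; // null only if the stream was empty
}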
```

→ After n items, each item is the current sample with probability 1/n, so the final choice is uniform.

### k > 1
```ts
function reservoirSample<T>(stream: Iterable<T>, k: number): T[] {
  const reservoir: T[] = [];
  let count = 0;

  for (const item of stream) {
    count++;
    if (reservoir.length < k) {
      reservoir.push(item);
    } else {
      const j = Math.floor(Math.random() * count);
      if (j < k) reservoir[j] = item;
    }
  }

  return reservoir;
}
```

→ A uniform random sample of K items.
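
For tests it can help to make the random source injectable so that a run can be replayed. The sketch below is a variant of the function above, not a replacement for it; the `rng` parameter, the `mulberry32` seeded generator, and the `naturals` helper are assumptions added for illustration.

```ts
// Small seeded PRNG (mulberry32) returning floats in [0, 1).
// Illustrative only; not cryptographically secure.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Same logic as reservoirSample, with the random source passed in.
function reservoirSampleWith<T>(
  stream: Iterable<T>,
  k: number,
  rng: () => number = Math.random,
): T[] {
  const reservoir: T[] = [];
  let count = 0;
  for (const item of stream) {
    count++;
    if (reservoir.length < k) {
      reservoir.push(item);
    } else {
      const j = Math.floor(rng() * count); // uniform in [0, count)
      if (j < k) reservoir[j] = item;
    }
  }
  return reservoir;
}

// Hypothetical stream: a generator, so the full sequence is never materialized.
function* naturals(n: number): Generator<number> {
  for (let i = 0; i < n; i++) yield i;
}

// Same seed and same stream give the same sample.
const firstRun = reservoirSampleWith(naturals(1000), 5, mulberry32(42));
const secondRun = reservoirSampleWith(naturals(1000), 5, mulberry32(42));
```

Passing `Math.random` (the default) restores the original behavior; the seed only matters for reproducibility.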

### Proof (intuition)
```
n items, k=1 sample.

P(item i is the final sample)
= P(Math.random() < 1/i at step i)            (item i is selected)
× P(Math.random() >= 1/j for every j > i)     (the sample is not replaced afterwards)
= 1/i × i/(i+1) × ... × (n-1)/n
= 1/n ✓
```

→ Every item has probability 1/n, so the sample is uniform.
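
The same argument extends to k > 1. A short derivation, using the symbols n, k, i, j from the proof above and assuming n ≥ k: item i enters the reservoir with probability min(1, k/i), and at each later step j > k the new item is accepted with probability k/j and overwrites one of the k slots uniformly, so a given resident is evicted with probability 1/j.

```latex
\begin{aligned}
P(\text{item } i \text{ is in the final reservoir})
  &= \min\!\left(1, \frac{k}{i}\right)
     \times \prod_{j=\max(i,k)+1}^{n} \left(1 - \frac{1}{j}\right) \\
  &= \frac{k}{\max(i,k)} \times \frac{\max(i,k)}{n} \\
  &= \frac{k}{n}
\end{aligned}
```

The product telescopes exactly as in the k=1 case, and setting k=1 recovers 1/n.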

### Use case
```
- Log sampling: 10k QPS of logs → 100 samples per second.
- Large datasets: 1B rows → 10k random rows.
- Streaming analytics: estimating top-K from a sample.
- Distributed systems: each node keeps its own sample.
```

### Weighted reservoir (Algorithm A-Res)
```ts
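// Each item carries a weight w > 0; heavier items are more likely to be kept
// (weighted sampling without replacement, Efraimidis-Spirakis A-Res).
function weightedReservoir<T>(stream: Iterable<[T, number]>, k: number): T[] {
  if (k <= 0) return [];
  const heap: [number, T][] = [];   // kept sorted ascending by key; heap[0] has the smallest key

  for (const [item, weight] of stream) {
    if (!(weight > 0)) continue;    // skip zero, negative, or NaN weights
    const key = Math.pow(Math.random(), 1 / weight);
    if (heap.length < k) {
      heap.push([key, item]);
      heap.sort((a, b) => a[0] - b[0]);  // simple; a real implementation uses a priority queue
    } else if (key > heap[0][0]) {
      heap[0] = [key, item];             // evict the entry with the smallest key
      heap.sort((a, b) => a[0] - b[0]);
    }
  }

  return heap.map(([, item]) => item);
}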
```

→ Items with larger weights are selected more often.

### Distributed reservoir
```
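Each of the N nodes keeps its own reservoir of K items.
The aggregator collects the N×K candidates and reduces them to the final K.
If nodes see different numbers of items, weight each node's candidates by its item count; picking uniformly among the candidates would over-represent low-traffic nodes.

→ The sampling pattern used in MapReduce-style jobs.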
```
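
One way to do the aggregation step without bias is to merge reservoirs pairwise, drawing from each side in proportion to how many items that node has seen. The sketch below assumes each node reports its reservoir together with its item count; the `NodeSample` shape, the `mergeReservoirs` name, and the `rng` parameter are hypothetical and only for illustration.

```ts
interface NodeSample<T> {
  items: T[];   // reservoir contents: min(k, seen) items, uniform over that node's stream
  seen: number; // how many stream items the node observed in total
}

// Merge two reservoirs into one uniform sample of up to k items over both streams.
function mergeReservoirs<T>(
  a: NodeSample<T>,
  b: NodeSample<T>,
  k: number,
  rng: () => number = Math.random,
): NodeSample<T> {
  const poolA = [...a.items];
  const poolB = [...b.items];
  let remainingA = a.seen;
  let remainingB = b.seen;
  const merged: T[] = [];

  while (merged.length < k && remainingA + remainingB > 0) {
    // Choose the side in proportion to the items it still represents.
    const fromA = rng() * (remainingA + remainingB) < remainingA;
    const pool = fromA ? poolA : poolB;
    if (pool.length === 0) break; // only reachable if a reservoir was smaller than min(k, seen)
    const idx = Math.floor(rng() * pool.length);
    merged.push(...pool.splice(idx, 1)); // move one random candidate, without replacement
    if (fromA) remainingA--;
    else remainingB--;
  }

  return { items: merged, seen: a.seen + b.seen };
}

// A busy node and a quiet node: the result mostly comes from the busy one.
const busyNode: NodeSample<string> = { items: ["a1", "a2", "a3"], seen: 1000 };
const quietNode: NodeSample<string> = { items: ["b1", "b2"], seen: 2 };
const mergedSample = mergeReservoirs(busyNode, quietNode, 3);
```

For N nodes, the same function can be folded over the list of node samples, since the merged result carries the combined `seen` count.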

### Datadog log sample
```python
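# Illustrative only: `trace` is a placeholder decorator, not a specific Datadog API.
def trace(fn):
    return fn

# Each trace can have a different sample rate.
@trace
def handle(req):
    # Datadog samples only the high-volume traces.
    pass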
```

→ The APM maintains its own reservoir-style sampler internally.

### Approximate quantile
```
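Reservoir + sort = quantile estimate.
- 1M items → 10k sample → estimated median.
- Not exact (only a full sort guarantees the exact value).
- Good enough as an approximation (a 10k sample typically places the median within about 1 percentile of rank).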
```

→ T-Digest and KLL sketches are more accurate for quantiles.
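
The "reservoir + sort" step can be sketched as a nearest-rank estimate over the sample. The function name `estimateQuantile` is an assumption for illustration, and the input is assumed to be a uniform sample (for example the output of the reservoir function earlier in this note). For a sample of m items, the rank of the estimated median has a standard error of about 0.5/√m, which is 0.005 for m = 10,000.

```ts
// Nearest-rank quantile estimate from a uniform sample. q is in [0, 1].
function estimateQuantile(sample: readonly number[], q: number): number {
  if (sample.length === 0) throw new Error("sample is empty");
  if (q < 0 || q > 1) throw new RangeError("q must be in [0, 1]");
  const sorted = [...sample].sort((x, y) => x - y); // numeric sort on a copy
  const rank = Math.min(sorted.length - 1, Math.floor(q * sorted.length));
  return sorted[rank];
}

const medianEstimate = estimateQuantile([5, 1, 4, 2, 3], 0.5); // 3
```

The estimate is only as good as the sample: it converges on the true quantile as the reservoir grows, but it never becomes a guarantee.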

### Postgres TABLESAMPLE
```sql
SELECT * FROM big_table TABLESAMPLE BERNOULLI (1);  -- 1%
SELECT * FROM big_table TABLESAMPLE SYSTEM (1);     -- 1% (block-level)
```

→ Sampling built into the database. Both methods return approximately 1% of the rows, not a fixed count.

### A/B test sampling
```ts
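// Placeholders: `murmur` stands for any 32-bit string hash returning an unsigned
// integer (e.g. MurmurHash3); neither it nor `user` is defined in this note.
declare function murmur(key: string): number;
declare const user: { id: string };

function shouldSample(userId: string, rate: number): boolean {
  const hash = murmur(userId);
  return (hash / 0x100000000) < rate;  // divide by 2^32 so the ratio stays in [0, 1)
}

if (shouldSample(user.id, 0.05)) {
  // about 5% of users
}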
```

→ Hash-based and deterministic: the same user always gets the same result.
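
Since `murmur` is left undefined in this note, here is a self-contained sketch of the same idea with a small stand-in hash. It uses a 32-bit FNV-1a variant over UTF-16 code units, chosen only because it fits in a few lines; it is not the hash the original snippet refers to. The `inSample` name and the `salt` parameter are assumptions for illustration.

```ts
// 32-bit FNV-1a variant over UTF-16 code units (a stand-in for a real MurmurHash).
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return hash >>> 0; // unsigned 32-bit result
}

// Deterministic membership test: the same (salt, userId) always gives the same answer.
function inSample(userId: string, rate: number, salt = ""): boolean {
  return fnv1a32(salt + userId) / 0x100000000 < rate;
}

const firstCheck = inSample("user-123", 0.05, "exp-checkout");
const secondCheck = inSample("user-123", 0.05, "exp-checkout"); // always equals firstCheck
```

Using a different `salt` per experiment keeps one experiment's 5% from being the same users as another's. For production use, a hash with better mixing is the safer choice.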

### Pitfall: Math.random() bias
```
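JS Math.random() is not cryptographically secure.
- Fine for statistical sampling.
- Unreliable for very large N (10^15+): a double carries at most 53 random bits, so Math.floor(Math.random() * N) stops being evenly distributed as N approaches 2^53.

→ OK for general use.
For cryptographic needs, use crypto.getRandomValues.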
```

### Streaming top-K (different problem)
```
Reservoir sampling returns a random sample.
Top-K returns the K largest items.

Top-K = min-heap of size K.
```

```ts
// MinHeap is assumed here: any binary min-heap ordered by the tuple's first element (the score).
function topK<T>(stream: Iterable<[T, number]>, k: number): T[] {
  const heap = new MinHeap<[number, T]>();
  for (const [item, score] of stream) {
    if (heap.size() < k) heap.push([score, item]);
    else if (score > heap.top()![0]) {
      heap.pop();
      heap.push([score, item]);
    }
  }
  return heap.toArray().map(x => x[1]);
}
```
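
The `MinHeap` class used in `topK` is not defined in this note. A minimal sketch that matches the calls made there (`push`, `pop`, `top`, `size`, `toArray`) could look like the following; ordering entries by the first tuple element is an assumption inferred from how `topK` uses it.

```ts
// Binary min-heap over [score, value] tuples, ordered by score.
class MinHeap<E extends [number, unknown]> {
  private data: E[] = [];

  size(): number { return this.data.length; }
  top(): E | undefined { return this.data[0]; }   // smallest score, or undefined if empty
  toArray(): E[] { return [...this.data]; }       // heap order, not sorted

  push(entry: E): void {
    this.data.push(entry);
    let i = this.data.length - 1;
    // Sift up until the parent is no larger than the new entry.
    while (i > 0) {
      const parent = (i - 1) >> 1;
      if (this.data[parent][0] <= this.data[i][0]) break;
      [this.data[parent], this.data[i]] = [this.data[i], this.data[parent]];
      i = parent;
    }
  }

  pop(): E | undefined {
    const root = this.data[0];
    const last = this.data.pop();
    if (this.data.length > 0 && last !== undefined) {
      this.data[0] = last;
      let i = 0;
      // Sift down: swap with the smaller child until heap order holds.
      for (;;) {
        const left = 2 * i + 1;
        const right = left + 1;
        let smallest = i;
        if (left < this.data.length && this.data[left][0] < this.data[smallest][0]) smallest = left;
        if (right < this.data.length && this.data[right][0] < this.data[smallest][0]) smallest = right;
        if (smallest === i) break;
        [this.data[smallest], this.data[i]] = [this.data[i], this.data[smallest]];
        i = smallest;
      }
    }
    return root;
  }
}

const scoreHeap = new MinHeap<[number, string]>();
scoreHeap.push([5, "e"]);
scoreHeap.push([1, "a"]);
scoreHeap.push([3, "c"]);
const lowest = scoreHeap.pop(); // [1, "a"]
```

Both `push` and `pop` are O(log K), so a top-K pass over n items costs O(n log K) instead of the O(n K log K) of re-sorting on every update.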

### Probabilistic data structures (related)
```
- Bloom filter: membership.
- HyperLogLog: cardinality (count distinct).
- Count-Min Sketch: frequency.
- T-Digest / KLL: quantile.
- Reservoir: random sample.

→ The trade-off: the cost of an exact answer vs. an approximate one.
```

→ [[CS_Probabilistic_Data_Structures]].

### ClickHouse sample
```sql
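-- Requires a MergeTree-family table created with a SAMPLE BY key.
SELECT count() FROM events SAMPLE 0.1;
-- → reads roughly a 10% sample; multiply the count by 10 to estimate the total.

SELECT * FROM events SAMPLE 1000;
-- → reads a sample of roughly 1000 rows (at least that many, not significantly more).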
```

### Approximate aggregation
```sql
-- Very large table
SELECT sum(amount) FROM orders SAMPLE 0.01;
-- → sum over a 1% sample; multiply by 100 to estimate the total.

-- Fast, but approximate.
```

### Stratified sampling
```
Different rates per user segment:
- VIP users: sample 80%.
- Free users: sample 1%.

→ One reservoir per segment.
```

```ts
const reservoirs = new Map<string, T[]>();

for (const item of stream) {
  const segment = item.segment;
  if (!reservoirs.has(segment)) reservoirs.set(segment, []);
  // ... reservoir per segment
}
```
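
The per-segment loop above is left open at the `// ...` line. One possible completion, as a sketch: each segment gets its own Algorithm R state (a count and a reservoir). The `Segmented` interface, the `kFor` callback that returns the reservoir size per segment, and the `rng` parameter are assumptions for illustration. Fixed reservoir sizes only approximate the per-segment percentages when the segment totals are roughly known in advance.

```ts
interface Segmented {
  segment: string;
}

// One independent reservoir (Algorithm R) per segment.
function stratifiedReservoir<T extends Segmented>(
  stream: Iterable<T>,
  kFor: (segment: string) => number,
  rng: () => number = Math.random,
): Map<string, T[]> {
  const state = new Map<string, { seen: number; reservoir: T[] }>();

  for (const item of stream) {
    let s = state.get(item.segment);
    if (!s) {
      s = { seen: 0, reservoir: [] };
      state.set(item.segment, s);
    }
    s.seen++;
    const k = kFor(item.segment);
    if (s.reservoir.length < k) {
      s.reservoir.push(item);
    } else {
      const j = Math.floor(rng() * s.seen); // uniform in [0, seen) within this segment
      if (j < k) s.reservoir[j] = item;
    }
  }

  const result = new Map<string, T[]>();
  state.forEach((s, segment) => result.set(segment, s.reservoir));
  return result;
}

// Hypothetical input: 100 VIP events and 1000 free events.
const sampleEvents: { segment: string; id: number }[] = [];
for (let i = 0; i < 100; i++) sampleEvents.push({ segment: "vip", id: i });
for (let i = 0; i < 1000; i++) sampleEvents.push({ segment: "free", id: i });

// Keep 80 VIP events (80%) and 10 free events (1%).
const perSegment = stratifiedReservoir(sampleEvents, (seg) => (seg === "vip" ? 80 : 10));
```

Because each segment is sampled independently, a high-volume segment cannot crowd a small one out of the sample.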

### Time-based sample
```
- The first N events of every minute as the sample.
- N random events per hour.
- Continuous reservoir with decay (older items get a lower weight).
```

### When to use?
```
✓ Streams or large datasets.
✓ Limited memory.
✓ Single pass.
✓ An approximate answer is acceptable.

✗ Exact results are required.
✗ Sampling bias is critical.
✗ Small datasets (just shuffle).
```

### Implementation references
```
- Apache Spark's sample().
- ClickHouse SAMPLE clause.
- Postgres TABLESAMPLE.
- Datadog / Grafana log sampling.
- Trace sampling in most APM tools (New Relic, Sentry).
```

### Real-world
```
1B log lines → keep 100k:
- Reservoir with K=100k.
- Memory: 100k × average line size, which is small.
- Single pass.
- Run further analysis on the sample afterwards.
```

### LLM dataset sampling
```python
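# 1B web pages -> keep 100M for training.
# Quality filter + weighted reservoir.
# `pages`, `score`, `threshold`, and `reservoir_add` are placeholders, not a real API.

for page in pages:
    quality = score(page)
    if quality > threshold:
        reservoir_add(page, weight=quality)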
```

→ Common Crawl-style datasets.

## 🤔 Decision criteria
| Task | Recommendation |
|---|---|
| Sampling from a stream | Reservoir |
| Weighted sampling | A-Res |
| Top-K | Min-heap |
| Quantiles | T-Digest / KLL |
| Cardinality | HyperLogLog |
| Membership | Bloom filter |
| Exact random subset (data fits in memory) | Load everything + shuffle |
| Sampling in the database | TABLESAMPLE |

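## ❌ Anti-patterns
- **Loading everything into memory and shuffling**: OOM on large datasets.
- **Taking the first N items as the sample**: biased (head of the stream only).
- **Using Math.random() for cryptographic purposes**: not secure.
- **Ignoring weights in the reservoir**: biased when items should be weighted.
- **Distributed sampling that does not preserve segments**: stratification is lost.
- **Treating results from a sample as exact**: they are approximate only.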

## 🤖 LLM usage hints
- Algorithm R (k=1, k>1) is the standard.
- Weighted = A-Res.
- Single pass + O(K) memory.
- A member of the probabilistic family (Bloom, HLL, ...).

## 🔗 Related documents
- [[CS_Probabilistic_Data_Structures]]
- [[CS_Bloom_Filter]]
- [[CS_Time_Series_Algorithms]]