Antigravity Agent 504fd5fb42 [G1-Sync] Manual knowledge update

2026-05-10 22:08:15 +09:00


Reservoir Sampling

Draws K random samples from a stream of unknown size, using Algorithm R. Typical uses: log sampling, A/B tests, and representative subsets of large datasets.

📖 Core concepts

The total size is unknown, or the data does not fit in memory.
Single pass.
Every element is selected with equal probability.
Memory: O(K).

💻 Code patterns

Algorithm R (k=1)

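function sampleOne<T>(stream: Iterable<T>): T | null {
  let sample: T | null = null;
  let count = 0;

  for (const item of stream) {
    count++;
    // Replace the current sample with probability 1/count.
    if (Math.random() < 1 / count) {
      sample = item;
    }
  }

  return sample; // null only when the stream is empty
}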

→ Each item ends up as the final sample with probability 1/n, so the result is uniform.

k > 1

function reservoirSample<T>(stream: Iterable<T>, k: number): T[] {
  const reservoir: T[] = [];
  let count = 0;
  
  for (const item of stream) {
    count++;
    if (reservoir.length < k) {
      reservoir.push(item);
    } else {
      const j = Math.floor(Math.random() * count);
      if (j < k) reservoir[j] = item;
    }
  }
  
  return reservoir;
}

→ A uniform random sample of K items.
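
When items arrive through callbacks rather than an iterable (log events, for example), the same k > 1 logic can live in a small stateful object. This is a minimal sketch: the class name `Reservoir`, its methods, and the injectable `rand` parameter are illustrative assumptions, not part of the original note.

```typescript
// Hypothetical push-based wrapper around Algorithm R (all names are illustrative).
class Reservoir<T> {
  private readonly items: T[] = [];
  private seen = 0;
  private readonly k: number;
  private readonly rand: () => number;

  constructor(k: number, rand: () => number = Math.random) {
    this.k = k;
    this.rand = rand;
  }

  // Offer one item from the stream: O(1) time per item, O(k) memory overall.
  add(item: T): void {
    this.seen++;
    if (this.items.length < this.k) {
      this.items.push(item);
      return;
    }
    // Pick a slot in [0, seen); the new item is kept only if the slot is < k.
    const j = Math.floor(this.rand() * this.seen);
    if (j < this.k) this.items[j] = item;
  }

  // Copy of the current sample (at most k items).
  sample(): T[] {
    return [...this.items];
  }

  // Number of items offered so far.
  count(): number {
    return this.seen;
  }
}

const reservoir = new Reservoir<number>(3);
for (let i = 0; i < 1000; i++) reservoir.add(i);
console.log(reservoir.sample()); // three values from 0..999, different on each run
```

Keeping the running count inside the object also makes it available later, for instance when several reservoirs have to be merged.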

Proof (intuition)

n items, k=1 sample.

P(item i is the final sample)
= P(Math.random() < 1/i is true) (item i is picked at step i)
× P(Math.random() >= 1/j for every j > i) (the sample is never replaced afterwards)
= 1/i × i/(i+1) × ... × (n-1)/n
= 1/n ✓

→ Every item has probability 1/n. Uniform.
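
The same telescoping argument extends to k > 1. As a sketch, write $P_i$ for the probability that item $i$ is in the final reservoir (the symbol is introduced here only for this derivation). At step $j > k$ the new item is accepted with probability $k/j$ and lands on one particular slot with probability $1/k$, so an item already in the reservoir survives that step with probability $1 - 1/j$.

```latex
\begin{aligned}
P_i &= \frac{k}{i} \prod_{j=i+1}^{n} \left(1 - \frac{k}{j}\cdot\frac{1}{k}\right)
     = \frac{k}{i} \prod_{j=i+1}^{n} \frac{j-1}{j}
     = \frac{k}{i}\cdot\frac{i}{n}
     = \frac{k}{n}, \qquad i > k, \\
P_i &= 1 \cdot \prod_{j=k+1}^{n} \frac{j-1}{j}
     = \frac{k}{n}, \qquad i \le k.
\end{aligned}
```

Every item therefore ends up in the reservoir with probability k/n, which reduces to 1/n for k = 1.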

Use cases

- Log sampling: 10k QPS of logs → 100 samples / sec.
- Large dataset: 1B rows → 10k random rows.
- Streaming analytics: top-K estimation.
- Distributed systems: each node keeps its own sample.

Weighted reservoir (Algorithm A-Res)

// Each item carries a weight w; heavier items are more likely to be sampled (exactly ∝ w when k = 1).
function weightedReservoir<T>(stream: Iterable<[T, number]>, k: number): T[] {
  const heap: [number, T][] = [];   // min-heap by key
  
  for (const [item, weight] of stream) {
    const key = Math.pow(Math.random(), 1 / weight);
    if (heap.length < k) {
      heap.push([key, item]);
      heap.sort((a, b) => a[0] - b[0]);  // simple O(k log k) re-sort; a real implementation would use a priority queue
    } else if (key > heap[0][0]) {
      heap[0] = [key, item];
      heap.sort((a, b) => a[0] - b[0]);
    }
  }
  
  return heap.map(([_, item]) => item);
}

→ Items with larger weights are sampled more often.
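
One practical caveat with Math.pow(Math.random(), 1 / weight): for very small weights the key underflows to 0 and many items tie. Because the logarithm is monotonic, comparing log(u) / w yields the same ordering without underflow. The sketch below assumes positive weights and skips anything else. The function name `weightedReservoirLog` and the `rand` parameter are illustrative, not from the original note.

```typescript
// A-Res with log-space keys: key = ln(u) / w orders items exactly like u^(1/w).
// Assumption: only items with weight > 0 are eligible; others are skipped.
function weightedReservoirLog<T>(
  stream: Iterable<[T, number]>,
  k: number,
  rand: () => number = Math.random,
): T[] {
  if (k <= 0) return [];
  const kept: { key: number; item: T }[] = [];

  for (const [item, weight] of stream) {
    if (!(weight > 0)) continue; // skips zero, negative, and NaN weights
    const key = Math.log(rand()) / weight; // always <= 0; larger is better

    if (kept.length < k) {
      kept.push({ key, item });
      continue;
    }
    // Linear scan for the smallest key: O(k) per item, fine for a sketch.
    let minIdx = 0;
    for (let i = 1; i < kept.length; i++) {
      if (kept[i].key < kept[minIdx].key) minIdx = i;
    }
    if (key > kept[minIdx].key) kept[minIdx] = { key, item };
  }

  return kept.map((entry) => entry.item);
}

const weighted: [string, number][] = [["rare", 1e-6], ["common", 5], ["typical", 1]];
console.log(weightedReservoirLog(weighted, 2)); // two items, biased toward heavier weights
```

The linear scan keeps the example short; swapping in a real min-heap brings the per-item cost down to O(log k).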

Distributed reservoir

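Each of the N nodes keeps its own reservoir of K items.
The aggregator gathers the N×K candidates and reduces them to the final K. To keep the result uniform, candidates must be weighted by the number of items each node actually saw.

→ The sampling pattern used in MapReduce.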
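
The aggregation step can be sketched as a pairwise merge. The version below assumes each node reports its reservoir together with the number of items it saw, and that a reservoir holds min(K, seen) items. The `NodeSample` shape and the `mergeSamples` name are hypothetical, introduced only for this example.

```typescript
// Hypothetical report from one node: its reservoir plus how many items it saw.
interface NodeSample<T> {
  items: T[];
  seen: number;
}

// Merge two uniform reservoirs into one uniform reservoir of up to k items.
// Each pick comes from a side with probability proportional to the number of
// not-yet-represented stream items on that side.
function mergeSamples<T>(
  a: NodeSample<T>,
  b: NodeSample<T>,
  k: number,
  rand: () => number = Math.random,
): NodeSample<T> {
  const poolA = [...a.items];
  const poolB = [...b.items];
  let remA = a.seen;
  let remB = b.seen;
  const merged: T[] = [];

  while (merged.length < k && remA + remB > 0) {
    const fromA = rand() * (remA + remB) < remA;
    const pool = fromA ? poolA : poolB;
    // Remove a random candidate from the chosen pool (swap with last, then pop).
    const idx = Math.floor(rand() * pool.length);
    merged.push(pool[idx]);
    pool[idx] = pool[pool.length - 1];
    pool.pop();
    if (fromA) remA--;
    else remB--;
  }

  return { items: merged, seen: a.seen + b.seen };
}

const nodeA: NodeSample<string> = { items: ["a1", "a2"], seen: 900 };
const nodeB: NodeSample<string> = { items: ["b1", "b2"], seen: 100 };
console.log(mergeSamples(nodeA, nodeB, 2)); // mostly items from nodeA, since it saw 9x more
```

Because the result carries the combined `seen` count, merges can be chained across any number of nodes. Drawing K items uniformly from the N×K candidates without these weights would over-represent nodes that saw little traffic.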

Datadog log sample

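# Each trace can have a different sample rate.
# `trace` is a placeholder for the APM library's tracing decorator.
@trace
def handle(req):
    # Datadog samples only the high-volume traces
    pass

→ The APM keeps its own reservoir.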

Approximate quantile

Reservoir + sort = quantile estimate.
- 1M items → 10k samples → estimate the median.
- Exact: no (only a full sort guarantees that).
- Approximate: OK (within about 1% in rank for a 10k sample).

→ T-Digest / KLL are more accurate.
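
As a small illustration of "reservoir + sort", the sketch below reads a quantile off an already-collected sample using the nearest-rank rule. The function name `estimateQuantile` is an assumption made for this example.

```typescript
// Nearest-rank quantile of a numeric sample; q is expected in [0, 1].
function estimateQuantile(sample: number[], q: number): number {
  if (sample.length === 0) throw new Error("empty sample");
  const sorted = [...sample].sort((a, b) => a - b); // numeric sort on a copy
  const idx = Math.min(sorted.length - 1, Math.max(0, Math.floor(q * sorted.length)));
  return sorted[idx];
}

console.log(estimateQuantile([5, 1, 3, 2, 4], 0.5)); // 3
```

For a uniform sample of size k, the rank error of the estimate has a standard deviation of about sqrt(q(1 - q) / k), which is 0.005 for the median with k = 10,000. That is where the "within about 1%" figure comes from.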

Postgres TABLESAMPLE

SELECT * FROM big_table TABLESAMPLE BERNOULLI (1);  -- 1%
SELECT * FROM big_table TABLESAMPLE SYSTEM (1);     -- 1% (block-level)

→ Sampling built into the database.

A/B test sampling

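// `murmur` stands for any well-mixed 32-bit string hash (e.g. MurmurHash3); supply your own.
function shouldSample(userId: string, rate: number): boolean {
  const hash = murmur(userId) >>> 0;   // force an unsigned 32-bit value
  return hash / 0x100000000 < rate;    // map to [0, 1)
}

if (shouldSample(user.id, 0.05)) {
  // runs for about 5% of users
}

→ Hash-based and deterministic: the same user always gets the same result.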
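
For a version that runs without an external hash library, the sketch below substitutes FNV-1a (over UTF-16 code units) followed by the MurmurHash3 finalizer mix. This is a stand-in and not MurmurHash itself. The names `hash32` and `inBucket`, and the idea of salting with an experiment name, are assumptions added for illustration.

```typescript
// Stand-in 32-bit string hash: FNV-1a plus the MurmurHash3 finalizer for better mixing.
function hash32(s: string): number {
  let h = 0x811c9dc5; // FNV-1a offset basis
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193); // FNV prime, 32-bit multiply
  }
  h ^= h >>> 16;
  h = Math.imul(h, 0x85ebca6b);
  h ^= h >>> 13;
  h = Math.imul(h, 0xc2b2ae35);
  h ^= h >>> 16;
  return h >>> 0; // unsigned 32-bit result
}

// Deterministic bucketing: the same (experiment, user) pair always gets the same answer.
function inBucket(userId: string, experiment: string, rate: number): boolean {
  // Salting with the experiment name decorrelates assignments across experiments.
  return hash32(`${experiment}:${userId}`) / 0x100000000 < rate;
}

console.log(inBucket("user-42", "new-checkout", 0.05)); // same value on every run
```

Without a per-experiment salt, the same 5% of users would land in every experiment that uses the same rate.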

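Pitfall: Math.random() bias

JS Math.random() is not cryptographically secure.
- Fine for statistics.
- Poor for very large N (10^15+): a double carries at most 53 random bits (2^53 ≈ 9 × 10^15), so Math.floor(Math.random() * count) cannot reach every index evenly.

→ OK for general use.
For crypto, use crypto.getRandomValues.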

Streaming top-K (different problem)

A reservoir gives a random sample.
Top-K gives the largest items.

Top-K = min-heap of size K.

function topK<T>(stream: Iterable<[T, number]>, k: number): T[] {
  // MinHeap is not a built-in: assume a binary min-heap ordered by the score (first tuple element).
  const heap = new MinHeap<[number, T]>();
  for (const [item, score] of stream) {
    if (heap.size() < k) heap.push([score, item]);
    else if (score > heap.top()![0]) {
      heap.pop();
      heap.push([score, item]);
    }
  }
  return heap.toArray().map(x => x[1]);
}
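
Since JavaScript has no built-in heap, here is a self-contained sketch of the same top-K idea with a minimal binary min-heap. The class `ScoreHeap` and the function `topKByScore` are illustrative names, not a library API.

```typescript
// Minimal binary min-heap keyed by a numeric score (illustrative only).
class ScoreHeap<T> {
  private readonly data: { score: number; item: T }[] = [];

  size(): number {
    return this.data.length;
  }

  // Entry with the smallest score, or undefined when empty.
  top(): { score: number; item: T } | undefined {
    return this.data[0];
  }

  push(score: number, item: T): void {
    this.data.push({ score, item });
    let i = this.data.length - 1;
    while (i > 0) { // sift up
      const parent = (i - 1) >> 1;
      if (this.data[parent].score <= this.data[i].score) break;
      [this.data[parent], this.data[i]] = [this.data[i], this.data[parent]];
      i = parent;
    }
  }

  pop(): { score: number; item: T } | undefined {
    const root = this.data[0];
    const last = this.data.pop();
    if (this.data.length > 0 && last !== undefined) {
      this.data[0] = last;
      let i = 0;
      for (;;) { // sift down
        const left = 2 * i + 1;
        const right = left + 1;
        let smallest = i;
        if (left < this.data.length && this.data[left].score < this.data[smallest].score) smallest = left;
        if (right < this.data.length && this.data[right].score < this.data[smallest].score) smallest = right;
        if (smallest === i) break;
        [this.data[smallest], this.data[i]] = [this.data[i], this.data[smallest]];
        i = smallest;
      }
    }
    return root;
  }

  items(): T[] {
    return this.data.map((entry) => entry.item);
  }
}

// Keep the k highest-scoring items; the heap root is always the weakest survivor.
function topKByScore<T>(stream: Iterable<[T, number]>, k: number): T[] {
  const heap = new ScoreHeap<T>();
  if (k <= 0) return [];
  for (const [item, score] of stream) {
    if (heap.size() < k) {
      heap.push(score, item);
    } else if (score > heap.top()!.score) {
      heap.pop();
      heap.push(score, item);
    }
  }
  return heap.items(); // heap order, not sorted by score
}

const scored: [string, number][] = [["a", 3], ["b", 9], ["c", 1], ["d", 7], ["e", 5]];
console.log(topKByScore(scored, 2).sort()); // [ 'b', 'd' ]
```

Unlike a reservoir, the result here is fully determined by the scores: no randomness is involved.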

Probabilistic data structures (related)

- Bloom filter: membership.
- HyperLogLog: cardinality (count distinct).
- Count-Min Sketch: frequency.
- T-Digest / KLL: quantile.
- Reservoir: random sample.

→ The trade-off: the cost of an exact answer vs. an approximate one.

→ CS_Probabilistic_Data_Structures.

ClickHouse sample

SELECT count() FROM events SAMPLE 0.1;
-- → 10% sample. Multiply the count by 10 for an approximate total.

SELECT * FROM events SAMPLE 1000;
-- → runs on a sample of roughly 1000 rows.

Approximate aggregation

-- Very large table
SELECT sum(amount) FROM orders SAMPLE 0.01;
-- → sum over a 1% sample; multiply by 100.

-- Fast, but approximate.
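
The scale-up arithmetic behind "sum of the sample × 100" can be sketched outside the database. This is not how ClickHouse implements SAMPLE. It only illustrates the estimator, using independent coin flips per row. The function name `approxSum` and the `rand` parameter are assumptions for this example.

```typescript
// Keep each value with probability `rate`, then scale the partial sum by 1 / rate.
function approxSum(
  values: Iterable<number>,
  rate: number,
  rand: () => number = Math.random,
): number {
  if (!(rate > 0 && rate <= 1)) throw new RangeError("rate must be in (0, 1]");
  let sampledSum = 0;
  for (const v of values) {
    if (rand() < rate) sampledSum += v; // row is in the sample
  }
  return sampledSum / rate; // unbiased estimate of the full sum
}

const amounts = Array.from({ length: 100_000 }, (_, i) => (i % 50) + 1);
console.log(approxSum(amounts, 0.01)); // close to the exact sum of 2,550,000, varies per run
```

The estimate is right on average, but its spread grows as the rate shrinks or as a few very large values dominate the sum.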

Stratified sampling

Different rates per user segment:
- VIP users: sample 80%
- Free users: sample 1%

→ One reservoir per segment.

const reservoirs = new Map<string, T[]>();

for (const item of stream) {
  const segment = item.segment;
  if (!reservoirs.has(segment)) reservoirs.set(segment, []);
  // ... reservoir per segment
}
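
A fuller version of the per-segment loop might look like the sketch below. It uses a fixed reservoir size per segment rather than the percentage rates listed above; a rate-based variant would replace each reservoir with a per-segment coin flip. The names `stratifiedSample`, `segmentOf`, and `kFor` are hypothetical.

```typescript
// One Algorithm R reservoir per segment key, each with its own running count.
function stratifiedSample<T>(
  stream: Iterable<T>,
  segmentOf: (item: T) => string,
  kFor: (segment: string) => number,
  rand: () => number = Math.random,
): Map<string, T[]> {
  const reservoirs = new Map<string, T[]>();
  const counts = new Map<string, number>();

  for (const item of stream) {
    const segment = segmentOf(item);
    const k = kFor(segment);
    const count = (counts.get(segment) ?? 0) + 1;
    counts.set(segment, count);

    let reservoir = reservoirs.get(segment);
    if (reservoir === undefined) {
      reservoir = [];
      reservoirs.set(segment, reservoir);
    }

    if (reservoir.length < k) {
      reservoir.push(item);
    } else {
      // The count is per segment, so each segment stays uniform on its own.
      const j = Math.floor(rand() * count);
      if (j < k) reservoir[j] = item;
    }
  }

  return reservoirs;
}

type Visit = { userId: number; segment: string };
const visits: Visit[] = [];
for (let i = 0; i < 1000; i++) visits.push({ userId: i, segment: i % 100 === 0 ? "vip" : "free" });

const bySegment = stratifiedSample(visits, (v) => v.segment, (s) => (s === "vip" ? 8 : 10));
console.log(bySegment.get("vip")?.length, bySegment.get("free")?.length); // 8 10
```

Using the per-segment count rather than the global count is the important detail: a global count would make rare segments almost never replace their early items.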

Time-based sample

- The first N events of every minute become the sample.
- A random N from every hour.
- Continuous reservoir + decay (older = lower weight).

When to use it?

✓ Streams / large datasets.
✓ Memory is limited.
✓ Single pass.
✓ Approximate results are OK.

✗ Exact results are required.
✗ Bias is critical.
✗ Small datasets (just shuffle).

Implementation references

- Apache Spark's sample().
- ClickHouse SAMPLE clause.
- Postgres TABLESAMPLE.
- Datadog / Grafana log sampling.
- Most APM (NewRelic, Sentry) trace sampling.

Real-world

1B log lines → keep 100k:
- Reservoir K=100k.
- Memory: 100k × log line size = small (about 50 MB, assuming ~500 bytes per line).
- 1 pass.
- Further analysis runs as post-processing on the sample.

LLM dataset sampling

# 1B web pages → 100M for training.
# Quality filter + reservoir (score, threshold and reservoir_add are placeholders).

for page in pages:
    quality = score(page)
    if quality > threshold:
        reservoir_add(page, weight=quality)

→ Common Crawl-style datasets.

🤔 Decision criteria

Task	Recommendation
Stream sample	Reservoir
Weighted	A-Res
Top-K	Min-heap
Quantile	T-Digest / KLL
Cardinality	HyperLogLog
Membership	Bloom filter
Exact random subset	Full + shuffle
DB sample	TABLESAMPLE

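❌ Anti-patterns

Loading everything into memory + shuffle: OOM on large datasets.
Using the first N items as the sample: biased (head only).
Math.random() for crypto: not acceptable.
Reservoir that ignores weights: biased.
Distributed sampling that does not preserve segments: loses stratification.
Assuming exactness after sampling: results are approximate only.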

🤖 LLM usage hints

Algorithm R (k=1, k>1) is the standard.
Weighted = A-Res.
1-pass + memory O(K).
A member of the probabilistic family (Bloom, HLL, ...).
