---
id: wiki-2026-0508-ai-sampling-strategies
title: AI Sampling Strategies
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [LLM Sampling, Decoding Strategies]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [llm, sampling, decoding, inference]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: vllm/transformers
---

# AI Sampling Strategies

## One-liner

> **"The art of converting logits into tokens: a quality vs. diversity trade-off."**

Greedy decoding is deterministic; temperature, top-k, and top-p are stochastic. As of 2026, min-p, Mirostat, and speculative decoding are mainstream.

## Core concepts

### Deterministic

- **Greedy**: pick the argmax token at every step. Risk: repetitive output.
- **Beam search**: maintain the top-B candidate sequences. Useful for translation, bland for open-ended generation.

### Stochastic

- **Temperature**: divide the logits by T before the softmax. T<1 sharpens the distribution, T>1 flattens it.
- **Top-k**: sample only from the k most probable tokens.
- **Top-p (nucleus)**: sample from the smallest set of tokens whose cumulative probability is ≥ p.
- **Min-p**: keep tokens whose probability is at least P(top) * min_p. The threshold scales with the model's confidence, which the min-p paper argues makes it a better choice than top-p.
- **Typical-p**: entropy-based; keep tokens whose information content is close to the entropy of the distribution (the "typical" tokens).
- **Mirostat**: feedback control that steers generation toward a target perplexity.

### Applications

1. Creative writing: high temperature + top-p.
2. Code generation: low temperature with a greedy fallback.
3. Reasoning: self-consistency (sample N answers, take the majority vote).
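The temperature and top-k rules from the stochastic samplers above can be checked on a toy example. The following is a minimal sketch in plain Python (standard library only) rather than a production sampler; the helper names `softmax_with_temperature` and `top_k_filter` and the logit values are illustrative assumptions, not part of any library.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw logits to probabilities after dividing by temperature T."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero the rest, and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept_mass = sum(probs[i] for i in keep)
    return [p / kept_mass if i in keep else 0.0 for i, p in enumerate(probs)]


logits = [2.0, 1.0, 0.5, -1.0]  # toy values, not taken from a real model
sharp = softmax_with_temperature(logits, temperature=0.5)  # T<1: sharper
flat = softmax_with_temperature(logits, temperature=2.0)   # T>1: flatter
top2 = top_k_filter(softmax_with_temperature(logits), k=2)
```

With these toy logits the top token carries more probability mass at T=0.5 than at T=2.0, and `top2` leaves nonzero probability only on the two most likely tokens.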
## 💻 Patterns

### Temperature + top-p

```python
import torch
import torch.nn.functional as F

def sample(logits, temperature=0.7, top_p=0.9):
    # temperature must be > 0; use argmax directly for greedy decoding
    logits = logits / temperature
    probs = F.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = probs.sort(descending=True)
    cumsum = sorted_probs.cumsum(-1)
    # Drop a token once the mass *before* it already exceeds top_p,
    # so the token that crosses the threshold is still kept
    mask = cumsum - sorted_probs > top_p
    sorted_probs[mask] = 0
    sorted_probs /= sorted_probs.sum(-1, keepdim=True)
    pick = torch.multinomial(sorted_probs, 1)
    # Map the position in the sorted order back to a vocabulary id
    return sorted_idx.gather(-1, pick)
```

### Min-p

```python
def min_p_sample(logits, min_p=0.05, temperature=1.0):
    logits = logits / temperature
    probs = F.softmax(logits, dim=-1)
    # Threshold scales with the probability of the most likely token
    threshold = probs.max(-1, keepdim=True).values * min_p
    probs = torch.where(probs >= threshold, probs, torch.zeros_like(probs))
    probs /= probs.sum(-1, keepdim=True)
    return torch.multinomial(probs, 1)
```

### Self-consistency (CoT majority vote)

```python
from collections import Counter

def self_consistency(prompt, llm, n=20):
    # `llm.generate` and `extract_final_answer` are placeholders for your
    # own client and answer parser
    answers = []
    for _ in range(n):
        cot = llm.generate(prompt, temperature=0.7)
        answers.append(extract_final_answer(cot))
    return Counter(answers).most_common(1)[0][0]
```

### Speculative decoding

```python
def sample_diff(target_dist, draft_dist):
    # Resample from the normalized residual max(0, target - draft)
    residual = torch.clamp(target_dist - draft_dist, min=0)
    return torch.multinomial(residual / residual.sum(), 1).item()

def speculative(target, draft, prompt, k=4):
    # Draft k tokens cheaply, then let the target verify them in parallel.
    # `draft.generate`, `target.score`, `target.sample`, and `done` are
    # placeholder interfaces; draft_probs[i] and target_probs[i] are assumed
    # to be full vocabulary distributions at draft position i.
    ctx = list(prompt)
    while not done(ctx):
        draft_tokens, draft_probs = draft.generate(ctx, k)
        target_probs = target.score(ctx, draft_tokens)
        accepted = []
        for i, tok in enumerate(draft_tokens):
            p, q = target_probs[i][tok], draft_probs[i][tok]
            # Accept the drafted token with probability min(1, p / q)
            if torch.rand(1).item() < min(1.0, (p / q).item()):
                accepted.append(tok)
            else:
                # Rejected: resample from (target - draft)+ and stop
                accepted.append(sample_diff(target_probs[i], draft_probs[i]))
                break
        else:
            # All k drafts accepted: take one extra token from the target
            accepted.append(target.sample(ctx + accepted))
        ctx += accepted
    return ctx
```

### Mirostat (perplexity control)

```python
def mirostat(logits, mu, tau=5.0, eta=0.1):
    # Mirostat v2-style step for 1-D logits: drop tokens whose surprise
    # (in bits) exceeds mu, sample, then nudge mu toward the target tau.
    # mu is commonly initialized to 2 * tau.
    sorted_probs, idx = F.softmax(logits, -1).sort(descending=True)
    s = -torch.log2(sorted_probs)
    k = max((s < mu).sum().item(), 1)  # always keep at least one token
    pick = torch.multinomial(sorted_probs[:k] / sorted_probs[:k].sum(), 1)
    surprise = -torch.log2(sorted_probs[pick]).item()
    mu = mu - eta * (surprise - tau)
    return idx[pick], mu
```

### Repetition penalty

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.1):
    # 1-D logits, modified in place. Both branches lower the score of an
    # already-generated token: negative logits grow more negative,
    # positive logits shrink.
    for tok in set(generated_ids):
        if logits[tok] < 0:
            logits[tok] *= penalty
        else:
            logits[tok] /= penalty
    return logits
```

## Decision criteria

| Situation | Sampler |
|---|---|
| Code, math, structured | T=0 greedy or T=0.2 |
| Chat / general | T=0.7, top-p=0.9 or min-p=0.05 |
| Creative / fiction | T=1.0+, min-p=0.02 |
| Reasoning ensemble | T=0.7, n=20, majority vote |
| Translation | Beam search (B=4-8) |
| Latency-critical | Speculative decoding (target + small draft) |

**Default**: T=0.7 + min-p=0.05.

## 🔗 Graph

- Parent: [[Decoding]]
- Application: [[Self-Consistency]]
- Adjacent: [[LLM_Optimization_and_Deployment_Strategies|vLLM]]

## 🤖 LLM usage

**When**: on every call in the inference pipeline; match the sampler to the task.
**When not**: logprob analysis (no sampling needed).

## ❌ Anti-patterns

- **High temp + greedy fallback**: inconsistent output; stick to a single sampler.
- **Top-k=1 with high temp**: contradictory; top-k=1 is greedy, so the temperature has no effect.
- **No repetition penalty on long outputs**: the output falls into loops.
- **Speculative without acceptance check**: the output distribution shifts away from the target model's.

## 🧪 Verification / duplicates

- Verified (Holtzman et al. nucleus sampling 2020, Leviathan et al. speculative 2023, min-p paper 2024).
- Trust level A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: sampler taxonomy + working code |