"Every parameter is a dial on model behavior: temperature, top_p, top_k, and seed determine the character of each generation." The sampling knobs exposed by production LLM endpoints in 2026 (Anthropic, OpenAI, vLLM, Ollama), together with cfg, steps, and scheduler for image generation, are the core of quantitative control over output.
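Key points
LLM sampling parameters
temperature [0, 2]: divides the logits before softmax. 0 = greedy (always the top token), 1 = the model's raw distribution, >1 = a flatter distribution. The accepted range varies by provider (OpenAI allows 0 to 2, Anthropic 0 to 1). Use 0 for deterministic tasks and 0.7 to 1.0 for creative ones.
top_p (nucleus) [0, 1]: cumulative probability mass. 0.9 = sample only from the smallest set of top tokens whose probabilities sum to 90%.
top_k: keep only the K highest logits. In vLLM, -1 disables it.
min_p [0, 1]: threshold relative to the top token's probability; tokens below min_p times that probability are dropped. A modern alternative to top_p.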
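The four knobs above can be sketched on a toy logit vector. This is a minimal pure-Python illustration, not any engine's actual implementation: the function names (`softmax`, `filter_probs`, `sample`) are made up for this note, and the filter order (top_k, then top_p, then min_p) is an assumption, since real engines may apply them in a different order.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Divide logits by temperature, then normalize to probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_probs(probs, top_k=-1, top_p=1.0, min_p=0.0):
    """Apply top_k, top_p, min_p (assumed order); return renormalized {index: prob}."""
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)
    if top_k > 0:  # -1 means disabled, following the vLLM convention
        ranked = ranked[:top_k]
    # Nucleus: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, mass = [], 0.0
    for index, p in ranked:
        kept.append((index, p))
        mass += p
        if mass >= top_p:
            break
    # min_p: drop tokens below min_p times the top token's probability.
    p_max = kept[0][1]
    kept = [(i, p) for i, p in kept if p >= min_p * p_max]
    total = sum(p for _, p in kept)
    return {i: p / total for i, p in kept}


def sample(logits, temperature=1.0, top_k=-1, top_p=1.0, min_p=0.0, rng=None):
    """Sample one token index; temperature 0 falls back to greedy argmax."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = rng or random.Random()
    dist = filter_probs(softmax(logits, temperature), top_k, top_p, min_p)
    indices = list(dist)
    return rng.choices(indices, weights=[dist[i] for i in indices], k=1)[0]


toy_logits = [2.0, 1.0, 0.0, -1.0]
print(sorted(filter_probs(softmax(toy_logits), top_p=0.9)))  # tokens kept by top_p
print(sorted(filter_probs(softmax(toy_logits), min_p=0.2)))  # tokens kept by min_p
```

With these toy logits, top_p=0.9 keeps three tokens while min_p=0.2 keeps two, which shows how min_p adapts to how peaked the distribution is rather than to a fixed mass.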
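OpenAI: sampling knobs via Chat Completions

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Some reasoning-focused models accept only default sampling values;
    # if the request is rejected, use a model that exposes these knobs.
    resp = client.chat.completions.create(
        model="gpt-5",
        temperature=0.9,
        top_p=0.95,
        presence_penalty=0.6,
        frequency_penalty=0.3,
        max_completion_tokens=2000,
        seed=42,  # best-effort reproducibility
        messages=[{"role": "user", "content": "Write a noir opening."}],
    )
    print(resp.choices[0].message.content)
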
vLLM: full sampling control (self-hosted Llama 3.3)

    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Llama-3.3-70B-Instruct", tensor_parallel_size=4)
    params = SamplingParams(
        temperature=0.7,
        top_p=0.9,
        top_k=50,
        min_p=0.05,  # modern alternative
        repetition_penalty=1.1,
        max_tokens=512,
        stop=["</answer>"],
        seed=2026,
        logprobs=5,  # debugging
    )
    outputs = llm.generate(["Explain mixture-of-experts."], params)

ComfyUI API: programmatic SD3.5 with full control

    import requests

    workflow = {
        "sampler": {
            "class_type": "KSampler",
            "inputs": {
                "seed": 42,
                "steps": 30,
                "cfg": 7.0,
                "sampler_name": "dpmpp_2m",
                "scheduler": "karras",
                "denoise": 1.0,
                # Each link is [source_node_id, output_index].
                "model": ["loader", 0],
                "positive": ["pos_clip", 0],
                "negative": ["neg_clip", 0],
                "latent_image": ["empty_latent", 0],
            },
        },
        # ... rest of graph
    }
    # Queue the workflow on a local ComfyUI server.
    r = requests.post("http://localhost:8188/prompt", json={"prompt": workflow})
    r.raise_for_status()

Sweep parameters with Optuna for prompt+param tuning
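A sweep of this kind might look like the sketch below. It assumes Optuna is installed and uses only `create_study`, `TPESampler`, `optimize`, `suggest_float`, and `suggest_categorical`. Everything else is hypothetical: `PROMPT_TEMPLATES`, `score_output`, and `run_sweep` are names invented for this note, and `score_output` is a toy stand-in for a real evaluation (for example, an LLM call followed by a rubric or judge score).

```python
PROMPT_TEMPLATES = [  # hypothetical prompt variants to compare
    "Write a noir opening.",
    "Write a noir opening in two sentences.",
]


def score_output(prompt: str, temperature: float, top_p: float) -> float:
    """Toy placeholder score; replace with a real evaluation of model output.

    It peaks at temperature=0.7, top_p=0.9 and slightly favors the second template.
    """
    bonus = 0.05 if "two sentences" in prompt else 0.0
    return bonus - (temperature - 0.7) ** 2 - (top_p - 0.9) ** 2


def objective(trial) -> float:
    """Optuna objective: pick a prompt and sampling params, return the score."""
    prompt = trial.suggest_categorical("prompt_template", PROMPT_TEMPLATES)
    temperature = trial.suggest_float("temperature", 0.0, 1.5)
    top_p = trial.suggest_float("top_p", 0.5, 1.0)
    return score_output(prompt, temperature, top_p)


def run_sweep(n_trials: int = 30) -> dict:
    """Run the study and return the best parameters found."""
    import optuna  # imported lazily so the scoring code loads without Optuna

    study = optuna.create_study(
        direction="maximize",
        sampler=optuna.samplers.TPESampler(seed=0),  # seeded for repeatable sweeps
    )
    study.optimize(objective, n_trials=n_trials)
    return study.best_params
```

Calling `run_sweep()` returns a dict with `prompt_template`, `temperature`, and `top_p`. With a real scorer, each trial costs at least one model call, so `n_trials` is effectively a budget.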
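When to use: for deterministic results (RAG, extraction), set temperature=0. For creative output, use temperature 0.7 or higher. To reproduce a bug, fix the seed.
When not to: a seed does not guarantee strict determinism on every model (especially on multi-GPU setups). Do not depend on the seed in production.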
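One way to see why a fixed seed is not enough: the seed pins the random draws, not the probabilities they are compared against. The sketch below is a toy illustration with made-up numbers; `pick` and `uniform_draws` are hypothetical helpers, and the two probability vectors stand in for the tiny floating-point differences that different batch sizes or GPU reduction orders can produce.

```python
import random


def pick(probs, u):
    """Inverse-CDF sampling: return the first index whose cumulative mass exceeds u."""
    cumulative = 0.0
    for index, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return index
    return len(probs) - 1  # guard against rounding at the upper edge


def uniform_draws(seed, n):
    """The seed fully determines the sequence of uniform draws."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


same_seed_same_draws = uniform_draws(42, 5) == uniform_draws(42, 5)

# Same prompt, two runs: numerically near-identical distributions (illustrative values).
probs_run_a = [0.500001, 0.499999]
probs_run_b = [0.499999, 0.500001]
u = 0.5  # an identical random draw in both runs

token_a = pick(probs_run_a, u)
token_b = pick(probs_run_b, u)
print(same_seed_same_draws, token_a, token_b)  # prints: True 0 1
```

Once a single token differs, every later token is conditioned on a different prefix, so the outputs diverge from that point on.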
❌ Anti-patterns
temperature=0 + top_p<1: redundant (greedy decoding already picks the top-1 token).
temperature 1.5+ in production: hallucination and incoherence spike.
Fixed seed alone + temperature 0.7: still nondeterministic under batched inference.
max_tokens=4096 as a default: cost blowup. Budget max_tokens per task instead.
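The checks above are mechanical enough to run as a pre-flight lint on a request's parameters. A minimal sketch follows, assuming parameters arrive as a plain dict. `lint_sampling_params` is a hypothetical helper, and its thresholds (1.5 and 4096) are simply the values named in this list, not limits defined by any provider.

```python
def lint_sampling_params(params: dict, production: bool = True) -> list:
    """Return warnings for the anti-patterns listed above (empty list if clean)."""
    warnings = []
    temperature = params.get("temperature", 1.0)
    top_p = params.get("top_p", 1.0)

    if temperature == 0 and top_p < 1:
        warnings.append("top_p is redundant when temperature=0 (greedy decoding)")
    if production and temperature >= 1.5:
        warnings.append("temperature >= 1.5 in production risks incoherent output")
    if "seed" in params and temperature > 0:
        warnings.append("a fixed seed does not make batched inference deterministic")
    if params.get("max_tokens", 0) >= 4096:
        warnings.append("max_tokens >= 4096: set a per-task budget to control cost")
    return warnings


for message in lint_sampling_params({"temperature": 0.7, "seed": 42, "max_tokens": 4096}):
    print("warning:", message)
```

Returning messages instead of raising keeps the lint advisory, which suits settings that are wasteful rather than invalid.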