"매 encoder가 본 분포 ≠ decoder가 생성하는 분포". Seq2seq training 시 encoder는 ground-truth context를 보지만 decoder는 inference에서 자기 prediction을 다시 입력으로 받기 때문에 train/inference 간 distribution shift가 발생한다. 매 exposure bias 의 근본 원인.
Gap: 매 error compounds along sequence — early mistake → later tokens conditioned on out-of-distribution prefix.
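To make the compounding concrete, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that each token is wrong with an independent probability `eps` given a clean prefix; real decoding errors are correlated, so treat the numbers as intuition rather than a model of any particular system. The function name and the error rate are hypothetical.

```python
def clean_prefix_prob(eps: float, length: int) -> float:
    """Probability that the first `length` tokens contain no error,
    assuming an independent per-token error rate `eps` (a simplification)."""
    if not 0.0 <= eps <= 1.0:
        raise ValueError("eps must be in [0, 1]")
    if length < 0:
        raise ValueError("length must be non-negative")
    return (1.0 - eps) ** length


if __name__ == "__main__":
    # Illustrative error rate of 2% per token.
    for length in (10, 50, 200):
        print(length, round(clean_prefix_prob(0.02, length), 3))
```

Under this assumption a 2% per-token error rate leaves roughly 82% of prefixes clean at 10 tokens, about 36% at 50 tokens, and under 2% at 200 tokens, which is why the gap matters mostly for long outputs.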
Manifestations
Exposure bias (Ranzato et al., 2016).
The motivation for scheduled sampling.
One cause of hallucination (especially in long-form generation).
Applications
NMT (neural machine translation): translation quality degrades on long sentences.
Summarization: repetition and drift.
Speech recognition: the RNN-T vs. CTC trade-off.
Code generation: syntax breaks in long completions.
💻 Patterns
Scheduled Sampling
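    import torch
    import torch.nn.functional as F


    def scheduled_sampling_step(decoder, prev_logits, hidden, gt_token, p_use_gt: float):
        """Feed the ground-truth token with probability p_use_gt, otherwise the model's own prediction."""
        # `prev_logits` are the decoder outputs from the previous step, so
        # `hidden` is advanced exactly once per call.
        if torch.rand(1).item() < p_use_gt:
            input_tok = gt_token
        else:
            # Greedy prediction; detached so no gradient flows through the choice.
            input_tok = prev_logits.detach().argmax(dim=-1)
        out_logits, hidden = decoder(input_tok, hidden)
        return out_logits, hidden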
Minimum Risk Training
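    import torch
    import torch.nn.functional as F


    def mrt_loss(model, src, refs, n_samples=8):
        """Sequence-level loss: minimize the expected risk over sampled hypotheses."""
        # `model.sample`, `model.log_prob`, and a sentence-level `bleu(hyp, refs)`
        # returning a value in [0, 1] are assumed to be defined elsewhere.
        hyps = [model.sample(src) for _ in range(n_samples)]
        log_probs = torch.stack([model.log_prob(h, src) for h in hyps])
        risks = torch.tensor(
            [1.0 - bleu(h, refs) for h in hyps],
            dtype=log_probs.dtype,
            device=log_probs.device,
        )
        # Renormalize the model distribution over the sampled subset.
        weights = F.softmax(log_probs, dim=0)
        return (weights * risks).sum()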
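As a small numeric illustration of the expected-risk objective above, the sketch below renormalizes hypothesis probabilities over the sampled set and takes the risk-weighted average. The `alpha` sharpness exponent comes from the usual MRT formulation and is not part of the snippet above; the function name and any inputs you pass are illustrative, not taken from a real system.

```python
import math


def expected_risk(log_probs, risks, alpha=1.0):
    """Expected risk over a sampled hypothesis set.

    Weights are softmax(alpha * log_prob) over the samples, i.e. the model
    distribution renormalized on the subset and sharpened/smoothed by alpha.
    """
    if not log_probs or len(log_probs) != len(risks):
        raise ValueError("log_probs and risks must be non-empty and equal length")
    scaled = [alpha * lp for lp in log_probs]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return sum((e / z) * r for e, r in zip(exps, risks))


if __name__ == "__main__":
    # Two hypotheses: the likelier one has the lower risk (1 - BLEU).
    print(expected_risk([-1.0, -3.0], [0.2, 0.6]))
```

When all hypotheses are equally likely the result is the plain mean of the risks; as one hypothesis dominates, the loss approaches that hypothesis's risk, so lowering the loss means shifting probability mass toward low-risk outputs.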
Self-distillation Fix
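    import torch
    import torch.nn.functional as F


    def self_distill(student, teacher, src, T=2.0):
        """Train on sequences the teacher generated itself to narrow the train/inference gap."""
        # Assumes both models accept (source, decoder input) and return an
        # object with a `.logits` attribute.
        with torch.no_grad():
            gen = teacher.generate(src, do_sample=True, top_p=0.9)
            teacher_logits = teacher(src, gen).logits
        student_logits = student(src, gen).logits
        # Temperature-scaled KL(teacher || student); T * T rescales the gradients.
        return F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * T * T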
Beam Search with Length Penalty
    def length_penalty(score, length, alpha=0.7):
        """GNMT length penalty: corrects the bias toward short hypotheses."""
        return score / ((5 + length) ** alpha / (5 + 1) ** alpha)
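The effect of the penalty is easiest to see when reranking finished beam hypotheses. The sketch below repeats the length penalty so it runs on its own; `rerank` and the two toy hypotheses are hypothetical, with summed log-probabilities chosen only to show the ranking flip.

```python
def length_penalty(score: float, length: int, alpha: float = 0.7) -> float:
    """GNMT-style normalization: divide the log-prob by ((5 + len) / 6) ** alpha."""
    return score / (((5 + length) ** alpha) / ((5 + 1) ** alpha))


def rerank(hyps, alpha: float = 0.7):
    """Sort (tokens, sum_log_prob) pairs best-first by length-normalized score."""
    return sorted(
        hyps,
        key=lambda h: length_penalty(h[1], len(h[0]), alpha),
        reverse=True,
    )


if __name__ == "__main__":
    short = (["a", "b", "c"], -3.0)  # 3 tokens
    long = (["a", "b", "c", "d", "e", "f", "g", "h"], -4.0)  # 8 tokens
    print([len(h[0]) for h in rerank([short, long], alpha=0.0)])  # raw log-prob order
    print([len(h[0]) for h in rerank([short, long], alpha=0.7)])  # normalized order
```

With `alpha=0.0` the penalty is 1 and the shorter hypothesis wins on raw log-probability; with `alpha=0.7` the longer hypothesis ranks first because its score is divided by a larger factor.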
Contrastive Decoding
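    def contrastive_decode(big, small, prompt, alpha=0.5):
        """Large-model logits minus weighted small-model logits, emphasizing the expert/amateur gap."""
        big_logits = big(prompt).logits[:, -1]
        small_logits = small(prompt).logits[:, -1]
        # Next-token scores; take argmax (or sample) over the last dim to pick a token.
        return big_logits - alpha * small_logits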
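A plain logit difference can promote tokens that both models consider very unlikely, simply because the small model dislikes them even more. Contrastive decoding as usually described guards against this with a plausibility constraint: only tokens whose expert probability is within a factor of the expert's best token are eligible. The sketch below is a minimal pure-Python version of that idea for a single step; `plaus_alpha` is the plausibility threshold (distinct from the `alpha` weight in the snippet above), and the function names and threshold value are illustrative assumptions.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def contrastive_next_token(expert_logits, amateur_logits, plaus_alpha=0.1):
    """Pick the token maximizing expert minus amateur log-prob, restricted to
    tokens with p_expert >= plaus_alpha * max(p_expert)."""
    if len(expert_logits) != len(amateur_logits) or not expert_logits:
        raise ValueError("logit lists must be non-empty and equal length")
    if not 0.0 < plaus_alpha <= 1.0:
        raise ValueError("plaus_alpha must be in (0, 1]")
    exp_lp = log_softmax(expert_logits)
    ama_lp = log_softmax(amateur_logits)
    cutoff = max(exp_lp) + math.log(plaus_alpha)
    best_tok, best_score = -1, -math.inf
    for tok, (e, a) in enumerate(zip(exp_lp, ama_lp)):
        if e < cutoff:
            continue  # implausible under the expert, skip
        if e - a > best_score:
            best_tok, best_score = tok, e - a
    return best_tok
```

The expert's top token always passes the cutoff, so the function always returns a valid index. Setting `plaus_alpha` close to zero effectively disables the constraint and recovers the unconstrained difference.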
Decision criteria

| Situation | Approach |
| --- | --- |
| Short sequence (<32) | Teacher forcing is sufficient |
| Long sequence | Scheduled sampling / MRT |
| Production NMT | Beam search + length penalty + coverage |
| LLM long-form | Contrastive decoding / self-distillation |
Default: teacher forcing, then scheduled sampling after a 1k-step warmup.
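One way to express this default is as a schedule for the ground-truth probability: hold it at 1 during the warmup (pure teacher forcing), then decay it. The sketch below uses the inverse-sigmoid decay from the scheduled sampling literature after the warmup; the decay constant `k`, the floor `p_min`, and the function name are illustrative assumptions, not values prescribed by this note.

```python
import math


def p_use_gt(step: int, warmup: int = 1000, k: float = 2000.0, p_min: float = 0.0) -> float:
    """Probability of feeding the ground-truth token at a given training step.

    Returns 1.0 during warmup, then k / (k + exp(t / k)) with t counted from
    the end of warmup, clipped below at p_min.
    """
    if step < warmup:
        return 1.0
    t = step - warmup
    # Cap the exponent so very large step counts cannot overflow math.exp.
    decay = k / (k + math.exp(min(t / k, 700.0)))
    return max(p_min, decay)


if __name__ == "__main__":
    for step in (0, 999, 1000, 10_000, 50_000):
        print(step, p_use_gt(step))
```

The returned value can be passed as the `p_use_gt` argument of a scheduled sampling step. A larger `k` keeps the model on ground-truth inputs for longer before the decay sets in.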