---
id: wiki-2026-0508-encoder-decoder-inconsistency
title: Encoder Decoder Inconsistency
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [encoder-decoder-mismatch, seq2seq-inconsistency]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [nlp, transformer, training, decoding]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: pytorch
---

# Encoder Decoder Inconsistency

## One-liner

> **"The distribution the encoder saw ≠ the distribution the decoder generates."** During seq2seq training the model is conditioned on ground-truth context (the source and the teacher-forced target prefix), but at inference the decoder receives its own predictions as input, so a distribution shift arises between training and inference. This is the root cause of exposure bias.

## Core

### Definition

- **Train**: decoder input = teacher-forced ground truth.
- **Inference**: decoder input = previously generated token.
- **Gap**: errors compound along the sequence. An early mistake means later tokens are conditioned on an out-of-distribution prefix.

### Manifestations

- Exposure bias (Ranzato 2016).
- The motivation for scheduled sampling.
- One cause of hallucination (especially in long-form generation).

### Applications

1. NMT (Neural Machine Translation): degradation when translating long sentences.
2. Summarization: repetition / drift.
3. Speech recognition: RNN-T vs CTC trade-off.
4. Code generation: syntax breaks in long completions.

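The train/inference gap defined above can be illustrated with a toy sketch. It is not a real model: the "decoder" is assumed to be a next-token lookup table with exactly one wrong entry, and every name (`TABLE`, `teacher_forced`, `free_running`, `count_errors`) is hypothetical. The point is only to show how a single mistake stays isolated under teacher forcing but propagates once the decoder consumes its own output.

```python
# Toy "decoder": a next-token lookup table with a single wrong entry.
# All names here are hypothetical and exist only for this illustration.
TABLE = {i: i + 1 for i in range(10)}
TABLE[2] = 4  # the one mistake: the correct continuation of 2 is 3

START = 0
TARGET = [1, 2, 3, 4, 5, 6]


def teacher_forced(table, start, target):
    """Predict each step from the ground-truth prefix (training-time input)."""
    inputs = [start] + target[:-1]
    return [table[tok] for tok in inputs]


def free_running(table, start, length):
    """Predict each step from the model's own previous output (inference-time input)."""
    preds, tok = [], start
    for _ in range(length):
        tok = table[tok]
        preds.append(tok)
    return preds


def count_errors(preds, target):
    """Number of positions where the prediction differs from the target."""
    return sum(p != t for p, t in zip(preds, target))


tf_errors = count_errors(teacher_forced(TABLE, START, TARGET), TARGET)
fr_errors = count_errors(free_running(TABLE, START, len(TARGET)), TARGET)
print(tf_errors, fr_errors)  # prints "1 4"
```

Under teacher forcing the wrong entry costs one token, because the next input is reset to the ground truth. In free-running mode the same entry shifts every later input off the target sequence, which is the compounding effect described in the **Gap** bullet.
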
## 💻 Patterns

### Scheduled Sampling

```python
import torch
import torch.nn.functional as F


def scheduled_sampling_step(decoder, prev_token, hidden, gt_token, p_use_gt: float):
    """Feed the ground-truth token with probability p_use_gt, else the model's own prediction."""
    # One coin flip per step, shared by the whole batch.
    if torch.rand(1).item() < p_use_gt:
        input_tok = gt_token
    else:
        # Pick the model's own token without backpropagating through the choice;
        # the hidden state from this extra pass is discarded.
        with torch.no_grad():
            logits, _ = decoder(prev_token, hidden)
            input_tok = logits.argmax(dim=-1)
    out_logits, hidden = decoder(input_tok, hidden)
    return out_logits, hidden
```

### Minimum Risk Training

```python
def mrt_loss(model, src, refs, bleu, n_samples=8):
    """Sequence-level loss: minimize the expected risk over sampled hypotheses."""
    # `bleu` is a caller-supplied sentence-level metric returning a float in [0, 1].
    hyps = [model.sample(src) for _ in range(n_samples)]
    log_probs = torch.stack([model.log_prob(h, src) for h in hyps])  # shape: (n_samples,)
    risks = torch.tensor([1 - bleu(h, refs) for h in hyps]).to(log_probs)
    # Renormalize the model distribution over the sampled subset.
    weights = F.softmax(log_probs, dim=0)
    return (weights * risks).sum()
```

### Self-distillation Fix

```python
def self_distill(student, teacher, src, T=2.0):
    """Train on sequences the teacher generated itself to narrow the train/inference gap."""
    # Assumes a Hugging Face style seq2seq interface (generate, decoder_input_ids, .logits).
    with torch.no_grad():
        gen = teacher.generate(src, do_sample=True, top_p=0.9)
        teacher_logits = teacher(input_ids=src, decoder_input_ids=gen).logits
    # The student pass stays outside no_grad so gradients reach the student.
    student_logits = student(input_ids=src, decoder_input_ids=gen).logits
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * T * T
```

### Beam Search with Length Penalty

```python
def length_penalty(score, length, alpha=0.7):
    """GNMT length penalty: corrects beam search's bias toward short hypotheses."""
    return score / ((5 + length) ** alpha / (5 + 1) ** alpha)
```

### Contrastive Decoding

```python
def contrastive_decode(big, small, prompt, alpha=0.5):
    """Large-model logits minus scaled small-model logits: emphasizes the expert/amateur gap."""
    big_logits = big(prompt).logits[:, -1]
    small_logits = small(prompt).logits[:, -1]
    # Returns next-token scores; the caller picks the token (e.g. argmax).
    return big_logits - alpha * small_logits
```

## Decision criteria

| Situation | Approach |
|---|---|
| Short sequence (<32) | Teacher forcing is sufficient |
| Long sequence | Scheduled sampling / MRT |
| Production NMT | Beam + length penalty + coverage |
| LLM long-form | Contrastive decoding / self-distillation |

**Default**: teacher forcing, then scheduled sampling after a 1k-step warmup.

## 🔗 Graph

- Parent: [[Sequence-to-Sequence]] · [[Transformer]]
- Variants: [[Scheduled-Sampling]] · [[Minimum-Risk-Training]]
- Applications: [[Neural-Machine-Translation]] · [[Abstractive-Summarization]]
- Adjacent: [[Exposure-Bias]] · [[Hallucination]] · [[Beam-Search]]

## 🤖 LLM usage

**When**: analyzing quality issues in long-form generation; diagnosing a train/eval BLEU gap.

**When not**: short classification tasks, where this inconsistency does not apply.

## ❌ Anti-patterns

- **Pure teacher forcing forever**: deploying a model that has never seen its inference-time distribution.
- **Greedy decoding only**: early mistakes get locked in.
- **No length normalization**: beam search is biased toward short hypotheses.

## 🧪 Verification / Duplicates

- Verified (Ranzato et al. 2016, Bengio et al. 2015 scheduled sampling).
- Trust level: A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: encoder/decoder distribution shift + scheduled sampling/MRT/contrastive decoding |