---
id: wiki-2026-0508-encoder-decoder-inconsistency
title: Encoder Decoder Inconsistency
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [encoder-decoder-mismatch, seq2seq-inconsistency]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [nlp, transformer, training, decoding]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: pytorch
---
# Encoder Decoder Inconsistency
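## One-line summary
> **"The distribution the encoder saw ≠ the distribution the decoder generates."** During seq2seq training the model is conditioned only on ground-truth context, whereas at inference the decoder receives its own predictions back as input, which causes a distribution shift between training and inference. This is the root cause of exposure bias.
## Key points
### Definition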
- **Train**: decoder input = teacher-forced ground truth.
- **Inference**: decoder input = previously generated token.
- **Gap**: errors compound along the sequence. After an early mistake, every later token is conditioned on an out-of-distribution prefix.
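
The gap can be made concrete with a toy example. The sketch below is not a real decoder: `MODEL` is a hypothetical lookup table standing in for a trained decoder, with one deliberately wrong entry. Under teacher forcing that single error costs one token, while in free-running mode it derails every later step.

```python
# Hypothetical stand-in for a trained decoder: previous token -> next token.
# The entry "b" -> "x" is deliberately wrong (the target continues with "c").
MODEL = {"<s>": "a", "a": "b", "b": "x", "c": "d", "d": "e", "x": "y", "y": "z"}
TARGET = ["a", "b", "c", "d", "e"]


def teacher_forced(target):
    """Predict each token from the ground-truth previous token (training mode)."""
    prefix = ["<s>"] + target[:-1]
    return [MODEL[prev] for prev in prefix]


def free_running(length):
    """Predict each token from the model's own previous output (inference mode)."""
    out, prev = [], "<s>"
    for _ in range(length):
        prev = MODEL[prev]
        out.append(prev)
    return out


def n_errors(pred, target):
    """Count positions where the prediction differs from the target."""
    return sum(p != t for p, t in zip(pred, target))


tf_pred = teacher_forced(TARGET)      # the ground-truth prefix resets the context at every step
fr_pred = free_running(len(TARGET))   # the wrong token becomes the next input
tf_errors = n_errors(tf_pred, TARGET)
fr_errors = n_errors(fr_pred, TARGET)
```

The model is identical in both runs; only the decoder input differs, which is why a teacher-forced training metric can look much better than the quality observed at inference.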
### Manifestations
- Exposure bias (Ranzato et al. 2016).
- The motivation for scheduled sampling.
- One cause of hallucination (especially in long-form generation).
### Applications
1. NMT (Neural Machine Translation): quality degradation on long sentences.
2. Summarization: repetition / drift.
3. Speech recognition: RNN-T vs CTC trade-off.
4. Code generation: syntax breaks in long completions.
## 💻 Patterns
### Scheduled Sampling
```python
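import torch
import torch.nn.functional as F


def scheduled_sampling_step(decoder, prev_token, hidden, gt_token, p_use_gt: float):
    """Feed the ground-truth token with probability p_use_gt, else the model's own prediction."""
    # One coin flip per call, so the whole batch shares the same choice.
    if torch.rand(1).item() < p_use_gt:
        input_tok = gt_token
    else:
        # Extra forward pass to obtain the model's guess; no gradient flows
        # through the sampled token, and the hidden state it returns is discarded.
        with torch.no_grad():
            logits, _ = decoder(prev_token, hidden)
            input_tok = logits.argmax(dim=-1)
    out_logits, hidden = decoder(input_tok, hidden)
    return out_logits, hidden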
```
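`p_use_gt` is usually annealed over training rather than held fixed. Below is a minimal sketch of the inverse sigmoid decay from the scheduled sampling paper (Bengio et al. 2015), assuming a per-step schedule. The function name and the default `k` are illustrative choices, not values from this note.

```python
import math


def inverse_sigmoid_decay(step: int, k: float = 1000.0) -> float:
    """Probability of feeding the ground-truth token at a given training step.

    Implements eps_i = k / (k + exp(i / k)) with k >= 1. A larger k keeps the
    schedule close to pure teacher forcing for longer.
    """
    # Clamp the exponent so very large step counts do not overflow math.exp.
    exponent = min(step / k, 700.0)
    return k / (k + math.exp(exponent))


# The probability starts near 1 and decays toward 0 as training proceeds.
schedule = [inverse_sigmoid_decay(s) for s in (0, 2_000, 5_000, 10_000, 50_000)]
```

The returned value can be passed as `p_use_gt` to `scheduled_sampling_step` at each training step.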
### Minimum Risk Training
```python
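def mrt_loss(model, src, refs, n_samples=8):
    """Sequence-level loss: minimize the expected risk over sampled hypotheses."""
    # model.sample, model.log_prob and a sentence-level bleu(hyp, refs) in [0, 1]
    # are assumed helpers that are not defined in this note.
    hyps = [model.sample(src) for _ in range(n_samples)]
    log_probs = torch.stack([model.log_prob(h, src) for h in hyps])
    # The risk is constant w.r.t. the parameters; match log_probs' device and dtype.
    risks = torch.tensor([1 - bleu(h, refs) for h in hyps]).to(log_probs)
    # Renormalize over the sampled set; gradients flow through the weights.
    weights = F.softmax(log_probs, dim=0)
    return (weights * risks).sum()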
```
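To see what the weighted sum computes, here is a small worked example in plain Python. It is a sketch under stated assumptions: the probabilities and risks are made-up numbers, and the `alpha` sharpness parameter is an addition for illustration (`mrt_loss` above corresponds to `alpha=1.0`).

```python
import math


def expected_risk(log_probs, risks, alpha=1.0):
    """Expected risk under Q(y) proportional to exp(alpha * log p(y)), renormalized over the samples."""
    scaled = [alpha * lp for lp in log_probs]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return sum(e / z * r for e, r in zip(exps, risks))


# Three hypothetical samples with model probabilities 0.5, 0.3, 0.2
# and risks (1 - BLEU) of 0.2, 0.5, 0.9.
log_probs = [math.log(p) for p in (0.5, 0.3, 0.2)]
risks = [0.2, 0.5, 0.9]

risk_model = expected_risk(log_probs, risks)               # weights follow the model: 0.43
risk_uniform = expected_risk(log_probs, risks, alpha=0.0)  # every sample weighted equally
```

Minimizing this quantity pushes probability mass toward low-risk hypotheses, so the model is trained on its own samples rather than on ground-truth prefixes only.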
### Self-distillation Fix
```python
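def self_distill(student, teacher, src, T=2.0):
    """Distill on sequences the teacher generated itself, narrowing the train/inference gap."""
    # Assumes model(src, decoder_input).logits; with Hugging Face seq2seq models,
    # pass input_ids= and decoder_input_ids= by keyword instead.
    with torch.no_grad():
        gen = teacher.generate(src, do_sample=True, top_p=0.9)
        teacher_logits = teacher(src, gen).logits
    # The student is scored on the same generated sequence, with gradients enabled.
    student_logits = student(src, gen).logits
    # Temperature-softened KL; the T * T factor keeps gradient magnitudes comparable across T.
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * T * T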
```
### Beam Search with Length Penalty
```python
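def length_penalty(score, length, alpha=0.7):
    """GNMT length penalty: corrects the bias toward short hypotheses."""
    # score is a summed log-probability; it is divided by lp = ((5 + length) / 6) ** alpha.
    return score / ((5 + length) ** alpha / (5 + 1) ** alpha)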
```
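A short usage sketch shows the effect on beam re-ranking. The candidates and their scores are hypothetical, and the function is repeated so the block runs on its own.

```python
def length_penalty(score, length, alpha=0.7):
    """GNMT-style normalization: score / ((5 + length) / 6) ** alpha."""
    return score / ((5 + length) ** alpha / (5 + 1) ** alpha)


# Hypothetical beam candidates: (text, summed log-probability, length in tokens).
beam = [
    ("short hypothesis", -4.0, 4),
    ("longer and more complete hypothesis", -5.0, 10),
]

# Raw log-probability favors the short candidate because every extra token adds a negative term.
best_raw = max(beam, key=lambda h: h[1])
# After normalization the longer candidate ranks first.
best_normalized = max(beam, key=lambda h: length_penalty(h[1], h[2]))
```

With `alpha=0.7` the normalized scores are roughly -3.01 for the short candidate and -2.63 for the long one, so the ranking flips.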
### Contrastive Decoding
```python
def contrastive_decode(big, small, prompt, alpha=0.5):
    """Large-model logits minus small-model logits: emphasizes the expert/amateur gap."""
    # Next-token logits from each model at the last position of the prompt.
    big_logits = big(prompt).logits[:, -1]
    small_logits = small(prompt).logits[:, -1]
    return big_logits - alpha * small_logits
```
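A plain difference of scores can promote tokens that both models consider unlikely, so contrastive decoding is commonly paired with a plausibility constraint on the expert's distribution. The sketch below illustrates that idea in plain Python. The `plausibility` threshold, the function names, and the toy distributions are assumptions for illustration and are not part of `contrastive_decode` above.

```python
import math


def contrastive_scores(expert_probs, amateur_probs, plausibility=0.1):
    """Score tokens by log p_expert - log p_amateur, restricted to plausible tokens.

    A token is plausible when p_expert >= plausibility * max(p_expert);
    all other tokens get a score of -inf and can never be selected.
    """
    cutoff = plausibility * max(expert_probs)
    return [
        math.log(pe) - math.log(pa) if pe >= cutoff else float("-inf")
        for pe, pa in zip(expert_probs, amateur_probs)
    ]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical next-token distributions over a 4-token vocabulary.
expert = [0.6, 0.3, 0.09, 0.01]
amateur = [0.8, 0.1, 0.0999, 0.0001]

# Without the constraint, token 3 wins although the expert gives it only 1%.
unconstrained = argmax(contrastive_scores(expert, amateur, plausibility=0.0))
# With the constraint, token 3 is masked and token 1 is selected.
constrained = argmax(contrastive_scores(expert, amateur, plausibility=0.1))
```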
## Decision criteria
| Situation | Approach |
|---|---|
| Short sequence (<32) | Teacher forcing is sufficient |
| Long sequence | Scheduled sampling / MRT |
| Production NMT | Beam + length penalty + coverage |
| LLM long-form | Contrastive decoding / self-distillation |

**Default**: teacher forcing, then scheduled sampling after a 1k-step warmup.
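
One way to express this default is a schedule that returns 1.0 during the warmup and decays afterwards. This is a minimal sketch: only the 1k-step warmup comes from this note, while the linear decay, `decay_steps`, and `floor` are assumed values chosen for illustration.

```python
def p_use_gt(step: int, warmup_steps: int = 1_000,
             decay_steps: int = 10_000, floor: float = 0.25) -> float:
    """Pure teacher forcing during warmup, then a linear decay down to `floor`."""
    if step < warmup_steps:
        return 1.0
    # Fraction of the decay phase completed, capped at 1.0.
    progress = min((step - warmup_steps) / decay_steps, 1.0)
    return 1.0 - progress * (1.0 - floor)


# Probability of feeding ground truth at a few checkpoints during training.
checkpoints = {s: p_use_gt(s) for s in (0, 999, 1_000, 6_000, 11_000, 100_000)}
```

Keeping a non-zero floor means the decoder still sees some ground-truth inputs late in training, which is a design choice rather than a requirement.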
## 🔗 Graph
- Parent: [[Sequence-to-Sequence]] · [[Transformer]]
- Variants: [[Scheduled-Sampling]] · [[Minimum-Risk-Training]]
- Applications: [[Neural-Machine-Translation]] · [[Abstractive-Summarization]]
- Adjacent: [[Exposure-Bias]] · [[Hallucination]] · [[Beam-Search]]
## 🤖 LLM usage
**When to use**: when analyzing quality issues in long-form generation, or diagnosing a train/eval BLEU gap.
**When not to use**: short classification tasks, where this inconsistency is irrelevant.
## ❌ Anti-patterns
- **Pure teacher forcing forever**: deploying a model that has never seen the inference-time distribution.
- **Greedy decoding only**: locks in early mistakes.
- **No length normalization**: beam search is biased toward short hypotheses.
## 🧪 Verification / duplicates
- Verified (Ranzato et al. 2016, Bengio et al. 2015 scheduled sampling).
- Trust level: A.
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — encoder/decoder distribution shift + scheduled sampling/MRT/contrastive decoding |