---
id: wiki-2026-0508-sequence-to-sequence-models
title: Sequence to Sequence Models
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [seq2seq, Encoder-Decoder, Sequence Modeling, Sequence-to-Sequence]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [architecture, nlp, transformer, encoder-decoder]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: PyTorch / Transformers
---

# Sequence to Sequence Models

## In One Line

> **"Input sequence → output sequence, where the two lengths may differ."** The lineage runs from the Sutskever et al. (2014) RNN encoder-decoder, through Bahdanau et al. (2015) attention, to the Vaswani et al. (2017) Transformer. As of 2026, nearly all generative LLMs (GPT, Claude, Gemini) are decoder-only seq2seq models, while encoder-decoder models such as T5 and BART persist for specific tasks (translation, summarization fine-tunes).

## Core Concepts

### Architecture family

- **RNN encoder-decoder** (2014): historical; suffers from vanishing gradients and has no attention.
- **Attention seq2seq** (2015): learns alignments between source and target, which gave a jump in translation quality.
- **Transformer encoder-decoder** (2017): self-attention, parallelizable. T5, BART, mT5.
- **Decoder-only** (2018+): GPT family. The dominant pattern for LLMs.
- **Encoder-only** (BERT): classification and embeddings, not generation.

### Core components

- Tokenizer (BPE, SentencePiece, tiktoken).
- Embedding + positional encoding (RoPE, ALiBi; the standard as of 2026).
- Self-attention / cross-attention.
- Teacher forcing for training, autoregressive decoding for inference.

### Decoding strategies

- Greedy / beam search: deterministic tasks.
- Sampling (temperature, top-p, top-k, min-p): creative tasks.
- Speculative / Medusa: inference acceleration.
- Constrained / structured (JSON schema): tool use.

### Applications

1. Machine translation (NLLB, M2M-100).
2. Summarization (BART, Pegasus).
3. Code generation (Claude Code, Copilot).
4. Speech (Whisper encoder + decoder).
5. Image captioning, VQA (multimodal seq2seq).

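
The decoding strategies listed above differ only in how the next token is picked from the model's logits. The following is a minimal pure-Python sketch of greedy selection and of temperature / top-k / top-p sampling; the function names (`softmax`, `top_k_top_p`, `greedy`, `sample`) are illustrative and not taken from any library, and min-p, beam search, and speculative decoding are left out. Real frameworks apply the same filters on batched tensors.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_top_p(probs, k=None, p=1.0):
    """Keep the k most likely tokens, then the smallest prefix whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if k is not None:
        order = order[:k]
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}  # renormalized over the survivors


def greedy(logits):
    """Deterministic choice: index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample(logits, temperature=1.0, k=None, p=1.0, rng=random):
    """Draw one token id from the filtered, renormalized distribution."""
    dist = top_k_top_p(softmax(logits, temperature), k=k, p=p)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]
```

With `k=1` or a very small `p`, `sample` collapses to the greedy choice, which is one way to see greedy decoding as the limiting case of sampling.
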
## 💻 Patterns

### Tiny Transformer encoder-decoder

```python
import torch.nn as nn


class Seq2Seq(nn.Module):
    """Minimal encoder-decoder; positional encoding is omitted for brevity."""

    def __init__(self, vocab, d=256, nhead=4, nl=4):
        super().__init__()
        self.emb_s = nn.Embedding(vocab, d)  # source token embedding
        self.emb_t = nn.Embedding(vocab, d)  # target token embedding
        self.tx = nn.Transformer(d, nhead, nl, nl, batch_first=True)
        self.out = nn.Linear(d, vocab)  # project to vocabulary logits

    def forward(self, src, tgt):
        # Causal mask: target position i may only attend to positions <= i.
        mask = self.tx.generate_square_subsequent_mask(tgt.size(1)).to(tgt.device)
        h = self.tx(self.emb_s(src), self.emb_t(tgt), tgt_mask=mask)
        return self.out(h)
```

### HF Transformers (T5)

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-base")
m = T5ForConditionalGeneration.from_pretrained("t5-base")

# T5 selects the task through a text prefix in the input.
inp = tok("translate English to German: Hello world", return_tensors="pt").input_ids
print(tok.decode(m.generate(inp, max_new_tokens=40)[0], skip_special_tokens=True))
```

### Decoder-only generation (Claude API)

```python
import anthropic

c = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
msg = c.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarize: ..."}],
)
print(msg.content[0].text)
```

### Beam search decode

```python
# `model` and `input_ids`: any HF generative model and its tokenized input.
out = model.generate(input_ids, num_beams=4, length_penalty=0.6,
                     no_repeat_ngram_size=3, max_new_tokens=128)
```

### Streaming

```python
# `c` is the anthropic.Anthropic() client created above.
msgs = [{"role": "user", "content": "Summarize: ..."}]
with c.messages.stream(model="claude-opus-4-7", max_tokens=512, messages=msgs) as s:
    for text in s.text_stream:
        print(text, end="", flush=True)
```

### KV cache reuse

```python
# `pkv` is None on the first step; afterwards `inputs` holds only the new token.
out = model(**inputs, use_cache=True, past_key_values=pkv)
pkv = out.past_key_values  # reuse at the next step
```

## Decision Criteria

| Situation | Approach |
|---|---|
| General LLM task | decoder-only (Claude, GPT) |
| Specific translation/summarization fine-tune | T5/BART encoder-decoder |
| Embedding / classification | encoder-only (BERT family) |
| Speech-to-text | Whisper-style enc-dec |
| Long sequences, low cost | Mamba / Hybrid seq2seq |

**Default**: decoder-only LLM via API.

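
When an encoder-decoder such as the tiny model above is trained rather than called through an API, teacher forcing feeds the ground-truth target, shifted right by one position, as the decoder input. The sketch below shows that shift and the matching causal mask on plain Python lists. The token ids and the helper names `teacher_forcing_pair` and `causal_mask` are hypothetical, chosen only for illustration.

```python
def teacher_forcing_pair(target_ids):
    """Split one target sequence into decoder input and next-token labels."""
    decoder_input = target_ids[:-1]  # everything except the last token
    labels = target_ids[1:]          # the same sequence shifted left by one
    return decoder_input, labels


def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]


BOS, EOS = 1, 2  # hypothetical special-token ids, used only in this example
target = [BOS, 17, 42, 8, EOS]

dec_in, labels = teacher_forcing_pair(target)
mask = causal_mask(len(dec_in))

# At step i the decoder sees dec_in[: i + 1] and is trained to predict labels[i].
for i, label in enumerate(labels):
    print(dec_in[: i + 1], "->", label)
```

Here `True` means "allowed". PyTorch boolean attention masks use the opposite convention, where `True` marks a blocked position, so the mask would need to be inverted before being passed as `tgt_mask`.
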
## 🔗 Graph

- Parent: [[Deep Learning]] · [[NLP]]
- Variants: [[Transformer]] · [[Selective State Space Models (Mamba)]] · [[Encoder-Decoder]]
- Applications: [[Summarization]] · [[Code-Generation]]
- Adjacent: [[Attention Mechanism]] · [[Tokenization]]

## 🤖 LLM Usage

**When**: the task can be framed as an input → output transformation. An API call is usually enough.

**When not**: pure classification. An encoder plus a classification head is cheaper.

## ❌ Anti-patterns

- **Greedy for creative tasks**: leads to repetition. Use sampling.
- **No cache**: each step recomputes the whole prefix, so generating L tokens takes O(L²) token passes instead of O(L). A KV cache is essential.
- **Train from scratch**: almost always the wrong choice. Fine-tune or prompt instead.

## 🧪 Verification / Duplicates

- Verified (Sutskever 2014, Vaswani 2017, HF Transformers docs).
- Trust level: A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: full seq2seq family, 2026 |