| id | title | category | status | canonical_id | duplicate_of | aliases | source_trust_level | confidence_score | verification_status | tags | last_reinforced | github_commit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| wiki-2026-0508-tokenization-strategies | Tokenization Strategies | 10_Wiki/Topics | duplicate | tokenization-subword-processing | Tokenization & Subword Processing |  | A | 0.9 | redirected |  | 2026-05-10 | pending |
Tokenization Strategies
This document is a duplicate of Tokenization & Subword Processing. It redirects to the canonical document.
Key Summary
- Subword tokenization strategies: BPE, WordPiece, SentencePiece, and Unigram LM.
- The canonical document covers algorithm details, the vocabulary-size tradeoff, and multilingual considerations.
- As of 2026: tiktoken (OpenAI), the Claude tokenizer, and the Llama 3 tokenizer (128K vocab).
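
As a rough illustration of BPE, the first strategy named in the summary, the sketch below learns merge rules from a toy word list and applies them to a new word. It is a minimal sketch under stated assumptions, not the canonical document's implementation and not how tiktoken, SentencePiece, or any production tokenizer works internally: it operates on characters rather than bytes, skips pre-tokenization and end-of-word markers, and breaks ties alphabetically. The function names `learn_bpe_merges` and `apply_merges` are hypothetical.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to num_merges merge rules from a list of words (toy sketch)."""
    # Each distinct word becomes a tuple of symbols, weighted by its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties go to the alphabetically smaller pair
        # (an assumption made here only to keep the result deterministic).
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        # Rewrite each word with the chosen pair fused into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(_merge_pair(list(symbols), best))] += freq
        vocab = new_vocab
    return merges


def _merge_pair(symbols, pair):
    """Fuse each left-to-right occurrence of pair in a symbol list."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def apply_merges(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = _merge_pair(symbols, pair)
    return symbols


if __name__ == "__main__":
    rules = learn_bpe_merges(["low", "low", "lower", "lowest"], num_merges=2)
    print(rules)
    print(apply_merges("lowly", rules))
```

The number of merges plays the role of the vocabulary-size knob: more merges yield longer, rarer subwords and shorter token sequences, at the cost of a larger vocabulary.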
🔗 Graph
- Parent: Tokenization & Subword Processing (canonical)
🕓 Change History
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Marked as duplicate; redirected to the canonical document |