---
id: wiki-2026-0508-tokenization-strategies
title: Tokenization Strategies
category: 10_Wiki/Topics
status: duplicate
canonical_id: tokenization-subword-processing
duplicate_of: "[[Tokenization & Subword Processing]]"
aliases: []
source_trust_level: A
confidence_score: 0.9
verification_status: redirected
tags: [duplicate, tokenization, nlp, bpe]
last_reinforced: 2026-05-10
github_commit: pending
---
# Tokenization Strategies
> **This document is a duplicate of [[Tokenization & Subword Processing]].** It redirects to the canonical document.
## Key Summary
- Subword tokenization strategies: BPE, WordPiece, SentencePiece, and Unigram LM.
- The canonical document covers algorithm details, the vocab size tradeoff, and multilingual considerations.
- Tokenizers noted as of 2026: tiktoken (OpenAI), the Claude tokenizer, and the Llama 3 tokenizer (128K vocab).
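
As a rough illustration of the first strategy listed above, the following is a minimal sketch of BPE merge learning on a toy word-frequency table. The function names (`count_pairs`, `merge_pair`, `train_bpe`), the toy corpus, and the tie-breaking rule are assumptions made for this example; production tokenizers such as tiktoken work on bytes and handle pre-tokenization, which this sketch omits.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a {word: frequency} table.

    Hypothetical helper for illustration: each word starts as a tuple of
    characters, and the most frequent adjacent pair is merged each round.
    """
    words = {tuple(word): freq for word, freq in corpus.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        # Assumed tie-break: highest count, then lexicographically largest pair.
        best = max(pairs, key=lambda p: (pairs[p], p))
        words = merge_pair(words, best)
        merges.append(best)
    return merges, words


if __name__ == "__main__":
    toy_corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
    merges, words = train_bpe(toy_corpus, num_merges=2)
    print(merges)
    print(words)
```

On this toy corpus the shared suffix of `newest` and `widest` is the first thing to be merged, which is the general intuition behind BPE: frequent character sequences become single vocabulary entries. See the canonical document for the actual algorithm details.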
## 🔗 Graph
- Parent: [[Tokenization & Subword Processing]] (canonical)
## 🕓 Change History
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Marked as duplicate; redirected to the canonical document |