"매 text → discrete unit ID sequence". Tokenization 매 LLM input/output pipeline 의 entry/exit point. 매 word-level (OOV 폭발) → character-level (long sequence) → subword (BPE, WordPiece, SentencePiece, Unigram) 의 evolution. 매 modern LLM (GPT, Claude, Llama) 모두 BPE-variant + 100k-256k vocab.
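Core
Algorithms
BPE (Byte-Pair Encoding): iteratively merges the most frequent adjacent pair of symbols. GPT-2/3/4 and Llama 3 use byte-level BPE; Llama 1/2 used SentencePiece BPE with byte fallback. Claude is reported to use a BPE variant as well.
WordPiece: used by BERT. BPE-like, but picks the merge that most increases training-data likelihood rather than the most frequent pair.
SentencePiece (Unigram): language-agnostic; works on raw text with no whitespace pre-tokenization. SentencePiece is a library that implements both BPE and the Unigram LM algorithm. Used by T5, mBART, and many multilingual models.
Tiktoken: OpenAI's fast BPE implementation, written in Rust with Python bindings.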
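The BPE merge loop described above can be sketched in a few lines. This is a minimal, unoptimized illustration, assuming whitespace-split words and raw UTF-8 bytes as the starting alphabet; the function name `train_bpe` and the toy corpus are hypothetical, and production tokenizers add regex pre-tokenization and far more efficient pair counting.

```python
from collections import Counter


def train_bpe(corpus: str, num_merges: int) -> list[tuple[bytes, bytes]]:
    """Learn up to `num_merges` merges, starting from single UTF-8 bytes."""
    # Each whitespace-separated word becomes a tuple of one-byte symbols.
    words = Counter(
        tuple(bytes([b]) for b in w.encode("utf-8")) for w in corpus.split()
    )
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties go to the pair seen first.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the best pair fused into one symbol.
        new_words = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges


merges = train_bpe("low low low lower lowest", num_merges=2)
print(merges)
```

On this toy corpus the first two merges build up the shared prefix of "low", "lower", and "lowest", which is the behavior that lets frequent words end up as single tokens.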
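Byte-level BPE
The starting alphabet is raw UTF-8 bytes: 256 base tokens.
OOV is impossible, because every string is representable as a byte sequence.
Korean / CJK characters are multi-byte in UTF-8 (typically 3 bytes per character; 4 for rarer characters and emoji), so a character with no learned merge costs several tokens.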
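To make the byte arithmetic concrete, the sketch below uses only Python's built-in UTF-8 encoding, with no tokenizer library. It shows the base-alphabet view of a string before any merges are applied; the sample strings are illustrative.

```python
text = "안녕하세요"
raw = text.encode("utf-8")

# Before any merges, each byte is one base token with an ID in 0-255.
base_ids = list(raw)
print(len(text), "characters ->", len(base_ids), "base tokens")

# Bytes per character: ASCII is 1, Hangul is 3, most emoji are 4.
per_char = [len(ch.encode("utf-8")) for ch in "a안😀"]
print(per_char)

# No OOV: any string round-trips through its byte IDs.
roundtrip = bytes(base_ids).decode("utf-8")
print(roundtrip == text)
```

Learned merges shrink the token count from this worst case, but only for byte sequences that were frequent in the tokenizer's training corpus.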
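Modern vocab sizes (2026)
GPT-4 (cl100k_base) / Claude 3.x: ~100k (the Claude figure is approximate; Anthropic has not published the Claude 3 tokenizer).
GPT-4o: ~200k (cl100k_base → o200k_base).
Llama 3: 128k.
Gemini: ~256k.
A larger vocab means fewer tokens per text (cheaper, more content per context window), at the price of a larger embedding table.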
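The embedding-table side of this trade-off is simple arithmetic: parameters = vocab size × hidden size, doubled if the input embedding and output projection are not tied. The sketch below assumes a hidden size of 4096, which is an illustrative value and not tied to any specific model above.

```python
def embedding_params(vocab_size: int, d_model: int, tied: bool = True) -> int:
    """Parameters in the token embedding (plus an untied output head, if any)."""
    tables = 1 if tied else 2
    return vocab_size * d_model * tables


D_MODEL = 4096  # assumed hidden size, for illustration only

for vocab in (32_000, 128_000, 256_000):
    n = embedding_params(vocab, D_MODEL)
    print(f"vocab={vocab:>7,}  embedding params={n / 1e6:,.0f}M")
```

For a small model these hundreds of millions of parameters can be a large share of the total, which is one reason small models tend to keep smaller vocabularies.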
Applications
LLM input encoding / output decoding.
Cost and context-budget estimation (APIs bill per token).
Multilingual fairness (vocabulary choice hits non-English speakers hard: the same content costs more tokens).
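Cost estimation and fairness audits both reduce to counting tokens. The sketch below takes any `encode` callable, so a real tokenizer can be plugged in; the `byte_encode` stand-in (raw UTF-8 bytes, the no-merge worst case) and the price of 2.0 per million tokens are assumptions for illustration, not real figures.

```python
from typing import Callable, Sequence

Encoder = Callable[[str], Sequence[int]]


def tokens_per_char(encode: Encoder, text: str) -> float:
    """Average number of tokens spent per character of `text`."""
    return len(encode(text)) / len(text)


def estimate_cost(n_tokens: int, price_per_million: float) -> float:
    """Cost of `n_tokens` at a given price per one million tokens."""
    return n_tokens / 1_000_000 * price_per_million


def byte_encode(text: str) -> list[int]:
    # Stand-in tokenizer: one token per UTF-8 byte (no merges learned).
    return list(text.encode("utf-8"))


en = tokens_per_char(byte_encode, "hello")
ko = tokens_per_char(byte_encode, "안녕하세요")
print(f"en={en:.1f} ko={ko:.1f} premium={ko / en:.1f}x")
print(estimate_cost(len(byte_encode("안녕하세요")), price_per_million=2.0))
```

With a real tokenizer, pass its encode method in place of `byte_encode` (for example `enc.encode` from tiktoken or `tok.encode` from Hugging Face) and compare the ratio across parallel texts in each language.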
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="sp",
    vocab_size=32000,
    model_type="unigram",
    character_coverage=0.9995,  # essential for multilingual corpora
)
sp = spm.SentencePieceProcessor(model_file="sp.model")
print(sp.encode("hello", out_type=str))
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
ids = tok.apply_chat_template(
    [{"role": "user", "content": "hi"}],
    tokenize=True,
    add_generation_prompt=True,
)  # the chat template inserts the special tokens automatically
Custom merge inspection
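import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding; use the one matching your model
# Inspect how the text is split into tokens
text = "안녕하세요"
ids = enc.encode(text)
for i in ids:
    # Print raw bytes: a single token may hold only part of a UTF-8 character
    print(i, enc.decode_single_token_bytes(i))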
Decision criteria
| Situation | Approach |
| --- | --- |
| Use existing LLM | Match its tokenizer (tiktoken/HF) |
| Train new LLM (English) | Byte-level BPE, 32k-128k vocab |
| Multilingual model | SentencePiece Unigram, high coverage |
| Code model | Larger vocab (200k+), code-heavy corpus |
| Korean / CJK heavy | Larger vocab + ensure char coverage |
| Domain-specific (medical) | Extend vocab with domain merges |
Default: byte-level BPE with a 32k-128k vocab for English/code; SentencePiece Unigram for multilingual.
When to care: token cost estimation, custom tokenizer training, multilingual fairness audits, debugging prompt length.
When not to: plain LLM API usage, where tokenization happens implicitly and needs no explicit work.
❌ Anti-patterns
Tokenizer mismatch: the model and tokenizer must match; mixing them yields garbage output.
Assuming one Korean character = 1 token: under byte-level BPE, one Korean character often costs about 2-3 tokens (fewer with larger, more multilingual vocabularies).
No EOS handling: forgetting the generation stop token leads to endless output.
Whitespace prefix: "hello" and " hello" are different tokens; tokenization is sensitive to leading spaces.
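The EOS anti-pattern is easiest to see in a bare decoding loop. The sketch below is a toy under stated assumptions: `generate`, `toy_model`, and the EOS ID of 0 are hypothetical stand-ins, not any library's API. The point is the two stop conditions: an explicit EOS check, plus a hard cap as a safety net.

```python
from typing import Callable, List


def generate(
    next_token: Callable[[List[int]], int],
    prompt_ids: List[int],
    eos_id: int,
    max_new_tokens: int = 64,
) -> List[int]:
    """Greedy-style loop that stops on EOS or after `max_new_tokens`."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):  # hard cap: bounds output even without EOS
        tok = next_token(ids)
        if tok == eos_id:  # explicit stop-token check
            break
        ids.append(tok)
    return ids[len(prompt_ids):]


EOS = 0  # hypothetical EOS ID; real models define their own


def toy_model(ids: List[int]) -> int:
    # Stand-in for a model: emits EOS once the sequence has 5 tokens.
    return EOS if len(ids) >= 5 else len(ids) + 100


out = generate(toy_model, [1, 2], eos_id=EOS)
print(out)

# A model that never emits EOS is still bounded by the cap.
capped = generate(lambda ids: 7, [1], eos_id=EOS, max_new_tokens=3)
print(capped)
```

In real code, the EOS ID should come from the tokenizer that matches the model (for Hugging Face tokenizers, `tok.eos_token_id`), never from a hard-coded constant.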