---
id: wiki-2026-0508-vocabulary-expansion
title: Vocabulary Expansion
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Vocab Expansion, Tokenizer Extension, Domain Vocabulary]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [nlp, tokenizer, vocabulary, llm, fine-tuning]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: transformers, sentencepiece, tokenizers
---

# Vocabulary Expansion

## One-liner

> **"Surgery that grafts domain tokens onto a base tokenizer."** Extend the vocab of a BPE / SentencePiece tokenizer: initialize the new token embeddings, resize the LM head, and realign the model with continued pretraining. As of 2026, this is the standard recipe for multilingual extension of Llama 3.x / Qwen 3 / Gemma 3.

## Key points

### Why expand

- **Tokenization efficiency**: base tokenizers over-fragment Korean / Japanese / code; for example, "안녕하세요" can shrink from 8 tokens to 1.
- **Domain coverage**: medical / legal / chemistry terms get a single-token representation.
- **Inference cost**: shorter sequences translate directly into lower latency and cost.
- **Quality**: long-tail tokens receive a better gradient signal.

### Expansion methods

1. **Pure addition**: keep the base vocab as is and append the new tokens. New rows are appended to the embedding matrix.
2. **Merge new tokenizer**: train a new BPE on the domain corpus → take the union with the base vocab → resolve conflicts.
3. **Token replacement**: reuse unused tokens (e.g., ``), so the vocab size stays unchanged.

### Embedding init strategies

- **Mean init**: the mean of the sub-word embeddings that the base tokenizer assigns to the new token.
- **Random + small std**: N(0, 0.02); risky.
- **FOCUS / WECHSEL**: nearest-neighbor mapping from source-language embeddings.
- **OFA (One For All)**: SOTA for multilingual transfer (2024).

### Applications

1. Extending an English-only LLM to Korean / Japanese / Arabic.
2. Adding tokens for new languages (Mojo, Zig) to a code LLM.
3. Integrating specialized terms into a biomedical LLM (PubMedBERT).
4. Adding special control tokens (``, ``) to a retrieval-augmented model.

## 💻 Patterns

### Tokenizer expansion (HuggingFace)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.3-8B")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.3-8B")

# Add domain tokens (tokens already in the vocab are skipped)
new_tokens = ["[[CHEMICAL]]", "[[GENE]]", "ACE2", "SARS-CoV-2"]
num_added = tokenizer.add_tokens(new_tokens)
print(f"Added {num_added} tokens, new vocab size: {len(tokenizer)}")

# Resize embedding + LM head
model.resize_token_embeddings(len(tokenizer))
```

### Mean-init for new token embeddings

```python
def init_new_embeddings_by_subword_mean(model, tokenizer, new_tokens, base_tokenizer):
    """Set each new token's input embedding to the mean of its base sub-word embeddings.

    `base_tokenizer` must be a separately loaded copy of the original tokenizer,
    because `tokenizer.add_tokens` has already modified `tokenizer`.
    """
    embed = model.get_input_embeddings().weight.data
    with torch.no_grad():
        for tok in new_tokens:
            tok_id = tokenizer.convert_tokens_to_ids(tok)
            # Tokenize the surface form with the BASE tokenizer
            sub_ids = base_tokenizer(tok, add_special_tokens=False).input_ids
            if len(sub_ids) == 0:
                continue
            embed[tok_id] = embed[sub_ids].mean(dim=0)
    return model
```

### SentencePiece merge (Llama-style)

```python
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

# Load both tokenizer models as protobufs
base = sp_pb2.ModelProto()
with open("base.model", "rb") as f:
    base.ParseFromString(f.read())

domain = sp_pb2.ModelProto()
with open("domain_korean.model", "rb") as f:
    domain.ParseFromString(f.read())

# Append every domain piece that the base vocab does not already contain
base_tokens = {p.piece for p in base.pieces}
added = 0
for piece in domain.pieces:
    if piece.piece not in base_tokens:
        new = sp_pb2.ModelProto().SentencePiece()
        new.piece = piece.piece
        new.score = 0.0
        base.pieces.append(new)
        added += 1

with open("merged.model", "wb") as f:
    f.write(base.SerializeToString())
print(f"Merged: +{added} tokens")
```

### FOCUS-style cross-lingual init

```python
# For each new token: find k-NN among OLD tokens via auxiliary embedding (e.g., fastText)
# Initialize the new embedding as a weighted sum of those neighbors' LLM embeddings.
import torch
import torch.nn.functional as F


def focus_init(new_tokens, aux_embs, llm_embed, old_vocab, k=10):
    # aux_embs: token -> 1-D tensor; old_vocab: token -> id; llm_embed: [vocab, dim] tensor
    init = {}
    for tok in new_tokens:
        if tok not in aux_embs:
            continue
        # Cosine similarity to every old token that has an auxiliary embedding
        sims = {
            o: F.cosine_similarity(aux_embs[tok], aux_embs[o], dim=0).item()
            for o in old_vocab
            if o in aux_embs
        }
        top = sorted(sims.items(), key=lambda x: -x[1])[:k]
        if not top:
            continue
        # Softmax over the top-k similarities, temperature 0.1
        weights = torch.softmax(torch.tensor([s for _, s in top]) / 0.1, dim=0)
        ids = [old_vocab[o] for o, _ in top]
        init[tok] = (weights.unsqueeze(1) * llm_embed[ids]).sum(0)
    return init
```

### Tied weights handling (LM head ↔ input embedding)

```python
# Ids of the newly added tokens (`new_tokens` comes from the expansion step)
new_token_ids = tokenizer.convert_tokens_to_ids(new_tokens)

if model.config.tie_word_embeddings:
    # resize_token_embeddings handles both; verify that they share storage
    assert model.get_input_embeddings().weight.data_ptr() == \
        model.get_output_embeddings().weight.data_ptr()
else:
    # Independently init the LM head rows for new tokens
    lm_head = model.get_output_embeddings().weight.data
    input_emb = model.get_input_embeddings().weight.data
    with torch.no_grad():
        for tok_id in new_token_ids:
            lm_head[tok_id] = input_emb[tok_id].clone()
```

### Continued pretraining LR schedule

```python
from transformers import get_cosine_schedule_with_warmup

# Gradient-mask trick: freeze the old embedding rows at first.
# `old_vocab_size` is the tokenizer length before add_tokens() was called.
embed = model.get_input_embeddings()
new_token_mask = torch.zeros(
    embed.weight.shape[0], dtype=torch.bool, device=embed.weight.device
)
new_token_mask[old_vocab_size:] = True

def mask_grad_hook(grad):
    # Return a masked copy (no in-place edit): only new tokens update initially
    return grad * new_token_mask.unsqueeze(1).to(grad.dtype)

hook = embed.weight.register_hook(mask_grad_hook)
# ... train for N steps, then call hook.remove() for full fine-tune ...
```

### Vocab unused-slot reuse

```python
# Rename reserved tokens in place (Llama / Mistral)
# Vocab size is unchanged, so the upgrade has zero impact on inference cost
spm_model = sp_pb2.ModelProto()
spm_model.ParseFromString(open("tokenizer.model", "rb").read())
for i, piece in enumerate(spm_model.pieces):
    if piece.piece.startswith("