2nd/10_Wiki/Topics/AI_and_ML/Vocabulary-Expansion.md
Antigravity Agent f8b21af4be Wiki cleanup: error-doc removal, dedup merge, link normalization
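Large-scale cleanup of 10_Wiki/Topics:
- Removed 227 error-capture / unfinished stub documents
- Merged 43 clusters of cross-folder duplicates (63 files → redirect)
- Link-name normalization: fixed broken links, pointed links directly at redirect targets, and mapped concepts (~2,400 changes)
- Created 6 new category MOCs
- Removed 10,058 unresolved related-keyword links from Graph sections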

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 23:52:15 +09:00

id: wiki-2026-0508-vocabulary-expansion
title: Vocabulary Expansion
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: Vocab Expansion, Tokenizer Extension, Domain Vocabulary
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: nlp, tokenizer, vocabulary, llm, fine-tuning
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language: python; framework: transformers, sentencepiece, tokenizers

Vocabulary Expansion

One-line summary

"Surgery that grafts domain tokens onto a base tokenizer." Vocabulary expansion extends the vocab of a BPE / SentencePiece tokenizer in three steps: initialize the new token embeddings, resize the LM head, and align the new rows through continued pretraining. As of 2026, this is the standard recipe for multilingual extension of Llama 3.x / Qwen 3 / Gemma 3.

Key points

Why expand

  • Tokenization efficiency: base tokenizers over-fragment Korean, Japanese, and code; for example, "안녕하세요" can drop from 8 tokens to 1.
  • Domain coverage: medical, legal, and chemistry terms get a single-token representation.
  • Inference cost: shorter sequences directly reduce latency and cost.
  • Quality: long-tail tokens receive a stronger gradient signal.
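
One way to see where over-fragmentation comes from: when a tokenizer has no merges for a script, it can fall back to raw UTF-8 bytes, and each Hangul syllable occupies three bytes. The sketch below only counts characters and bytes with the standard library. It does not run a real tokenizer, so the byte count is a worst-case bound, not the 8-token figure quoted above.

```python
text = "안녕하세요"

num_chars = len(text)                  # Unicode code points (one per Hangul syllable)
num_bytes = len(text.encode("utf-8"))  # worst case: one token per UTF-8 byte

print(num_chars, num_bytes)            # 5 15
```

A single dedicated vocab entry for the whole greeting would replace all of those byte-level pieces with one token.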

Expansion methods

  1. Pure addition: keep the base vocab as is and append the new tokens, which appends rows to the embedding matrix.
  2. Merge a new tokenizer: train a new BPE tokenizer on the domain corpus, take its union with the base vocab, and resolve conflicts.
  3. Token replacement: reuse unused tokens (e.g., <unused42>), leaving the vocab size unchanged.
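
The merge method hinges on how ID conflicts are resolved. A minimal sketch in plain Python, assuming vocabularies are simple token-to-ID dicts; the function name `merge_vocab` and the toy vocab are hypothetical and not part of any library. Base IDs stay fixed so existing embedding rows remain valid, and only unseen domain tokens receive new IDs at the end.

```python
def merge_vocab(base_vocab, domain_tokens):
    """Union of a base vocab and domain tokens that never renumbers base IDs.

    base_vocab: dict mapping token -> id (ids assumed contiguous from 0).
    domain_tokens: iterable of candidate tokens from the domain tokenizer.
    Returns (merged_vocab, added_tokens).
    """
    merged = dict(base_vocab)
    next_id = max(base_vocab.values()) + 1 if base_vocab else 0
    added = []
    for tok in domain_tokens:
        if tok in merged:      # conflict or duplicate: keep the existing id
            continue
        merged[tok] = next_id  # unseen token: append at the end
        added.append(tok)
        next_id += 1
    return merged, added


# Toy example (made-up vocab): "cell" collides with the base, "ACE2" repeats
base = {"<s>": 0, "</s>": 1, "the": 2, "cell": 3}
merged, added = merge_vocab(base, ["cell", "ACE2", "SARS-CoV-2", "ACE2"])
print(added)  # ['ACE2', 'SARS-CoV-2']
```

Keeping base IDs untouched is what lets the embedding matrix grow by row append instead of being reshuffled.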

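Embedding initialization strategies

  • Mean init: the mean of the sub-word embeddings the new token previously split into.
  • Random with small std: drawn from a normal distribution with mean 0 and std 0.02; risky.
  • FOCUS / WECHSEL: map each new token to its nearest neighbors among source-language embeddings.
  • OFA (One For All): reported as state of the art for multilingual transfer (2024).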
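
To make the contrast between the first two strategies concrete, here is a toy sketch using plain Python lists. The helper names and the 3-dimensional vectors are made up for illustration; real embeddings have thousands of dimensions. Mean init places the new token among the pieces it replaces, while random init produces a vector unrelated to the existing embedding space.

```python
import random


def mean_init(sub_vectors):
    """Average the embeddings of the sub-word pieces a new token used to split into."""
    dim = len(sub_vectors[0])
    return [sum(v[d] for v in sub_vectors) / len(sub_vectors) for d in range(dim)]


def random_init(dim, std=0.02, seed=0):
    """Draw each coordinate independently from a normal distribution N(0, std^2)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, std) for _ in range(dim)]


# Made-up embeddings for two sub-word pieces of one new token
pieces = [[1.0, 2.0, 3.0], [3.0, 0.0, -1.0]]
mean_vec = mean_init(pieces)
rand_vec = random_init(dim=3)

print(mean_vec)  # [2.0, 1.0, 1.0]
```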

Applications

  1. Extending an English-only LLM to Korean, Japanese, or Arabic.
  2. Adding tokens for new languages (Mojo, Zig) to a code LLM.
  3. Integrating specialized terms into a biomedical language model (PubMedBERT).
  4. Adding special control tokens (<doc>, <query>) to a retrieval-augmented model.

💻 Patterns

Tokenizer expansion (Hugging Face)

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

MODEL_ID = "meta-llama/Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# Add domain tokens (add_tokens skips any that already exist in the vocab)
new_tokens = ["[[CHEMICAL]]", "[[GENE]]", "ACE2", "SARS-CoV-2"]
num_added = tokenizer.add_tokens(new_tokens)
print(f"Added {num_added} tokens, new vocab size: {len(tokenizer)}")

# Resize the input embedding matrix and the LM head to the new vocab size
model.resize_token_embeddings(len(tokenizer))

Mean-init for new token embeddings

import torch

def init_new_embeddings_by_subword_mean(model, tokenizer, new_tokens, base_tokenizer):
    """Set each new token's input embedding to the mean of its base-tokenizer sub-word embeddings."""
    embed = model.get_input_embeddings().weight.data
    with torch.no_grad():
        for tok in new_tokens:
            tok_id = tokenizer.convert_tokens_to_ids(tok)
            # Tokenize the surface form with the BASE (pre-expansion) tokenizer
            sub_ids = base_tokenizer(tok, add_special_tokens=False).input_ids
            if len(sub_ids) == 0:
                continue  # nothing to average; keep the default initialization
            embed[tok_id] = embed[sub_ids].mean(dim=0)
    return model

SentencePiece merge (Llama-style)

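# Applies to SentencePiece-based tokenizers (Llama 1/2); Llama 3 uses a tiktoken-based BPE tokenizer
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

def load_model_proto(path):
    """Parse a SentencePiece .model file into a ModelProto."""
    proto = sp_pb2.ModelProto()
    with open(path, "rb") as f:
        proto.ParseFromString(f.read())
    return proto

base = load_model_proto("base.model")
domain = load_model_proto("domain_korean.model")

# Append only the pieces the base model lacks; existing pieces keep their IDs
base_tokens = {p.piece for p in base.pieces}
added = 0
for piece in domain.pieces:
    if piece.piece not in base_tokens:
        new = sp_pb2.ModelProto().SentencePiece()
        new.piece = piece.piece
        new.score = 0.0  # placeholder score for appended pieces
        base.pieces.append(new)
        base_tokens.add(piece.piece)
        added += 1

with open("merged.model", "wb") as f:
    f.write(base.SerializeToString())
print(f"Merged: +{added} tokens")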

FOCUS-style cross-lingual init

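import torch
import torch.nn.functional as F

# For each new token: find k-NN among OLD tokens via auxiliary embeddings (e.g., fastText),
# then initialize the new embedding as a weighted sum of those neighbors' LLM embeddings.
# Assumes aux_embs maps token -> 1-D tensor and old_vocab maps token -> row id in llm_embed.
def focus_init(new_tokens, aux_embs, llm_embed, old_vocab, k=10, temperature=0.1):
    init = {}
    for tok in new_tokens:
        if tok not in aux_embs:
            continue  # no auxiliary vector: leave this token to another init strategy
        sims = {
            o: F.cosine_similarity(aux_embs[tok], aux_embs[o], dim=0).item()
            for o in old_vocab if o in aux_embs
        }
        top = sorted(sims.items(), key=lambda x: -x[1])[:k]
        if not top:
            continue
        # Temperature-scaled softmax over the top-k similarities
        weights = torch.softmax(torch.tensor([s for _, s in top]) / temperature, dim=0)
        ids = [old_vocab[o] for o, _ in top]
        init[tok] = (weights.to(llm_embed).unsqueeze(1) * llm_embed[ids]).sum(0)
    return init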

Tied weights handling (LM head ↔ input embedding)

# IDs of the tokens added during tokenizer expansion
new_token_ids = tokenizer.convert_tokens_to_ids(new_tokens)

if model.config.tie_word_embeddings:
    # Tied weights share storage, so resize_token_embeddings covers both; verify
    assert model.get_input_embeddings().weight.data_ptr() == \
           model.get_output_embeddings().weight.data_ptr()
else:
    # Untied: initialize the LM head rows for new tokens from the input embeddings
    lm_head = model.get_output_embeddings().weight.data
    input_emb = model.get_input_embeddings().weight.data
    with torch.no_grad():
        for tok_id in new_token_ids:
            lm_head[tok_id] = input_emb[tok_id].clone()

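Continued pretraining: LR schedule and gradient masking

import torch
from transformers import get_cosine_schedule_with_warmup

# Gradient-mask trick: freeze the old embedding rows and train only the new ones at first
embed = model.get_input_embeddings()
old_vocab_size = len(tokenizer) - num_added
new_token_mask = torch.zeros(embed.weight.shape[0], dtype=torch.bool,
                             device=embed.weight.device)
new_token_mask[old_vocab_size:len(tokenizer)] = True

def mask_grad_hook(grad):
    # Return a masked copy instead of modifying grad in place
    return grad * new_token_mask.unsqueeze(1).to(grad.dtype)

hook_handle = embed.weight.register_hook(mask_grad_hook)
# LR schedule (optimizer, warmup_steps, and total_steps come from the training setup):
# scheduler = get_cosine_schedule_with_warmup(
#     optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps)
# Note: weight decay still moves the old rows unless it is disabled for the embeddings.
# ... train for N steps, then remove the hook for full fine-tuning ...
hook_handle.remove()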

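Vocab unused-slot reuse

# In-place rename of reserved placeholder tokens in a SentencePiece model
# (e.g., <unusedN>; the exact naming varies by model family, so check the vocab first).
# Vocab size is unchanged, so the upgrade has zero impact on inference cost.
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

RESERVED_PREFIX = "<reserved_"  # adjust to the placeholder naming of the target model
pending = list(NEW_TOKENS)      # tokens still waiting for a slot

spm_model = sp_pb2.ModelProto()
with open("tokenizer.model", "rb") as f:
    spm_model.ParseFromString(f.read())

for i, piece in enumerate(spm_model.pieces):
    if not pending:
        break
    # Only slots among the first 256 IDs are considered
    if piece.piece.startswith(RESERVED_PREFIX) and i < 256:
        piece.piece = pending.pop()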

Validation: tokenization rate

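def tokens_per_char(tokenizer, corpus):
    """Average number of tokens per character over a corpus (lower means less fragmentation)."""
    total_tokens = total_chars = 0
    for doc in corpus:
        # Exclude BOS/EOS so short documents do not inflate the rate
        total_tokens += len(tokenizer(doc, add_special_tokens=False).input_ids)
        total_chars += len(doc)
    return total_tokens / total_chars if total_chars else 0.0

before = tokens_per_char(base_tok, korean_corpus)    # e.g., 0.8
after  = tokens_per_char(merged_tok, korean_corpus)  # e.g., 0.4, i.e., 2x compression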
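
Tokens-per-character rates translate directly into sequence length, which is what drives latency and cost. The sketch below does that arithmetic for the illustrative 0.8 and 0.4 rates; the helper name `sequence_savings` and the 10,000-character document are assumptions made for the example, not measurements.

```python
def sequence_savings(tpc_before, tpc_after, num_chars):
    """Convert tokens-per-character rates into token counts for a text of num_chars."""
    return {
        "tokens_before": tpc_before * num_chars,
        "tokens_after": tpc_after * num_chars,
        "compression": tpc_before / tpc_after,
        "tokens_saved_fraction": 1 - tpc_after / tpc_before,
    }


# Illustrative rates only: 0.8 before expansion, 0.4 after
report = sequence_savings(0.8, 0.4, num_chars=10_000)
print(report)
```

With these rates a 10,000-character document shrinks from about 8,000 tokens to about 4,000, so half of the tokens are saved.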

Decision criteria

| Situation | Approach |
|---|---|
| Small domain (<200 tokens) | Pure addition + mean init |
| New language (10K+ tokens) | Tokenizer merge + FOCUS / OFA init |
| Inference cost is critical | Reserved-slot reuse |
| Multilingual extension | OFA / WECHSEL + continued pretraining |
| Adding control tokens | Pure addition + random small init + SFT |

Default: for small additions, mean init plus brief continued pretraining (1-5B tokens).
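
The decision criteria can be encoded as a small lookup function. This is a sketch only: the function name, its parameters, and above all the order in which the checks run are assumptions, since the criteria themselves do not rank the situations against each other.

```python
def choose_approach(num_new_tokens, new_language=False, multilingual=False,
                    inference_cost_critical=False, control_tokens_only=False):
    """Map a situation to the approach listed in the decision criteria.

    The precedence of the checks is an assumption made for this sketch.
    Returns None when no listed situation applies.
    """
    if control_tokens_only:
        return "Pure addition + random small init + SFT"
    if inference_cost_critical:
        return "Reserved-slot reuse"
    if multilingual:
        return "OFA / WECHSEL + continued pretraining"
    if new_language and num_new_tokens >= 10_000:
        return "Tokenizer merge + FOCUS / OFA init"
    if num_new_tokens < 200:
        return "Pure addition + mean init"
    return None


print(choose_approach(50))                         # Pure addition + mean init
print(choose_approach(20_000, new_language=True))  # Tokenizer merge + FOCUS / OFA init
```

Additions between 200 and 10K tokens are not covered by the criteria, so the function returns `None` for them instead of guessing.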

🔗 Graph

🤖 LLM usage

When to use: the base tokenizer measurably over-fragments the target language or domain, and a continued-pretraining budget of 1B+ tokens is available.

When not to use: LoRA is sufficient for a small fine-tuning task; domain coverage is already adequate (tokens_per_char < 0.5); or the vocab change would force a redeploy of the deployment / serving infrastructure.

Anti-patterns

  • Random init without continued PT: noise in the new token embeddings triggers catastrophic forgetting.
  • Forgetting the LM head: on a model with untied embeddings, updating only the input embedding leaves generation broken.
  • BOS / EOS collisions in a tokenizer merge: special-token IDs shift silently and corrupt inference.
  • Ignoring vocab-size padding: GPU efficiency is lost when vocab size % 64 != 0.
  • Skipping continued PT: deploying freshly initialized embeddings causes a hallucination spike.
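
Two of these anti-patterns can be caught with cheap checks before training starts. A minimal sketch in plain Python, assuming vocabularies are token-to-ID dicts; both helper names and the toy vocabs are hypothetical.

```python
def padded_vocab_size(vocab_size, multiple=64):
    """Round vocab_size up to the next multiple (64, per the padding bullet)."""
    return -(-vocab_size // multiple) * multiple


def shifted_special_tokens(base_vocab, merged_vocab, special_tokens):
    """Return the special tokens whose ID changed or disappeared after a merge."""
    return [t for t in special_tokens if merged_vocab.get(t) != base_vocab.get(t)]


print(padded_vocab_size(128_260))  # 128320

# Toy vocabs (made up): the merge silently moved </s> from ID 2 to ID 3
base_vocab = {"<s>": 1, "</s>": 2}
merged_vocab = {"<s>": 1, "</s>": 3, "ACE2": 2}
print(shifted_special_tokens(base_vocab, merged_vocab, ["<s>", "</s>"]))  # ['</s>']
```

A non-empty result from the second check means the merge should be redone before any embedding is initialized.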

🧪 Verification / duplicates

  • Verified (HuggingFace transformers docs, FOCUS paper Dobler & de Melo 2023, OFA Liu et al. 2024, Llama 3 tokenizer release notes).
  • Trust level: A.

🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: full content (NLP vocabulary expansion patterns / init strategies) |