---
id: wiki-2026-0508-tokenization-subword-processing
title: "Tokenization & Subword Processing"
category: 10_Wiki/Topics
status: needs_review
canonical_id: self
aliases: [P-Reinforce-AUTO-TKNP-001]
duplicate_of: none
source_trust_level: A
confidence_score: 1.0
tags: [auto-reinforced, tokenization, bpe, wordpiece, subword-tokenizer, nlp-preprocessing]
raw_sources: []
last_reinforced: 2026-05-04
github_commit: pending
inferred_by: Claude Opus 4.7 (auto-normalize 2026-05-08)
tech_stack:
  language: unspecified
  framework: unspecified
---

# [[Tokenization & Subword Processing|Tokenization & Subword Processing]]

## 📌 One-Line Insight (The Karpathy Summary)

> "The atomization of language: the process of breaking human sentences into numeric pieces (tokens) that a model can understand. The efficiency of this decomposition determines the model's intelligence, speed, and operating cost, which makes it the first gateway of AI."

## 📖 Structured Knowledge (Synthesized Content)

Tokenization is the process of splitting text into tokens, the smallest units a model can process.

1. **Main approaches**:
   * **BPE (Byte-Pair Encoding)**: Builds the token vocabulary by repeatedly merging the most frequent adjacent pair of symbols (characters or bytes). (The standard for GPT, Llama, and similar models.)
   * **WordPiece**: Similar to BPE, but selects each merge by how much it increases the language-model likelihood of the training data. (BERT family.)
   * **SentencePiece**: Requires no language-specific pre-tokenization. It trains directly on raw text, treating it as a plain character stream in which whitespace is an ordinary symbol (with optional byte fallback for unseen characters), which makes it strong for multilingual text and out-of-vocabulary (OOV) handling.
2. **Semantic units**:
   * Modern tokenizers use subword units rather than whole words. For example, "unhappiness" can be split into "un", "happi", and "ness", so the model can compose the meaning of the whole from its parts.
3.
   **Vocabulary size (Vocab Size)**:
   * If the vocabulary is too small, each sentence splits into too many tokens and compute efficiency drops; if it is too large, model parameters are wasted on rarely used tokens. It is usually set between 32k and 128k.

## ⚠️ Contradictions & Updates (Contradictions & Updates)

* **Multilingual imbalance**: English needs few tokens per word, whereas Korean and many other languages split into far more tokens for the same meaning, which raises cost and can degrade performance.
* **Sensitivity issue**: Small differences between tokenizers, such as how digit strings or special characters are split, can strongly affect a model's arithmetic ability and its handling of special characters.

## 🔗 Knowledge Links (Graph)

* **Parent concepts**: [[Natural Language Processing (NLP)|NLP]], [[Transformer Architecture|Transformer Architecture]]
* **Subsystems**: [[Tokenization Economics|Tokenization Economics]]
* **Related physical constraints**: [[Context Window & Long-Context LLMs|Context Window]], [[KV Cache|KV Cache]]

---
*Last updated: 2026-05-04*

## 🤖 LLM Usage Hints (How to Use This Knowledge)

**When to use this knowledge:**
- *(TODO)*

**When not to use it:**
- *(TODO)*

## 🧪 Validation Status (Validation)

- **Information status:** needs_review
- **Source trust level:** A
- **Review reason:** *(P-Reinforce Phase 1 automatic normalization. The body still needs verification.)*

## 🧬 Duplicate Check (Duplicate Check)

- **Existing similar documents:** *(TODO: see the indexer cluster report)*
- **Handling:** UPDATE (automatic normalization)
- **Reason:** Phase 1 normalization: old template updated and missing fields filled in.
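
## 🧩 Worked Sketch (BPE Merge Loop)

The BPE description under Synthesized Content (repeatedly merge the most frequent adjacent pair) can be illustrated with a toy, character-level sketch. This is not the implementation used by any particular model: production tokenizers typically work on bytes, apply pre-tokenization rules, and handle word boundaries, all of which are omitted here. The function names `train_bpe`, `merge_pair`, and `encode`, the toy corpus, and the lexicographic tie-breaking rule are all assumptions made for illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy character-level BPE)."""
    # Each distinct word starts as a tuple of single characters, weighted by frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken lexicographically (an assumption
        # made here only so that the result is deterministic).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Tokenize `word` by applying the learned merges in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus: word frequencies are expressed by repetition.
corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = train_bpe(corpus, num_merges=4)

print(merges)                         # [('s', 't'), ('e', 'st'), ('o', 'w'), ('l', 'ow')]
print(encode("lowest", merges))       # ['low', 'est']
print(encode("lowest", merges[:2]))   # ['l', 'o', 'w', 'est']
```

The last two lines also hint at the vocabulary-size trade-off: with only two merges (a smaller vocabulary) the unseen word "lowest" costs four tokens, while four merges bring it down to two.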
## 🕓 Change History (Changelog)

| Date | Change | Handling | Trust level |
|------|--------|----------|-------------|
| 2026-05-08 | P-Reinforce Phase 1 normalization (frontmatter + header standardization) | UPDATE | A |

## 💻 Code Patterns (Code Patterns)

**Pattern 1:** *(TODO: structural skeleton that reflects this project's conventions)*

```text
# TODO
```

## 🤔 Decision Criteria (Decision Criteria)

**When to choose option A:**
- *(TODO)*

**When to choose option B:**
- *(TODO)*

**Default:**
> *(TODO)*

## ❌ Anti-Patterns (Anti-Patterns)

- **[Anti-pattern]:** *(TODO: what not to do + why + what to do instead)*
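
## 🧮 Worked Arithmetic (Vocab Size and Byte Cost)

Two claims earlier in this note lend themselves to quick arithmetic: that an oversized vocabulary spends parameters, and that Korean text tends to cost more tokens than English. The sketch below is only a back-of-the-envelope aid. The helper names and the hidden size of 4096 are hypothetical, and the byte count shows only the starting granularity of a byte-level tokenizer; real token counts depend on the merges a given tokenizer has learned.

```python
def embedding_params(vocab_size: int, d_model: int, tied: bool = True) -> int:
    """Parameters in the token embedding table (doubled if the output layer is untied)."""
    table = vocab_size * d_model
    return table if tied else 2 * table


def utf8_bytes_per_char(text: str) -> float:
    """Average UTF-8 bytes per character, i.e. symbols per character before any merge."""
    return len(text.encode("utf-8")) / len(text)


D_MODEL = 4096  # hypothetical hidden size, chosen only for illustration

print(embedding_params(32_000, D_MODEL))    # 131072000
print(embedding_params(128_000, D_MODEL))   # 524288000

print(utf8_bytes_per_char("token"))  # 1.0
print(utf8_bytes_per_char("토큰"))    # 3.0
```

Under these assumptions, moving from a 32k to a 128k vocabulary adds roughly 393M embedding parameters, and a Hangul syllable starts out as three byte-level symbols where an ASCII letter starts as one.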