--- id: [[P-Reinforce|P-Reinforce]]-AUTO-TKNP-001 category: Unified confidence_score: 1.00 tags: [auto-reinforced, tokenization, bpe, wordpiece, subword-tokenizer, nlp-preprocessing] last_reinforced: 2026-05-04 --- # [[Tokenization & Subword Processing|Tokenization & Subword Processing]] ## ๐Ÿ“Œ ํ•œ ์ค„ ํ†ต์ฐฐ (The Karpathy Summary) > "์–ธ์–ด์˜ ์›์žํ™”: ์ธ๊ฐ„์˜ ๋ฌธ์žฅ์„ ๋ชจ๋ธ์ด ์ดํ•ดํ•  ์ˆ˜ ์žˆ๋Š” ์ˆซ์ž ์กฐ๊ฐ(Token)์œผ๋กœ ๋ถ„ํ•ดํ•˜๋Š” ๊ณผ์ •์ด๋ฉฐ, ์ด ๋ถ„ํ•ด ๋ฐฉ์‹์˜ ํšจ์œจ์„ฑ์ด ๋ชจ๋ธ์˜ ์ง€๋Šฅ, ์†๋„, ๊ทธ๋ฆฌ๊ณ  ์šด์˜ ๋น„์šฉ์„ ๊ฒฐ์ •์ง“๋Š” AI์˜ ์ฒซ ๋ฒˆ์งธ ๊ด€๋ฌธ." ## ๐Ÿ“– ๊ตฌ์กฐํ™”๋œ ์ง€์‹ (Synthesized Content) ํ† ํฐํ™”(Tokenization)๋Š” ํ…์ŠคํŠธ๋ฅผ ๋ชจ๋ธ์ด ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ๋Š” ์ตœ์†Œ ๋‹จ์œ„์ธ ํ† ํฐ์œผ๋กœ ๋‚˜๋ˆ„๋Š” ๊ณผ์ •์ž…๋‹ˆ๋‹ค. 1. **์ฃผ์š” ๋ฐฉ์‹**: * **BPE (Byte-Pair Encoding)**: ๊ฐ€์žฅ ๋นˆ๋ฒˆํ•˜๊ฒŒ ๋“ฑ์žฅํ•˜๋Š” ๋ฌธ์ž ์Œ์„ ๋ฐ˜๋ณต์ ์œผ๋กœ ๋ณ‘ํ•ฉํ•˜์—ฌ ํ† ํฐ ์‚ฌ์ „์„ ๊ตฌ์ถ•ํ•ฉ๋‹ˆ๋‹ค. (GPT, Llama ๋“ฑ ํ‘œ์ค€) * **WordPiece**: BPE์™€ ์œ ์‚ฌํ•˜๋‚˜, ๋ณ‘ํ•ฉ ์‹œ ์–ธ์–ด ๋ชจ๋ธ์˜ ์šฐ๋„(Likelihood) ์ฆ๊ฐ€๋Ÿ‰์„ ๊ธฐ์ค€์œผ๋กœ ์„ ํƒํ•ฉ๋‹ˆ๋‹ค. (BERT ๊ณ„์—ด) * **SentencePiece**: ์‚ฌ์ „ ํ›ˆ๋ จ ์—†์ด ํ…์ŠคํŠธ ์ „๋ฐ˜์„ ๋ฐ”์ดํŠธ ์ŠคํŠธ๋ฆผ์œผ๋กœ ์ฒ˜๋ฆฌํ•˜์—ฌ ๋‹ค๊ตญ์–ด ๋ฐ ๋ฏธ๋“ฑ๋ก์–ด(OOV) ๋Œ€์‘์— ๊ฐ•์ ์ด ์žˆ์Šต๋‹ˆ๋‹ค. 2. **์˜๋ฏธ์  ๋‹จ์œ„**: * ํ˜„๋Œ€ ํ† ํฌ๋‚˜์ด์ €๋Š” ๋‹จ์–ด ์ „์ฒด๊ฐ€ ์•„๋‹Œ 'ํ•˜์œ„ ๋‹จ์–ด(Subword)' ๋‹จ์œ„๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด "unhappiness"๋ฅผ "un", "happi", "ness"๋กœ ๋‚˜๋ˆ„์–ด ๊ฐ ๋ถ€๋ถ„์˜ ์˜๋ฏธ๋ฅผ ์กฐํ•ฉํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค. 3. **ํ† ํฐ ์‚ฌ์ „ ํฌ๊ธฐ (Vocab Size)**: * ์‚ฌ์ „์ด ๋„ˆ๋ฌด ์ž‘์œผ๋ฉด ๋ฌธ์žฅ์ด ๋„ˆ๋ฌด ๋งŽ์€ ํ† ํฐ์œผ๋กœ ์ชผ๊ฐœ์ ธ ์—ฐ์‚ฐ ํšจ์œจ์ด ๋–จ์–ด์ง€๊ณ , ๋„ˆ๋ฌด ํฌ๋ฉด ๋ชจ๋ธ์˜ ํŒŒ๋ผ๋ฏธํ„ฐ๊ฐ€ ๋‚ญ๋น„๋ฉ๋‹ˆ๋‹ค. ๋ณดํ†ต 32k ~ 128k ์‚ฌ์ด์—์„œ ๊ฒฐ์ •๋ฉ๋‹ˆ๋‹ค. ## โš–๏ธ Trade-offs & Caveats * **๋‹ค๊ตญ์–ด ๋ถˆ๊ท ํ˜•**: ์˜์–ด๋Š” ๋‹จ์–ด๋‹น ํ† ํฐ ์ˆ˜๊ฐ€ ์ ์ง€๋งŒ, ํ•œ๊ตญ์–ด๋‚˜ ๋‹ค๋ฅธ ์–ธ์–ด๋Š” ๋™์ผํ•œ ์˜๋ฏธ๋ผ๋„ ํ›จ์”ฌ ๋งŽ์€ ํ† ํฐ์œผ๋กœ ์ชผ๊ฐœ์ ธ ๋น„์šฉ์ด ๋น„์‹ธ๊ณ  ์„ฑ๋Šฅ์ด ์ €ํ•˜๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. * **๋น„๊ฒฐ์ •๋ก ์  ์ด์Šˆ**: ํ† ํฌ๋‚˜์ด์ €์˜ ์‚ฌ์†Œํ•œ ์ฐจ์ด๊ฐ€ ๋ชจ๋ธ์˜ ์‚ฐ์ˆ  ์—ฐ์‚ฐ ๋Šฅ๋ ฅ์ด๋‚˜ ํŠน์ˆ˜ ๋ฌธ์ž ์ฒ˜๋ฆฌ ๋Šฅ๋ ฅ์— ํฐ ์˜ํ–ฅ์„ ๋ฏธ์น  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ## ๐Ÿ”— ์ง€์‹ ์—ฐ๊ฒฐ (Graph) * **์ƒ์œ„ ๊ฐœ๋…**: [[Natural Language Processing (NLP)|NLP]], [[Transformer Architecture|Transformer Architecture]] * **ํ•˜์œ„ ์‹œ์Šคํ…œ**: [[Tokenization Economics|Tokenization Economics]] * **์—ฐ๊ด€ ๋ฌผ๋ฆฌ ์ œ์•ฝ**: [[Context Window & Long-Context LLMs|Context Window]], [[KV Cache|KV Cache]] --- *Last updated: 2026-05-04*