---
id: [[P-Reinforce|P-Reinforce]]-AUTO-CHKP-001
category: Unified
confidence_score: 1.00
tags: [auto-reinforced, chunking, data-preprocessing, rag-optimization, context-window]
last_reinforced: 2026-05-04
---
# [[Chunking & Pre-processing|Chunking & Pre-processing]]

## 📌 One-Line Insight (The Karpathy Summary)
> "Slicing knowledge: the invisible groundwork that determines RAG retrieval quality, by splitting large documents into pieces sized for the model to digest and linking them carefully so that context is not broken."

## 📖 Structured Knowledge (Synthesized Content)
Chunking is the process of splitting large documents into smaller units that are easier to retrieve and reason over.

1. **Chunking strategies**:
    * **Fixed-size Chunking**: Splits by a fixed number of characters or tokens. It is fast, but carries a high risk of breaking context, for example by cutting a sentence in the middle.
    * **Recursive Character Chunking**: Splits by trying separators in priority order (paragraph, then sentence, then word), so the logical structure of the text is preserved where possible.
    * **Semantic Chunking**: Measures the semantic similarity between sentences and splits the document where the topic changes.
    * **Agentic Chunking**: An agent reads the document, judges where the units of meaning lie, and splits at the points it considers best.
2. **Pre-processing**:
    * **Cleaning**: Removes unnecessary special characters, HTML tags, and duplicate text.
    * **Metadata injection**: Tags each chunk with a title, summary, source, related keywords, and similar fields to improve retrieval efficiency.
3. **Overlap**:
    * Adjacent chunks share a portion of their text (e.g., 10% overlap), so that the context of a sentence cut at a boundary is preserved in both chunks.

## ⚖️ Trade-offs & Caveats
* **Chunk size dilemma**: If chunks are too small, they lack context; if they are too large, retrieval results become noisy and the model's context window is wasted.
* **Compute cost**: Semantic Chunking and Agentic Chunking require model calls (embedding calls for the former, LLM calls for the latter), so processing cost and time increase.

## 🔗 Knowledge Links (Graph)
* **Parent system**: [[Retrieval-Augmented Generation (RAG)|Retrieval-Augmented Generation (RAG)]]
* **Subsystems**: [[Vector Databases & Search|Vector Databases & Search]], [[Embedding Models & MRL|Embedding Models & MRL]]
* **Related phenomena**: [[Lost in the middle|Lost in the middle]]

---
*Last updated: 2026-05-04*
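
The fixed-size strategy and the overlap idea described in this note can be sketched together in a few lines. This is only a minimal illustration under stated assumptions: it counts characters rather than tokens, and the function name `chunk_fixed` and its parameters `chunk_size` and `overlap_ratio` are hypothetical names chosen for this example, not part of any library.

```python
def chunk_fixed(text: str, chunk_size: int = 200, overlap_ratio: float = 0.1) -> list[str]:
    """Split text into fixed-size character chunks that overlap by overlap_ratio.

    Hypothetical helper for illustration; a real pipeline would likely
    count tokens with the tokenizer of the target model instead.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")

    overlap = int(chunk_size * overlap_ratio)  # characters shared by neighbours
    step = chunk_size - overlap                # how far the window advances (>= 1)

    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text,
        # so no trailing chunk consists only of overlapped characters.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    sample = "abcdefghijklmnopqrstuvwxyz"
    for piece in chunk_fixed(sample, chunk_size=10, overlap_ratio=0.2):
        print(piece)
```

Because the window advances by `chunk_size - overlap` characters, the tail of each chunk reappears at the head of the next one, which is what keeps a sentence cut at a boundary visible in both chunks. A larger `overlap_ratio` preserves more context at the cost of storing and embedding more duplicated text.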