---
id: DATA-PRE-001
category: "10_Wiki/💡 Topics/AI"
confidence_score: 1.0
tags: [data-science, ai, machine-learning, preprocessing, data-cleaning, feature-engineering, normalization]
last_reinforced: 2026-04-26
---

# Pre-processing Data for AI

## 📌 One-Line Insight (The Karpathy Summary)
> "Do not trust raw data as it is; refine and standardize it into the form an intelligence can digest most easily, and unlock the model's potential." Preprocessing is the first-priority step of every AI workflow: raw data that is unsuited to analysis or training is processed to raise data quality and optimize training efficiency.

## 📖 Structured Knowledge (Synthesized Content)
- **Extracted pattern:** "Data Cleansing and Structural Alignment": fill in incomplete records (missing values), handle extreme values (outliers), and bring numbers measured in different units onto the same range (scaling) so that the model is not dominated by any single variable.
- **Main work stages:**
  - **Cleaning:** typo correction, missing-value handling (imputation), duplicate removal.
  - **Transformation:** normalization, standardization, log transformation.
  - **Reduction:** dimensionality reduction (PCA), feature selection.
  - **Discretization:** converting continuous variables into categorical ones.
- **Significance:** Preprocessing is the core labor of data science, often cited as accounting for 80% or more of the total work, and the most practical quality-control step, since it determines the lower bound on model performance.

## ⚠️ Contradictions and Updates (Contradictions & RL Update)
- **Conflict with past data:** Preprocessing used to rely on rules that people wrote by hand, one at a time. Auto-Preprocessing techniques, which learn and optimize the preprocessing process itself, and Data Observability tools, which validate data automatically, are now being adopted as essential.
- **Policy change:** When importing external raw wiki data into the system, the Antigravity project runs a dedicated NLP preprocessing engine that strips unnecessary markup and special symbols from the text and restructures it in the Karpathy style.

## 🔗 Knowledge Links (Graph)
- [[Normalization-Strategies|Normalization-Strategies]], [[Outlier-Detection-Techniques|Outlier-Detection-Techniques]], [[One-Hot-Encoding|One-Hot-Encoding]], [[Exploratory-Data-Analysis|Exploratory-Data-Analysis]]
- **Raw Source:** 10_Wiki/Topics/AI/Pre-processing-Data-for-AI.md
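
As a minimal sketch of the Cleaning, Transformation, and Discretization stages listed under Synthesized Content, the standard-library Python below shows one plausible way to implement each step. The function names, the sample values, the 1.5 × IQR clipping rule, and the bin edges are illustrative assumptions, not something this note prescribes. The Reduction stage (PCA, feature selection) is left out because it needs a linear-algebra library.

```python
import bisect
import math
import statistics


def drop_duplicates(rows):
    """Cleaning: remove repeated rows while keeping first-seen order."""
    return list(dict.fromkeys(rows))


def impute_median(values):
    """Cleaning: replace missing entries (None) with the median of observed ones."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def clip_outliers_iqr(values, k=1.5):
    """Cleaning: clip values that fall outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]


def min_max_normalize(values):
    """Transformation: rescale values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Transformation: shift to zero mean and scale to unit (population) std."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def log_transform(values):
    """Transformation: compress large values with log(1 + x); needs x > -1."""
    return [math.log1p(v) for v in values]


def discretize(values, edges, labels):
    """Discretization: map each value to the label of the bin it falls in.

    `edges` must be sorted and `labels` must have len(edges) + 1 entries.
    """
    return [labels[bisect.bisect_right(edges, v)] for v in values]


# Hypothetical feature column with one missing value and one extreme value.
raw = [32.0, 41.0, None, 38.0, 45.0, 36.0, 40.0, 43.0, 900.0]

imputed = impute_median(raw)
clipped = clip_outliers_iqr(imputed)
scaled = min_max_normalize(clipped)
z_scores = standardize(clipped)
logged = log_transform(clipped)
binned = discretize(clipped, edges=[35.0, 42.0], labels=["low", "mid", "high"])
```

The order here (impute, then clip, then scale) is a design choice: scaling before outlier handling would let the single extreme value squeeze every other point toward zero, which is the "dominated by one variable" failure the extracted pattern warns about. In a real pipeline, the statistics (median, quantiles, min/max, mean/std) would be computed on training data only and reused on validation and test data.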