---
id: wiki-2026-0508-data-cleaning-algorithms
title: Data Cleaning Algorithms
category: 10_Wiki/Topics
status: needs_review
canonical_id: self
aliases: [P-Reinforce-AUTO-DCAL-001]
duplicate_of: none
source_trust_level: A
confidence_score: 0.92
tags: [auto-reinforced, data-cleaning, data-preProcessing, algorithms, outliers, duplicate-detection]
raw_sources: []
last_reinforced: 2026-04-20
github_commit: pending
inferred_by: Claude Opus 4.7 (auto-normalize 2026-05-08)
tech_stack:
  language: unspecified
  framework: unspecified
---

# [[Data Cleaning Algorithms|Data Cleaning Algorithms]]

## 📌 One-Line Insight (The Karpathy Summary)

> "Filtering knowledge: to break the curse of 'Garbage In, Garbage Out', data cleaning is an intellectual washing process that automatically identifies and corrects noise, duplicates, and errors in the data, so that the AI learns only the 'essence'."

## 📖 Structured Knowledge (Synthesized Content)

Data cleaning algorithms are techniques that correct errors and enforce consistency in order to raise the quality of a dataset.

1. **Main tasks and algorithms**:
    * **Missing Value Imputation**: fill in missing values with the mean or mode, or with a KNN or regression model.
    * **Outlier Detection**: remove values that fall far outside the normal range, using Z-Score, Isolation Forest, and similar methods. (Linked to [[Anomaly-Detection|Anomaly-Detection]].)
    * **Deduplication**: remove overlapping records using hash matching or edit distance (Levenshtein Distance).
    * **Standardization**: unify units and formats (for example, a single date format).
2. **Why does it matter?**:
    * Data cleaning is commonly said to take up most of an AI project's time (a figure of about 80% is often quoted), and it is the most hands-on area that sets the upper bound on model performance.
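
The four tasks listed above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not this project's convention. The function names, the Z-score threshold of 3.0, the `max_distance` default, and the list of accepted date formats are all assumptions made for the example.

```python
from datetime import datetime
from statistics import mean, pstdev


def impute_mean(values):
    """Missing value imputation: replace None with the mean of observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def zscore_outliers(values, threshold=3.0):
    """Outlier detection: indices whose absolute Z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def deduplicate(strings, max_distance=1):
    """Deduplication: keep the first of each exact or near-duplicate group.

    Exact matches are caught by a hash-set lookup on the normalized key.
    Near matches are caught by edit distance, which is quadratic in the
    number of kept keys, so this is only suitable for small inputs.
    """
    kept = []
    seen = set()
    for s in strings:
        key = s.strip().lower()
        if key in seen:
            continue
        if any(levenshtein(key, k) <= max_distance for k in seen):
            continue
        seen.add(key)
        kept.append(s)
    return kept


def standardize_date(text, formats=("%Y-%m-%d", "%d/%m/%Y", "%Y.%m.%d")):
    """Standardization: parse any accepted format and return ISO 8601."""
    for fmt in formats:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")
```

For example, `deduplicate(["Alice Kim", "alice kim", "Alice Kin", "Bob Lee"])` keeps only `"Alice Kim"` and `"Bob Lee"`, and `standardize_date("08/05/2026")` returns `"2026-05-08"`. On real datasets the libraries named in this note would likely replace these helpers, for instance Scikit-learn's `SimpleImputer`, `KNNImputer`, and `IsolationForest`, or Pandas' `drop_duplicates` and `to_datetime`.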

## ⚠️ Contradictions & Updates

- **Conflict with earlier data**: The old policy was for people to clean data by eye in Excel. The current policy is automated and combines "probabilistic data cleaning", which processes billions of records directly, with "using AI to clean data for AI" (RL Update).
- **Policy change (RL Update)**: When training large language models, selecting high-quality data with an "intelligent classifier" that filters out low-quality web text has become a core, closely guarded policy that determines model performance.

## 🔗 Knowledge Links (Graph)

- [[Anomaly-Detection|Anomaly-Detection]], [[Statistics & Data Analysis|Statistics & Data Analysis]], [[Optimization|Optimization]], [[Quality Gates|Quality Gates]], [[Signal in Noise|Signal in Noise]]
- **Modern Tech/Tools**: Pandas, Scikit-learn, Great Expectations, DVC.

---

## 🤖 LLM Usage Hints (How to Use This Knowledge)

**When to use this knowledge:**
- *(TODO)*

**When not to use it:**
- *(TODO)*

## 🧪 Validation Status (Validation)

- **Information status:** needs_review
- **Source trust level:** A
- **Reason for review:** *(P-Reinforce Phase 1 automatic normalization. The body still needs verification.)*

## 🧬 Duplicate Check

- **Existing similar documents:** *(TODO: see the indexer cluster report)*
- **Handling:** UPDATE (automatic normalization)
- **Reason:** Phase 1 normalization: old template updated and missing fields filled in.

## 🕓 Changelog

| Date | Change | Handling | Trust level |
|------|--------|----------|-------------|
| 2026-05-08 | P-Reinforce Phase 1 normalization (frontmatter + header standardization) | UPDATE | A |

## 💻 Code Patterns

**Pattern 1:** *(TODO: structural skeleton that reflects this project's conventions)*

```text
# TODO
```

## 🤔 Decision Criteria

**When to choose option A:**
- *(TODO)*

**When to choose option B:**
- *(TODO)*

**Default:**
> *(TODO)*

## ❌ Anti-Patterns

- **[Anti-pattern]:** *(TODO: what not to do + why + what to do instead)*