---
id: wiki-2026-0508-data-cleaning
title: Data Cleaning Algorithms
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [data cleaning, data quality, imputation, outlier detection, deduplication, Great Expectations, dvc]
duplicate_of: none
source_trust_level: A
confidence_score: 0.93
verification_status: applied
tags: [data-quality, cleaning, imputation, outlier, deduplication, mlops, great-expectations, llm-data-curation]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: Pandas / scikit-learn / Great Expectations / DVC
---

# Data Cleaning

## One-line summary

> **"Prevent garbage in, garbage out."** Data cleaning is often cited as taking around 80% of the time in an AI project. The core work is imputation, outlier handling, deduplication, and standardization. Modern practice adds LLM-aided cleaning, data quality frameworks (Great Expectations, GE), and quality classifiers for LLM pretraining data.

## Core tasks

### Missing value
- **Drop**: when only a small fraction of rows is affected.
- **Mean / median / mode**: simple fill.
- **KNN imputation**: fill from the nearest neighbors.
- **Regression imputation**: predict the missing value with a model.
- **MICE** (Multiple Imputation by Chained Equations).
- **Indicator + impute**: add a missing-flag column before imputing, for when the missingness pattern itself is informative.

### Outlier
- **Z-score**: univariate.
- **IQR**: robust univariate.
- **Isolation Forest**: multivariate.
- **DBSCAN / LOF**: density-based.
- **Autoencoder**: reconstruction error.
- **Domain-specific rules**.

### Deduplication
- **Exact**: hash.
- **Fuzzy**: Levenshtein, MinHash, SimHash.
- **Record linkage**: multi-field matching.
- **dedupe.io**: ML-based.

### Standardization
- Date format.
- Units (m / ft, kg / lb).
- Case / encoding (UTF-8 normalization).
- Categorical values (lowercase, strip).

### Validation
- **Schema** (Pandera, Pydantic).
- **Statistical** (Great Expectations).
- **Constraint** (dbt tests).

### LLM-specific data curation
- **Quality classifier**: grades web text.
- **Toxicity / PII filter**.
- **Deduplication** (MinHash at trillion-token scale).
- **Quality decile sampling** (Llama-3 trick).
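
The standardization items listed under Core tasks (date format, units, case and encoding) can be sketched with the Python standard library alone. This is a minimal sketch under stated assumptions: the conversion table `TO_KG`, the accepted `DATE_FORMATS` (including the day-first reading of slash dates), and the function names are illustrative choices, not part of any library.

```python
import unicodedata
from datetime import datetime

# Hypothetical conversion table; extend it with the units your data uses.
TO_KG = {"kg": 1.0, "lb": 0.45359237, "g": 0.001}

# Hypothetical list of accepted input formats, tried in order.
# Assumption: slash dates with the year last are day-first (DD/MM/YYYY).
DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%d/%m/%Y")


def normalize_text(value: str) -> str:
    """Unicode-normalize (NFKC), trim, collapse inner whitespace, casefold."""
    value = unicodedata.normalize("NFKC", value)
    return " ".join(value.split()).casefold()


def to_kilograms(amount: float, unit: str) -> float:
    """Convert a weight to kilograms; raises KeyError on an unknown unit."""
    return amount * TO_KG[normalize_text(unit)]


def to_iso_date(raw: str) -> str:
    """Parse a date in any of DATE_FORMATS and return it as YYYY-MM-DD."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw!r}")
```

Raising on unknown units and unparseable dates, rather than passing values through, keeps bad records visible so that they can be routed to a reject table instead of silently entering the cleaned set.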

### Modern stack
- **Pandas** + scikit-learn.
- **Great Expectations**: data quality framework.
- **DVC** (Data Version Control).
- **Dagster / Airflow**: pipeline orchestration.
- **dbt**: SQL transformation.
- **Pandera**: schema validation.

## 💻 Patterns

### Missing value imputation
```python
import pandas as pd
# IterativeImputer is experimental: this import must run before importing it.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer

df = pd.read_csv('data.csv')
num_cols = df.select_dtypes(include='number').columns

# Simple: mean for numeric columns, then mode for whatever is still missing
# (in practice, the categorical columns)
df_simple = df.copy()
df_simple[num_cols] = df_simple[num_cols].fillna(df_simple[num_cols].mean())
df_simple = df_simple.fillna(df_simple.mode().iloc[0])

# KNN (numeric columns only; the imputers do not accept strings)
knn = KNNImputer(n_neighbors=5)
df_knn = pd.DataFrame(knn.fit_transform(df[num_cols]), columns=num_cols, index=df.index)

# MICE-style (multivariate)
mice = IterativeImputer(max_iter=10, random_state=42)
df_mice = pd.DataFrame(mice.fit_transform(df[num_cols]), columns=num_cols, index=df.index)
```

### Outlier detection (Isolation Forest)
```python
from sklearn.ensemble import IsolationForest

# X: numeric feature matrix (NumPy array) with no missing values
iso = IsolationForest(contamination=0.05, random_state=42)
labels = iso.fit_predict(X)  # -1 = outlier, 1 = inlier
clean = X[labels == 1]
```

### IQR (univariate)
```python
def iqr_outlier_remove(df, col, k=1.5):
    """Keep rows whose value in `col` lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    Q1, Q3 = df[col].quantile(0.25), df[col].quantile(0.75)
    IQR = Q3 - Q1
    return df[(df[col] >= Q1 - k * IQR) & (df[col] <= Q3 + k * IQR)]
```

### Deduplication (MinHash for huge data)
```python
from datasketch import MinHash, MinHashLSH

def get_minhash(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for word in set(text.split()):  # word-level tokens as the shingle set
        m.update(word.encode('utf-8'))
    return m

# corpus: iterable of document strings
lsh = MinHashLSH(threshold=0.8, num_perm=128)
unique_docs = []
for i, doc in enumerate(corpus):
    m = get_minhash(doc)
    if not lsh.query(m):  # no indexed document is a likely near-duplicate
        lsh.insert(f'doc{i}', m)
        unique_docs.append(doc)
```

### Fuzzy match (Levenshtein)
```python
from rapidfuzz import fuzz, process

# candidates: list of strings to match against.
# Note: fuzz.ratio is a normalized Indel similarity (0-100), a close relative
# of Levenshtein; see rapidfuzz.distance.Levenshtein for the strict metric.
matches = process.extract('John Smith', candidates, scorer=fuzz.ratio, limit=5)
# Each match is a (choice, score, index) tuple.
```

### Schema validation (Pandera)
```python
import pandera as pa

schema = pa.DataFrameSchema({
    'user_id': pa.Column(int, checks=pa.Check.greater_than(0)),
    'email': pa.Column(str, checks=pa.Check.str_matches(r'.+@.+\..+')),
    'age': pa.Column(int, checks=[
        pa.Check.greater_than_or_equal_to(0),
        pa.Check.less_than(150),
    ]),
})

# Raises a SchemaError on the first failing check; returns the DataFrame otherwise.
clean_df = schema.validate(df)
```

### Great Expectations (data quality framework)
```python
import great_expectations as ge

# Validator-style API from pre-1.0 releases; method names differ between
# GE versions, so check them against the installed version.
context = ge.get_context()
suite = context.create_expectation_suite('user_data')

batch = context.get_validator(
    batch_request=...,
    expectation_suite=suite,
)

batch.expect_column_values_to_not_be_null('user_id')
batch.expect_column_values_to_be_unique('user_id')
batch.expect_column_values_to_be_between('age', 0, 150)
batch.expect_column_value_lengths_to_be_between('email', 5, 100)

results = batch.validate()
```

### LLM data curation (quality classifier)
```python
def quality_classifier(text, threshold=0.5):
    """Heuristic quality grade for web text; returns True to keep the document."""
    # detect_language: placeholder for a language-ID helper (not defined here)
    features = {
        'length_ok': 100 < len(text) < 10000,
        'punctuation_ratio': sum(c in '.,!?' for c in text) / max(len(text), 1),
        'caps_ratio': sum(c.isupper() for c in text) / max(len(text), 1),
        'has_url_spam': sum(1 for w in text.split() if w.startswith('http')) > 5,
        'language_ok': detect_language(text) == 'en',
    }
    # Weighted sum of boolean signals; the weights add up to 1.0
    score = (
        features['length_ok'] * 0.3 +
        (0.005 < features['punctuation_ratio'] < 0.05) * 0.2 +
        (features['caps_ratio'] < 0.3) * 0.2 +
        (not features['has_url_spam']) * 0.2 +
        features['language_ok'] * 0.1
    )
    return score > threshold
```

### PII detection / removal
```python
import re

# Heuristic patterns: expect false positives (for example, ISO dates match the
# phone pattern). Order matters: the specific SSN and card patterns run before
# the broad phone pattern, which would otherwise swallow them.
PATTERNS = {
    'email': r'[\w.+-]+@[\w-]+\.[\w.-]+',
    'ssn': r'\b\d{3}-\d{2}-\d{4}\b',
    'credit_card': r'\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b',
    'phone': r'\+?\d[\d\s().-]{8,}\d',
}

def redact_pii(text):
    for name, pattern in PATTERNS.items():
        text = re.sub(pattern, f'[{name.upper()}_REDACTED]', text)
    return text
```

### LLM-aided cleaning
```python
def llm_clean(text, instruction='Fix typos and standardize format'):
    # llm: placeholder for an LLM client exposing generate(prompt) -> str
    return llm.generate(f"""{instruction}. Return only the cleaned text, no explanation.
Input: {text}""")
```

### Data version control (DVC)
```bash
# Track the raw input. The cleaned file is tracked as the stage output below,
# so it must not also be passed to `dvc add`.
dvc add data/raw/users.csv

# Define the cleaning step as a pipeline stage, then run it.
# `dvc run` is deprecated in favor of `dvc stage add` + `dvc repro` (removed in DVC 3).
dvc stage add -n clean_users \
    -d clean_users.py \
    -d data/raw/users.csv \
    -o data/cleaned/users.parquet \
    python clean_users.py
dvc repro
```

### Pipeline (Airflow / Dagster)
```python
from dagster import asset

# Dagster version shown. remove_duplicates, impute_missing, remove_outliers and
# standardize_format are project-specific helpers; schema is the Pandera schema.
@asset
def cleaned_users(raw_users):
    df = raw_users
    df = remove_duplicates(df)
    df = impute_missing(df)
    df = remove_outliers(df)
    df = standardize_format(df)
    return schema.validate(df)  # raises if the cleaned frame violates the schema
```

## Decision criteria

| Situation | Approach |
|---|---|
| Missing < 5% | Drop or simple fill |
| Missing 5-30% | KNN / MICE |
| Outlier, univariate | IQR / Z-score |
| Outlier, multivariate | Isolation Forest |
| Exact duplicate | Hash |
| Near duplicate (text) | MinHash + LSH |
| Schema | Pandera / GE |
| LLM data | Quality classifier + dedup + PII filter |

**Default**: Pandas + GE schema + Isolation Forest + MinHash dedup.

## 🔗 Graph

- Parent: [[MLOps]] · [[Data-Engineering]] · [[Statistics]]
- Variants: [[Imputation]] · [[Outlier-Detection]] · [[Deduplication]] · [[Standardization]]
- Applications: [[Great-Expectations]] · [[DVC]]
- Adjacent: [[Concept-Drift]] · [[Algorithmic-Fairness]] · [[Bias-Correction-Algorithm]] · [[CV_Synthesis]]

## 🤖 LLM usage

**When**: ETL pipelines, ML data preparation, curating LLM training data.
**When not**: sources that are already clean or already validated.

## ❌ Anti-patterns

- **No schema**: drift goes unnoticed.
- **Always dropping missing values**: missingness can be informative.
- **Single-method outlier detection**: false positives.
- **No versioning**: results cannot be reproduced.
- **Leaking PII**: GDPR violation.
- **Raw → train (no cleaning)**: the model learns the garbage.

## 🧪 Verification / duplicates

- Verified (Pandas docs, scikit-learn, Great Expectations, Llama-3 paper data curation).
- Trust level: A.
- Related: [[Concept-Drift]] · [[Algorithmic-Fairness]] · [[Bias-Correction-Algorithm]] · [[Algorithmic-Biology]].

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: tasks + KNN / MICE / Isolation Forest / MinHash / GE / PII code |