---
id: wiki-2026-0508-data-cleaning
title: Data Cleaning Algorithms
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [data cleaning, data quality, imputation, outlier detection, deduplication, Great Expectations, dvc]
duplicate_of: none
source_trust_level: A
confidence_score: 0.93
verification_status: applied
tags: [data-quality, cleaning, imputation, outlier, deduplication, mlops, great-expectations, llm-data-curation]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: Pandas / scikit-learn / Great Expectations / DVC
---

# Data Cleaning

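## One-line summary
> **"Preventing garbage in, garbage out."** Data cleaning is often cited as taking up to 80% of AI project time. The core tasks are imputation, outlier handling, deduplication, and standardization. Modern practice adds LLM-aided cleaning, data quality frameworks such as Great Expectations (GE), and quality classifiers for LLM pretraining data.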

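## Core tasks

### Missing value
- **Drop**: acceptable when only a small fraction of rows is missing.
- **Mean / median / mode**: simple single-value fill.
- **KNN imputation**: fills from the nearest neighbors.
- **Regression imputation**: predicts the missing value with a model fitted on the other columns.
- **MICE** (Multiple Imputation by Chained Equations).
- **Indicator + impute**: add a missingness flag when the missingness pattern itself is informative.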
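
The "Indicator + impute" item can be sketched in plain Python as follows. This is a minimal illustration, not a library API: the function name `impute_with_indicator` and the sample `ages` column are assumptions made for this example, and the median is one of several reasonable fill values.

```python
from statistics import median

def impute_with_indicator(values):
    """Fill None entries with the median of the observed values.

    Returns (filled_values, was_missing), where was_missing is a 0/1 flag
    per row that a downstream model can use as an extra feature.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = median(observed)
    filled = [fill if v is None else v for v in values]
    was_missing = [1 if v is None else 0 for v in values]
    return filled, was_missing

# Hypothetical column with two missing entries
ages = [34, None, 29, 41, None, 38]
filled, flags = impute_with_indicator(ages)
```

Keeping the flag next to the filled value lets a model tell "observed 36" apart from "imputed 36", which matters when values are not missing at random.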

### Outlier
- **Z-score**: univariate.
- **IQR**: robust univariate.
- **Isolation Forest**: multivariate.
- **DBSCAN / LOF**: density-based.
- **Autoencoder**: reconstruction error.
- **Domain-specific rule**.
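
For the Z-score item, a minimal standard-library sketch is shown below. The function name `zscore_outliers`, the default threshold of 3.0, and the sample `readings` are assumptions for illustration.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=3.0):
    """Return indices of values whose |z-score| exceeds threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, nothing to flag
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]

# Hypothetical sensor readings: 19 normal values and one spike
readings = [10.0] * 19 + [100.0]
flagged = zscore_outliers(readings)
```

The mean and standard deviation are themselves inflated by the outliers being searched for, which is why IQR is listed as the robust alternative. With the population standard deviation, no point can reach |z| > 3 when there are 10 or fewer values, so the threshold needs adjusting for very small samples.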

### Deduplication
- **Exact**: hash-based.
- **Fuzzy**: Levenshtein, MinHash, SimHash.
- **Record linkage**: matching across multiple fields.
- **dedupe.io**: ML-based.
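
Exact, hash-based deduplication can be sketched like this. The helper names `record_key` and `drop_exact_duplicates`, the strip-and-lowercase normalization, and the sample `users` records are all assumptions for this example; what counts as "the same record" is a project decision.

```python
import hashlib

def record_key(record, fields):
    """Hash the chosen fields after light normalization (strip + lowercase)."""
    # The unit separator avoids collisions such as ("ab", "c") vs ("a", "bc")
    normalized = "\x1f".join(str(record.get(f, "")).strip().lower() for f in fields)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def drop_exact_duplicates(records, fields):
    """Keep the first record for each distinct key, preserving input order."""
    seen = set()
    unique = []
    for rec in records:
        key = record_key(rec, fields)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

# Hypothetical records; the second differs only in case and whitespace
users = [
    {"name": "Ada Lovelace", "email": "ada@example.com"},
    {"name": "ada lovelace ", "email": "ADA@example.com"},
    {"name": "Alan Turing", "email": "alan@example.com"},
]
deduped = drop_exact_duplicates(users, fields=["name", "email"])
```

Storing only the digest keeps memory per record constant, which is the main reason to hash rather than compare full rows.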

### Standardization
- Date formats.
- Units (m / ft, kg / lb).
- Case / encoding (UTF-8 normalization).
- Categorical values (lowercase, strip whitespace).
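
The four standardization items can be sketched with the standard library alone. The accepted `DATE_FORMATS` (including the day-first reading of `dd/mm/yyyy`), the choice of NFC as the Unicode normal form, and the function names are assumptions for this example.

```python
import unicodedata
from datetime import datetime

# Assumed input formats; slash-separated dates are read as day-first here
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")

# Conversion factors to SI units (both are exact by definition)
TO_SI = {"m": ("m", 1.0), "ft": ("m", 0.3048),
         "kg": ("kg", 1.0), "lb": ("kg", 0.45359237)}

def standardize_date(raw):
    """Try each known format and return an ISO 8601 date string."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw!r}")

def to_si(value, unit):
    """Convert a (value, unit) pair to its SI equivalent."""
    si_unit, factor = TO_SI[unit.strip().lower()]
    return value * factor, si_unit

def standardize_text(raw):
    """Unicode-normalize (NFC), strip whitespace, and lowercase a category label."""
    return unicodedata.normalize("NFC", raw).strip().lower()
```

Raising on an unrecognized date, rather than guessing, keeps ambiguous inputs visible instead of silently producing a wrong date.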

### Validation
- **Schema** (Pandera, Pydantic).
- **Statistical** (Great Expectations).
- **Constraint** (dbt tests).

### LLM-specific data curation
- **Quality classifier**: grades web text.
- **Toxicity / PII filter**.
- **Deduplication** (MinHash for trillion tokens).
- **Quality decile sampling** (Llama-3 trick).

### Modern stack
- **Pandas** + scikit-learn.
- **Great Expectations**: data quality framework.
- **DVC** (Data Version Control).
- **Dagster / Airflow**: pipeline orchestration.
- **dbt**: SQL transformation.
- **Pandera**: schema validation.

## 💻 Patterns

### Missing value imputation
```python
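import pandas as pd
# The experimental flag must be imported before IterativeImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import KNNImputer, IterativeImputer

df = pd.read_csv('data.csv')
num_cols = df.select_dtypes(include='number').columns
cat_cols = df.columns.difference(num_cols)

# Simple: mean for numeric columns, mode for categorical columns
df_simple = df.copy()
df_simple[num_cols] = df_simple[num_cols].fillna(df_simple[num_cols].mean())
if len(cat_cols):
    df_simple[cat_cols] = df_simple[cat_cols].fillna(df_simple[cat_cols].mode().iloc[0])

# KNN (numeric columns only; start from the raw df, not the mean-filled copy)
imputer = KNNImputer(n_neighbors=5)
df_knn = df.copy()
df_knn[num_cols] = imputer.fit_transform(df[num_cols])

# MICE-style iterative imputation (multivariate, numeric columns only)
imputer = IterativeImputer(max_iter=10, random_state=42)
df_mice = df.copy()
df_mice[num_cols] = imputer.fit_transform(df[num_cols])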
```

### Outlier detection (Isolation Forest)
```python
from sklearn.ensemble import IsolationForest

iso = IsolationForest(contamination=0.05, random_state=42)
# X: numeric feature matrix, assumed to be defined upstream
outliers = iso.fit_predict(X)  # -1 = outlier, 1 = normal

clean = X[outliers == 1]
```

### IQR (univariate)
```python
def iqr_outlier_remove(df, col, k=1.5):
    Q1, Q3 = df[col].quantile(0.25), df[col].quantile(0.75)
    IQR = Q3 - Q1
    return df[(df[col] >= Q1 - k * IQR) & (df[col] <= Q3 + k * IQR)]
```

### Deduplication (MinHash for huge data)
```python
from datasketch import MinHash, MinHashLSH

def get_minhash(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for word in text.split():
        m.update(word.encode('utf-8'))
    return m

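lsh = MinHashLSH(threshold=0.8, num_perm=128)  # approximate Jaccard threshold
unique_docs = []

# corpus: iterable of document strings, assumed to be defined upstream
for i, doc in enumerate(corpus):
    m = get_minhash(doc)
    if not lsh.query(m):  # no near-duplicate indexed so far
        lsh.insert(f'doc{i}', m)
        unique_docs.append(doc)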
```
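
MinHash approximates the Jaccard similarity between token sets, and the LSH `threshold` is compared against that quantity. As a reference point, the exact value can be computed directly on small inputs. The sketch below uses k-word shingles; the function names, the default `k=3`, and the two sample sentences are assumptions for this example (the snippet above hashes single words, which corresponds to `k=1`).

```python
def shingles(text, k=3):
    """Set of k-word shingles (overlapping word windows) from text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|; defined here as 1.0 when both sets are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Two hypothetical documents that differ by one word
doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox leaps over the lazy dog"
sim = jaccard(shingles(doc_a), shingles(doc_b))
```

A single changed word alters every shingle that overlaps it, so longer shingles make the measure stricter. Exact Jaccard over all pairs is quadratic in the number of documents, which is the cost MinHash with LSH avoids.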

### Fuzzy match (Levenshtein)
```python
from rapidfuzz import fuzz, process

# Record matching: top 5 entries of `candidates` (a list of strings, assumed defined upstream)
matches = process.extract('John Smith', candidates, scorer=fuzz.ratio, limit=5)
```
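
For reference, the edit distance behind Levenshtein-style matching can be written as a short dynamic program. This is a teaching sketch, not a replacement for rapidfuzz: the `similarity` normalization below is an assumption of this example, and `fuzz.ratio` uses its own normalization, so the scores are not expected to match.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Normalized similarity in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest
```

Only two rows of the table are kept at a time, so memory grows with the shorter string rather than with the product of both lengths.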

### Schema validation (Pandera)
```python
import pandera as pa

schema = pa.DataFrameSchema({
    'user_id': pa.Column(int, checks=pa.Check.greater_than(0)),
    'email': pa.Column(str, checks=pa.Check.str_matches(r'.+@.+\..+')),
    'age': pa.Column(int, checks=[
        pa.Check.greater_than_or_equal_to(0),
        pa.Check.less_than(150),
    ]),
})

clean_df = schema.validate(df)
```

### Great Expectations (data quality framework)
```python
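import great_expectations as ge

# Written against the legacy (pre-1.0) Great Expectations API;
# newer releases restructured these entry points.
context = ge.get_context()
suite = context.create_expectation_suite('user_data')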

batch = context.get_validator(
    batch_request=...,
    expectation_suite=suite,
)

batch.expect_column_values_to_not_be_null('user_id')
batch.expect_column_values_to_be_unique('user_id')
batch.expect_column_values_to_be_between('age', 0, 150)
batch.expect_column_value_lengths_to_be_between('email', 5, 100)

results = batch.validate()
```

### LLM data curation (quality classifier)
```python
def quality_classifier(text, threshold=0.5):
    """Heuristic quality grade for web text; True means keep.

    `detect_language` is an assumed helper that returns a code such as 'en'.
    """
    features = {
        'length_ok': 100 < len(text) < 10000,
        'punctuation_ratio': sum(c in '.,!?' for c in text) / max(len(text), 1),
        'caps_ratio': sum(c.isupper() for c in text) / max(len(text), 1),
        'has_url_spam': sum(1 for w in text.split() if w.startswith('http')) > 5,
        'language_ok': detect_language(text) == 'en',
    }
    score = (
        features['length_ok'] * 0.3 +
        (0.005 < features['punctuation_ratio'] < 0.05) * 0.2 +
        (features['caps_ratio'] < 0.3) * 0.2 +
        (not features['has_url_spam']) * 0.2 +
        features['language_ok'] * 0.1
    )
    return score > threshold
```

### PII detection / removal
```python
import re

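# Order matters: specific patterns run before the broad phone pattern,
# which would otherwise swallow SSNs and card numbers.
PATTERNS = {
    'email': r'[\w.+-]+@[\w-]+\.[\w.-]+',
    'credit_card': r'\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b',
    'ssn': r'\b\d{3}-\d{2}-\d{4}\b',
    # Must start and end on a digit so runs of spaces or dots do not match
    'phone': r'\+?\(?\d[\d\s().-]{8,}\d',
}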

def redact_pii(text):
    for name, pattern in PATTERNS.items():
        text = re.sub(pattern, f'[{name.upper()}_REDACTED]', text)
    return text
```

### LLM-aided cleaning
```python
# `llm` is assumed to be a client object exposing generate(prompt) -> str
def llm_clean(text, instruction='Fix typos and standardize format'):
    return llm.generate(f"""{instruction}.
Return only the cleaned text, no explanation.

Input: {text}""")
```

### Data version control (DVC)
```bash
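# Version the raw input, then define the cleaning step as a pipeline stage
dvc add data/raw/users.csv
# `dvc run` is deprecated in favor of `dvc stage add` followed by `dvc repro`.
# The stage tracks its own output, so the cleaned file needs no separate `dvc add`.
dvc stage add -n clean_users \
    -d clean_users.py \
    -d data/raw/users.csv \
    -o data/cleaned/users.parquet \
    python clean_users.py
dvc repro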
```

### Pipeline (Airflow / Dagster)
```python
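import dagster

# The helper functions and `schema` are assumed to be defined elsewhere in the project.
@dagster.asset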
def cleaned_users(raw_users):
    df = raw_users
    df = remove_duplicates(df)
    df = impute_missing(df)
    df = remove_outliers(df)
    df = standardize_format(df)
    schema.validate(df)
    return df
```

## Decision criteria
| Situation | Approach |
|---|---|
| Missing < 5% | Drop or simple fill |
| Missing 5-30% | KNN / MICE |
| Outlier univariate | IQR / Z-score |
| Outlier multivariate | Isolation Forest |
| Exact dup | Hash |
| Near dup (text) | MinHash + LSH |
| Schema | Pandera / GE |
| LLM data | Quality classifier + dedup + PII filter |

**Default**: Pandas + GE schema + Isolation Forest + MinHash dedup.
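
The missing-value rows of the table can be turned into a small helper. This is a sketch under stated assumptions: the function names are hypothetical, the 5% and 30% cut-offs are taken from the table, and anything above 30% is reported as not covered because the table gives no guidance there.

```python
def missing_fraction(values):
    """Share of None entries in a column (0.0 for an empty column)."""
    return sum(v is None for v in values) / len(values) if values else 0.0

def suggest_imputation(values):
    """Map a column's missing fraction to the approach named in the table."""
    frac = missing_fraction(values)
    if frac == 0:
        return "none needed"
    if frac < 0.05:
        return "drop or simple fill"
    if frac <= 0.30:
        return "KNN / MICE"
    return "not covered by the table; review the column manually"

# Hypothetical columns with 3% and 20% missing values
sparse_gaps = [1.0] * 97 + [None] * 3
wider_gaps = [1.0] * 8 + [None] * 2
```

Running the check per column, rather than once per dataset, lets a mostly complete table still route its few sparse columns to a stronger method.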

## 🔗 Graph
- Parent: [[MLOps]] · [[Data-Engineering]] · [[Statistics]]
- Variants: [[Imputation]] · [[Outlier-Detection]] · [[Deduplication]] · [[Standardization]]
- Applications: [[Great-Expectations]] · [[DVC]]
- Adjacent: [[Concept-Drift]] · [[Algorithmic-Fairness]] · [[Bias-Correction-Algorithm]] · [[CV_Synthesis]]

## 🤖 LLM usage
**When to use**: ETL pipelines, ML data preparation, curating LLM training data.
**When not to use**: sources that are already clean or already validated.

## ❌ Anti-patterns
- **No schema**: drift goes unnoticed.
- **Always dropping missing values**: missingness can be informative.
- **Single-method outlier detection**: prone to false positives.
- **No versioning**: results cannot be reproduced.
- **PII leaks**: risk of GDPR violations.
- **Raw → train (no cleaning)**: the model learns the garbage.
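
One way to soften the single-method outlier anti-pattern is to flag a point only when two detectors agree. The sketch below combines a z-score rule with an IQR rule. The function names, thresholds, and sample data are assumptions for this example, and `statistics.quantiles` uses a different default quantile method than pandas, so the cut-offs may differ slightly from the IQR snippet earlier.

```python
from statistics import mean, pstdev, quantiles

def zscore_flags(values, threshold=3.0):
    """True where |z-score| exceeds threshold."""
    mu, sigma = mean(values), pstdev(values)
    return [sigma > 0 and abs(v - mu) / sigma > threshold for v in values]

def iqr_flags(values, k=1.5):
    """True where the value falls outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    lo, hi = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v < lo or v > hi for v in values]

def consensus_outliers(values):
    """Indices flagged by both detectors."""
    pairs = zip(zscore_flags(values), iqr_flags(values))
    return [i for i, (z, q) in enumerate(pairs) if z and q]

# Hypothetical measurements: values near 11 plus one spike at the end
values = [10, 11, 12, 11, 10, 12, 11, 10, 12, 11,
          10, 12, 11, 10, 12, 11, 10, 12, 11, 95]
agreed = consensus_outliers(values)
```

Requiring agreement trades some recall for fewer false positives; points flagged by only one detector can be routed to manual review instead of being dropped.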

## 🧪 Verification / duplicates
- Verified (Pandas docs, scikit-learn, Great Expectations, Llama-3 paper data curation).
- Trust level: A.
- Related: [[Concept-Drift]] · [[Algorithmic-Fairness]] · [[Bias-Correction-Algorithm]] · [[Algorithmic-Biology]].

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: tasks plus KNN / MICE / Isolation Forest / MinHash / GE / PII code |