id: wiki-2026-0508-data-cleaning
title: Data Cleaning Algorithms
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: data cleaning, data quality, imputation, outlier detection, deduplication, Great Expectations, dvc
duplicate_of: none
source_trust_level: A
confidence_score: 0.93
verification_status: applied
tags: data-quality, cleaning, imputation, outlier, deduplication, mlops, great-expectations, llm-data-curation
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language Python; framework Pandas / scikit-learn / Great Expectations / DVC

Data Cleaning

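In one line

Data cleaning prevents "garbage in, garbage out". It is often cited as taking about 80% of the time on an AI project. The core tasks are imputation, outlier handling, deduplication, and standardization. Modern practice adds LLM-aided cleaning, data quality frameworks such as Great Expectations (GE), and quality classifiers for LLM pretraining data.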

Core tasks

Missing value

  • Drop: when only a small fraction is missing.
  • Mean / median / mode: simple fill.
  • KNN imputation: fill from the nearest neighbors.
  • Regression imputation: predict the missing value with a model.
  • MICE (Multiple Imputation by Chained Equations).
  • Indicator + impute: useful when the missingness pattern itself is informative.
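
The "indicator + impute" idea can be sketched in plain Python. This is a minimal illustration, assuming missing values are represented as `None`; the function name `impute_with_indicator` is hypothetical, and median fill is only one possible choice of fill value.

```python
from statistics import median


def impute_with_indicator(values):
    """Median-fill missing entries and return a parallel 0/1 missingness flag."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = median(observed)
    filled = [fill if v is None else v for v in values]
    # The flag keeps the information that a value was originally absent.
    was_missing = [int(v is None) for v in values]
    return filled, was_missing


ages = [31, None, 45, 28, None]  # illustrative data
filled, was_missing = impute_with_indicator(ages)
print(filled)       # [31, 31, 45, 28, 31]
print(was_missing)  # [0, 1, 0, 0, 1]
```

Keeping the flag as an extra feature lets a downstream model use the missingness pattern even after the gap has been filled.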

Outlier

  • Z-score: univariate.
  • IQR: robust univariate.
  • Isolation Forest: multivariate.
  • DBSCAN / LOF: density-based.
  • Autoencoder: reconstruction error.
  • Domain-specific rule.
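
Why IQR is called "robust" can be seen on a small sample. The sketch below uses only the standard library; the function names and the `readings` data are illustrative assumptions, and the thresholds (3 standard deviations, 1.5 × IQR) are the conventional defaults rather than anything prescribed here.

```python
from statistics import mean, quantiles, stdev


def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute z-score exceeds the threshold."""
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]


readings = [10, 12, 11, 13, 12, 11, 10, 12, 11, 100]

# The extreme value inflates the mean and the standard deviation,
# so its own z-score stays below 3 in a sample this small.
print(zscore_outliers(readings))  # []
print(iqr_outliers(readings))     # [100]
```

Quartiles barely move when one value is extreme, which is why the IQR fences still catch it.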

Deduplication

  • Exact: hash.
  • Fuzzy: Levenshtein, MinHash, SimHash.
  • Record linkage: multi-field matching.
  • dedupe.io: ML-based.
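
Exact deduplication by hash can be sketched as follows, assuming each record is a flat `dict` of simple values. The helper names and the choice of SHA-256 over a sorted key/value serialization are illustrative assumptions.

```python
import hashlib


def row_fingerprint(row):
    """Hash a canonical serialization so key order does not affect the result."""
    canonical = "\x1f".join(f"{key}={row[key]!r}" for key in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def drop_exact_duplicates(rows):
    """Keep the first occurrence of each distinct record."""
    seen, unique = set(), []
    for row in rows:
        fingerprint = row_fingerprint(row)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(row)
    return unique


rows = [
    {"id": 1, "name": "Ann"},
    {"name": "Ann", "id": 1},  # same record, different key order
    {"id": 2, "name": "Ann"},
]
print(len(drop_exact_duplicates(rows)))  # 2
```

Storing fingerprints instead of whole rows keeps the "seen" set small, which matters once the dataset no longer fits comfortably in memory.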

Standardization

  • Date formats.
  • Units (m / ft, kg / lb).
  • Case / encoding (UTF-8 normalization).
  • Categorical values (lowercase, strip).
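
The four standardization steps above might look like this with only the standard library. The accepted date formats, the day-first reading of `dd/mm/yyyy`, and all helper names are assumptions made for illustration; a real pipeline would fix these per data source.

```python
import unicodedata
from datetime import datetime

# Assumed input formats, tried in order; "05/03/2024" is read as day/month/year.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")
TO_METERS = {"m": 1.0, "ft": 0.3048}
TO_KILOGRAMS = {"kg": 1.0, "lb": 0.45359237}


def standardize_date(raw):
    """Parse a date in any of the assumed formats and return ISO 8601."""
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw!r}")


def to_meters(value, unit):
    return value * TO_METERS[unit]


def to_kilograms(value, unit):
    return value * TO_KILOGRAMS[unit]


def normalize_category(raw):
    """Unicode-normalize (NFC), trim, and case-fold a categorical label."""
    return unicodedata.normalize("NFC", raw).strip().casefold()


print(standardize_date("05/03/2024"))     # 2024-03-05
print(normalize_category("  New York "))  # new york
```

Raising on an unrecognized date, rather than guessing, keeps format drift visible instead of silently producing wrong values.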

Validation

  • Schema (Pandera, Pydantic).
  • Statistical (Great Expectations).
  • Constraint (dbt tests).

LLM-specific data curation

  • Quality classifier: grades web text.
  • Toxicity / PII filter.
  • Deduplication (MinHash at trillion-token scale).
  • Quality decile sampling (Llama-3 trick).

Modern stack

  • Pandas + scikit-learn.
  • Great Expectations: data quality framework.
  • DVC (Data Version Control).
  • Dagster / Airflow: pipeline orchestration.
  • dbt: SQL transformation.
  • Pandera: schema validation.

💻 Patterns

Missing value imputation

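import pandas as pd
# IterativeImputer is experimental: this enabling import must run before it is imported
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer

df = pd.read_csv('data.csv')
num = df.select_dtypes(include='number')  # KNN and MICE need numeric input

# Simple: mean for numeric columns, then mode for the remaining (categorical) columns
df_simple = df.fillna(df.mean(numeric_only=True))
df_simple = df_simple.fillna(df.mode().iloc[0])

# KNN: fill each gap from the 5 most similar rows
knn = KNNImputer(n_neighbors=5)
df_knn = pd.DataFrame(knn.fit_transform(num), columns=num.columns, index=num.index)

# MICE-style multivariate imputation
mice = IterativeImputer(max_iter=10, random_state=42)
df_mice = pd.DataFrame(mice.fit_transform(num), columns=num.columns, index=num.index)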

Outlier detection (Isolation Forest)

from sklearn.ensemble import IsolationForest

# X: numeric feature matrix (NumPy array or DataFrame), assumed to be defined
iso = IsolationForest(contamination=0.05, random_state=42)
outliers = iso.fit_predict(X)  # -1 = outlier, 1 = normal

clean = X[outliers == 1]

IQR (univariate)

def iqr_outlier_remove(df, col, k=1.5):
    Q1, Q3 = df[col].quantile(0.25), df[col].quantile(0.75)
    IQR = Q3 - Q1
    return df[(df[col] >= Q1 - k * IQR) & (df[col] <= Q3 + k * IQR)]

Deduplication (MinHash for huge data)

from datasketch import MinHash, MinHashLSH

def get_minhash(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for word in text.split():
        m.update(word.encode('utf-8'))
    return m

# corpus: iterable of document strings, assumed to be defined
lsh = MinHashLSH(threshold=0.8, num_perm=128)
unique_docs = []  # documents kept so far

for i, doc in enumerate(corpus):
    m = get_minhash(doc)
    if not lsh.query(m):
        lsh.insert(f'doc{i}', m)
        unique_docs.append(doc)
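
What the MinHash signature estimates can be shown without any library. The sketch below is a from-scratch illustration, not how `datasketch` is implemented: it assumes word 3-gram shingles (the code above hashes single words instead) and uses salted BLAKE2b hashes as stand-ins for random permutations. All function names are hypothetical.

```python
import hashlib


def word_shingles(text, k=3):
    """Set of overlapping k-word windows; short texts yield a single shingle."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(item, seed):
    # One salted hash function per seed plays the role of one permutation.
    digest = hashlib.blake2b(
        item.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(items, num_perm=128):
    """For each hash function, keep the minimum hash over the set."""
    return [min(_hash(item, seed) for item in items) for seed in range(num_perm)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


a = "the quick brown fox jumps over the lazy dog near the river bank today"
b = "the quick brown fox jumps over the lazy dog near the river bank tonight"
sig_a = minhash_signature(word_shingles(a))
sig_b = minhash_signature(word_shingles(b))
print(estimate_jaccard(sig_a, sig_b))  # close to the true Jaccard of 11/13
```

The `threshold=0.8` passed to `MinHashLSH` above refers to this Jaccard similarity; LSH then avoids comparing every pair of signatures.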

Fuzzy match (Levenshtein)

from rapidfuzz import fuzz, process

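# Record matching: the five candidates most similar to the query.
# candidates: a list of strings, assumed to be defined.
# Note: fuzz.ratio is a normalized Indel similarity rather than strict Levenshtein.
matches = process.extract('John Smith', candidates, scorer=fuzz.ratio, limit=5)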
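
For reference, the Levenshtein distance itself is a short dynamic program. This is a plain-Python sketch for illustration (a library such as rapidfuzz will be far faster); the `similarity` normalization by the longer string's length is one common convention and an assumption here.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Distance scaled to [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("John Smith", "Jon Smith"))  # 1
```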

Schema validation (Pandera)

import pandera as pa

schema = pa.DataFrameSchema({
    'user_id': pa.Column(int, checks=pa.Check.greater_than(0)),
    'email': pa.Column(str, checks=pa.Check.str_matches(r'.+@.+\..+')),
    'age': pa.Column(int, checks=[
        pa.Check.greater_than_or_equal_to(0),
        pa.Check.less_than(150),
    ]),
})

clean_df = schema.validate(df)

Great Expectations (data quality framework)

import great_expectations as ge

# Legacy (pre-1.0) Great Expectations API; later releases restructure these calls
context = ge.get_context()
suite = context.create_expectation_suite('user_data')

batch = context.get_validator(
    batch_request=...,
    expectation_suite=suite,
)

batch.expect_column_values_to_not_be_null('user_id')
batch.expect_column_values_to_be_unique('user_id')
batch.expect_column_values_to_be_between('age', 0, 150)
batch.expect_column_value_lengths_to_be_between('email', 5, 100)

results = batch.validate()

LLM data curation (quality classifier)

def quality_classifier(text, threshold=0.5):
    """Grade web text quality with simple heuristics; True means keep.

    detect_language is assumed to be defined elsewhere.
    """
    features = {
        'length_ok': 100 < len(text) < 10000,
        'punctuation_ratio': sum(c in '.,!?' for c in text) / max(len(text), 1),
        'caps_ratio': sum(c.isupper() for c in text) / max(len(text), 1),
        'has_url_spam': sum(1 for w in text.split() if w.startswith('http')) > 5,
        'language_ok': detect_language(text) == 'en',
    }
    score = (
        features['length_ok'] * 0.3 +
        (0.005 < features['punctuation_ratio'] < 0.05) * 0.2 +
        (features['caps_ratio'] < 0.3) * 0.2 +
        (not features['has_url_spam']) * 0.2 +
        features['language_ok'] * 0.1
    )
    return score > threshold

PII detection / removal

import re

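# Order matters: the broad phone pattern also matches SSNs and card numbers,
# so the more specific patterns must run first.
PATTERNS = {
    'email': r'[\w.+-]+@[\w-]+\.[\w.-]+',
    'credit_card': r'\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}',
    'ssn': r'\d{3}-\d{2}-\d{4}',
    'phone': r'\+?[\d\s().-]{10,}',
}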

def redact_pii(text):
    for name, pattern in PATTERNS.items():
        text = re.sub(pattern, f'[{name.upper()}_REDACTED]', text)
    return text
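
Regex-only matching flags any 16-digit run as a card number. One way to cut such false positives, sketched here as an optional refinement rather than part of the pattern above, is to redact only matches that pass the Luhn checksum. The names `luhn_valid` and `redact_cards` are hypothetical.

```python
import re

CARD_PATTERN = re.compile(r'\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}')


def luhn_valid(number):
    """Luhn checksum: double every second digit from the right, sum, check mod 10."""
    digits = [int(c) for c in number if c.isdigit()]
    if not digits:
        return False
    total = 0
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def redact_cards(text):
    """Redact 16-digit matches only when they pass the Luhn check."""
    def replace(match):
        candidate = match.group()
        return '[CREDIT_CARD_REDACTED]' if luhn_valid(candidate) else candidate
    return CARD_PATTERN.sub(replace, text)


print(redact_cards('card 4111 1111 1111 1111 ok'))  # card [CREDIT_CARD_REDACTED] ok
print(redact_cards('order 1234 5678 9012 3456'))    # order 1234 5678 9012 3456
```

The trade-off is that a mistyped real card number fails the checksum and is left in place, so a stricter policy may prefer to redact every match.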

LLM-aided cleaning

# llm: an LLM client object with a generate() method, assumed to be defined elsewhere
def llm_clean(text, instruction='Fix typos and standardize format'):
    return llm.generate(f"""{instruction}.
Return only the cleaned text, no explanation.

Input: {text}""")

Data version control (DVC)

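# Version the data at each cleaning step
dvc add data/raw/users.csv
# `dvc run` was removed in DVC 3.0: define the stage with `dvc stage add`, then run it.
# The stage tracks its own output, so the cleaned file needs no separate `dvc add`.
dvc stage add -n clean_users \
    -d clean_users.py \
    -d data/raw/users.csv \
    -o data/cleaned/users.parquet \
    python clean_users.py
dvc repro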

Pipeline (Airflow / Dagster)

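from dagster import asset

# The helper functions and `schema` (the Pandera schema above) are assumed to be defined elsewhere
@asset
def cleaned_users(raw_users):
    df = raw_users
    df = remove_duplicates(df)
    df = impute_missing(df)
    df = remove_outliers(df)
    df = standardize_format(df)
    return schema.validate(df)  # raises if the cleaned frame violates the schema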

Decision criteria

| Situation | Approach |
|---|---|
| Missing < 5% | Drop or simple fill |
| Missing 5-30% | KNN / MICE |
| Outlier, univariate | IQR / Z-score |
| Outlier, multivariate | Isolation Forest |
| Exact duplicates | Hash |
| Near duplicates (text) | MinHash + LSH |
| Schema | Pandera / GE |
| LLM data | Quality classifier + dedup + PII filter |

Default: Pandas + GE schema + Isolation Forest + MinHash dedup.
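
The two missing-value rows of the table can be turned into a small helper. This is only a sketch: it assumes missing values are `None`, the function names are hypothetical, and the table gives no guidance above 30%, so the helper says so instead of guessing.

```python
def missing_fraction(values):
    """Share of entries that are None."""
    if not values:
        raise ValueError("empty column")
    return sum(v is None for v in values) / len(values)


def suggest_missing_value_approach(values):
    """Map a column's missing fraction onto the thresholds in the table."""
    fraction = missing_fraction(values)
    if fraction == 0:
        return "nothing to impute"
    if fraction < 0.05:
        return "drop or simple fill"
    if fraction <= 0.30:
        return "KNN / MICE"
    return "not covered by the table"


column = [1.0] * 8 + [None] * 2  # 20% missing
print(suggest_missing_value_approach(column))  # KNN / MICE
```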

🔗 Graph

🤖 LLM usage

When to use: ETL pipelines, ML data preparation, curating LLM training data. When not to use: sources that are already clean or already validated.

Anti-patterns

  • No schema: silent drift.
  • Always dropping missing values: missingness can be informative.
  • Single-method outlier detection: false positives.
  • No versioning: no reproducibility.
  • PII leaks: GDPR violations.
  • Raw → train (no cleaning): the model learns garbage.

🧪 Verification / duplicates

🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: tasks plus KNN / MICE / Isolation Forest / MinHash / GE / PII code |