| id | title | category | status | canonical_id | aliases | duplicate_of | source_trust_level | confidence_score | verification_status | tags | raw_sources | last_reinforced | github_commit | tech_stack |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| wiki-2026-0508-data-cleaning | Data Cleaning Algorithms | 10_Wiki/Topics | verified | self | | none | A | 0.93 | applied | | | 2026-05-10 | pending | |
Data Cleaning
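One-line summary
Data cleaning is how you prevent "garbage in, garbage out". It is commonly cited as taking around 80% of AI project time. The core tasks are imputation, outlier handling, deduplication, and standardization. Modern practice adds LLM-aided cleaning, data quality frameworks such as Great Expectations (GE), and quality classifiers for LLM pretraining data.
Core tasks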
Missing value
- Drop: when only a small fraction is missing.
- Mean / median / mode: simple fill.
- KNN imputation: fill from the nearest neighbors.
- Regression imputation: predict the missing value with a model.
- MICE (Multiple Imputation by Chained Equations).
- Indicator + impute: useful when the missingness pattern itself is informative.
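The "indicator + impute" idea can be sketched with the standard library alone: fill each gap with the median and keep a parallel 0/1 flag so a downstream model can still see where values were missing. The function name `impute_with_indicator` and the sample `ages` list are illustrative assumptions, not part of the original note.

```python
from statistics import median

def impute_with_indicator(values):
    """Median-impute None entries; return (filled, was_missing) lists."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = median(observed)
    filled = [fill if v is None else v for v in values]
    # 1 marks a position that was originally missing, 0 an observed value
    was_missing = [int(v is None) for v in values]
    return filled, was_missing

# Hypothetical sample column with two gaps
ages = [34, None, 29, 41, None, 38]
filled, was_missing = impute_with_indicator(ages)
print(filled)
print(was_missing)
```

Keeping the flag as its own feature is what distinguishes this from a plain median fill: the imputed value hides the gap, the indicator preserves it.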
Outlier
- Z-score: univariate.
- IQR: robust univariate.
- Isolation Forest: multivariate.
- DBSCAN / LOF: density-based.
- Autoencoder: reconstruction error.
- Domain-specific rule.
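A small comparison may help show why IQR is listed as the robust univariate option. In the sketch below, a single extreme value inflates the mean and standard deviation enough that a z-score threshold of 3 misses it, while the IQR fence still catches it. The function names, the threshold of 3, and the `readings` sample are assumptions chosen for illustration.

```python
from statistics import mean, pstdev, quantiles

def zscore_outliers(xs, threshold=3.0):
    """Values whose absolute z-score exceeds the threshold."""
    mu, sigma = mean(xs), pstdev(xs)
    if sigma == 0:
        return []
    return [x for x in xs if abs(x - mu) / sigma > threshold]

def iqr_outliers(xs, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # method="inclusive" interpolates linearly between data points
    q1, _, q3 = quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in xs if x < low or x > high]

# Hypothetical sensor readings with one obvious outlier
readings = [10, 12, 11, 13, 12, 11, 10, 100]
print(zscore_outliers(readings))
print(iqr_outliers(readings))
```

On a sample this small the outlier itself drags the z-score statistics toward it, which is one reason to cross-check with a quantile-based rule.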
Deduplication
- Exact: hash.
- Fuzzy: Levenshtein, MinHash, SimHash.
- Record linkage: multi-field matching.
- dedupe.io: ML-based.
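For the exact-match case, a hash-based pass could look like the following sketch. It normalizes each record lightly (Unicode form, case, whitespace) before hashing, which is an assumption about what should count as "the same" record; drop that step if only byte-identical rows should match. `record_key`, `dedup_exact`, and the sample `rows` are hypothetical names.

```python
import hashlib
import unicodedata

def record_key(text):
    """Stable hash of a lightly normalized record."""
    normalized = unicodedata.normalize("NFKC", text).casefold()
    normalized = " ".join(normalized.split())  # collapse runs of whitespace
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def dedup_exact(records):
    """Keep the first occurrence of each distinct key, preserving order."""
    seen = set()
    unique = []
    for rec in records:
        key = record_key(rec)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

rows = ["Alice Kim", "alice  kim ", "Bob Lee", "Alice Kim"]
print(dedup_exact(rows))
```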
Standardization
- Date format.
- Units (m / ft, kg / lb).
- Case / encoding (UTF-8 normalization).
- Categorical values (lowercase, strip whitespace).
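The four standardization items above can be sketched as small helpers. The list of accepted date formats is an assumption and would need to be set per source, since a string like `08/05/2026` is ambiguous between day-first and month-first conventions. All function names here are illustrative.

```python
from datetime import datetime
import unicodedata

# Assumed input formats; day-first is a per-source decision, not a default
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y/%m/%d")

def to_iso_date(raw):
    """Return an ISO 8601 date string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

def feet_to_metres(feet):
    return feet * 0.3048  # exact by definition of the international foot

def pounds_to_kg(pounds):
    return pounds * 0.45359237  # exact by definition of the avoirdupois pound

def clean_category(raw):
    """Normalize Unicode composition, trim whitespace, and lowercase."""
    return unicodedata.normalize("NFC", raw).strip().lower()

print(to_iso_date("08/05/2026"))
print(clean_category("  Electronics "))
```

Returning `None` for unparseable dates keeps the failure visible, so those rows can be routed to the missing-value step rather than silently guessed.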
Validation
- Schema (Pandera, Pydantic).
- Statistical (Great Expectations).
- Constraint (dbt tests).
LLM-specific data curation
- Quality classifier: grades web text.
- Toxicity / PII filter.
- Deduplication (MinHash for trillion tokens).
- Quality decile sampling (Llama-3 trick).
Modern stack
- Pandas + scikit-learn.
- Great Expectations: data quality framework.
- DVC (Data Version Control).
- Dagster / Airflow: pipeline orchestration.
- dbt: SQL transformation.
- Pandera: schema validation.
💻 Patterns
Missing value imputation
import pandas as pd
# IterativeImputer is experimental: the enabling import must come first
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer

df = pd.read_csv('data.csv')
num_cols = df.select_dtypes(include='number').columns

# Simple: mean for numeric columns, then mode for the rest (categorical)
df_simple = df.fillna(df[num_cols].mean()).fillna(df.mode().iloc[0])

# KNN: fill from the 5 nearest rows (numeric columns only)
imputer = KNNImputer(n_neighbors=5)
df_knn = df.copy()
df_knn[num_cols] = imputer.fit_transform(df[num_cols])

# MICE-style (multivariate) imputation
imputer = IterativeImputer(max_iter=10, random_state=42)
df_mice = df.copy()
df_mice[num_cols] = imputer.fit_transform(df[num_cols])
Outlier detection (Isolation Forest)
from sklearn.ensemble import IsolationForest

# X: numeric feature matrix (NumPy array) with no missing values
iso = IsolationForest(contamination=0.05, random_state=42)
labels = iso.fit_predict(X)  # -1 = outlier, 1 = normal
clean = X[labels == 1]
IQR (univariate)
def iqr_outlier_remove(df, col, k=1.5):
    """Keep rows whose value in `col` lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
    iqr = q3 - q1
    return df[(df[col] >= q1 - k * iqr) & (df[col] <= q3 + k * iqr)]
Deduplication (MinHash for huge data)
from datasketch import MinHash, MinHashLSH

def get_minhash(text, num_perm=128):
    """Build a MinHash signature from the whitespace-separated tokens of `text`."""
    m = MinHash(num_perm=num_perm)
    for word in text.split():
        m.update(word.encode('utf-8'))
    return m

# corpus: iterable of document strings
lsh = MinHashLSH(threshold=0.8, num_perm=128)
unique_docs = []
for i, doc in enumerate(corpus):
    m = get_minhash(doc)
    if not lsh.query(m):  # no indexed document is a near-duplicate candidate
        lsh.insert(f'doc{i}', m)
        unique_docs.append(doc)
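The `threshold=0.8` above refers to Jaccard similarity between token sets, which MinHash approximates. Computing it exactly on two short strings gives a feel for how strict that cutoff is. This is a minimal sketch; `jaccard` and the two sample sentences are assumptions for illustration.

```python
def jaccard(a, b):
    """Exact Jaccard similarity between the word sets of two strings."""
    set_a, set_b = set(a.split()), set(b.split())
    if not set_a and not set_b:
        return 1.0  # two empty documents are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)

doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumped over the lazy dog"

# 7 shared words out of 9 distinct words overall
similarity = jaccard(doc_a, doc_b)
print(round(similarity, 3))
print(similarity >= 0.8)
```

A single changed word in a nine-word sentence already falls below 0.8, so short texts may need a lower threshold or character shingles instead of whole words.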
Fuzzy match (Levenshtein)
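from rapidfuzz import fuzz, process

# candidates: list of strings to match against
# fuzz.ratio is a normalized edit-based similarity score in the range 0-100
# Returns up to 5 (match, score, index) tuples, best first
matches = process.extract('John Smith', candidates, scorer=fuzz.ratio, limit=5)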
Schema validation (Pandera)
import pandera as pa

schema = pa.DataFrameSchema({
    'user_id': pa.Column(int, checks=pa.Check.greater_than(0)),
    'email': pa.Column(str, checks=pa.Check.str_matches(r'.+@.+\..+')),
    'age': pa.Column(int, checks=[
        pa.Check.greater_than_or_equal_to(0),
        pa.Check.less_than(150),
    ]),
})
clean_df = schema.validate(df)
Great Expectations (data quality framework)
import great_expectations as ge

# Legacy (pre-1.0) Validator-style API. Method names differ between releases,
# e.g. later 0.x versions replace create_expectation_suite with add_expectation_suite.
context = ge.get_context()
suite = context.create_expectation_suite('user_data')
batch = context.get_validator(
    batch_request=...,
    expectation_suite=suite,
)
batch.expect_column_values_to_not_be_null('user_id')
batch.expect_column_values_to_be_unique('user_id')
batch.expect_column_values_to_be_between('age', 0, 150)
batch.expect_column_value_lengths_to_be_between('email', 5, 100)
results = batch.validate()
LLM data curation (quality classifier)
def quality_classifier(text, threshold=0.5):
    """Grade web text with simple heuristics; True means the text passes."""
    # detect_language is assumed to be defined elsewhere and to return a language code
    features = {
        'length_ok': 100 < len(text) < 10000,
        'punctuation_ratio': sum(c in '.,!?' for c in text) / max(len(text), 1),
        'caps_ratio': sum(c.isupper() for c in text) / max(len(text), 1),
        'has_url_spam': sum(1 for w in text.split() if w.startswith('http')) > 5,
        'language_ok': detect_language(text) == 'en',
    }
    # Weighted sum of boolean checks; the weights total 1.0
    score = (
        features['length_ok'] * 0.3 +
        (0.005 < features['punctuation_ratio'] < 0.05) * 0.2 +
        (features['caps_ratio'] < 0.3) * 0.2 +
        (not features['has_url_spam']) * 0.2 +
        features['language_ok'] * 0.1
    )
    return score > threshold
PII detection / removal
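import re

# Heuristic patterns; expect false positives and misses.
# Order matters: the loose phone pattern also matches SSNs and card numbers,
# so the more specific patterns must run first.
PATTERNS = {
    'email': r'[\w.+-]+@[\w-]+\.[\w.-]+',
    'ssn': r'\d{3}-\d{2}-\d{4}',
    'credit_card': r'\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}',
    'phone': r'\+?[\d\s().-]{10,}',
}

def redact_pii(text):
    """Replace each match with a tag such as [EMAIL_REDACTED]."""
    for name, pattern in PATTERNS.items():
        text = re.sub(pattern, f'[{name.upper()}_REDACTED]', text)
    return text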
LLM-aided cleaning
def llm_clean(text, instruction='Fix typos and standardize format'):
    # llm is assumed to be an already-configured client exposing generate(prompt)
    return llm.generate(f"""{instruction}.
Return only the cleaned text, no explanation.
Input: {text}""")
Data version control (DVC)
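# Version the data produced at each cleaning step
dvc add data/raw/users.csv
# The cleaned file is tracked as a stage output, so it needs no separate `dvc add`.
# `dvc run` was replaced by `dvc stage add` + `dvc repro` in DVC 2.0 and later.
dvc stage add -n clean_users \
    -d data/raw/users.csv \
    -o data/cleaned/users.parquet \
    python clean_users.py
dvc repro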
Pipeline (Airflow / Dagster)
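import dagster

# remove_duplicates, impute_missing, remove_outliers, standardize_format and
# schema are assumed to be defined elsewhere in the project
@dagster.asset
def cleaned_users(raw_users):
    df = raw_users
    df = remove_duplicates(df)
    df = impute_missing(df)
    df = remove_outliers(df)
    df = standardize_format(df)
    schema.validate(df)  # fail the asset if the cleaned frame violates the schema
    return df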
Decision criteria
| Situation | Approach |
|---|---|
| Missing < 5% | Drop or simple fill |
| Missing 5-30% | KNN / MICE |
| Outlier univariate | IQR / Z-score |
| Outlier multivariate | Isolation Forest |
| Exact dup | Hash |
| Near dup (text) | MinHash + LSH |
| Schema | Pandera / GE |
| LLM data | Quality classifier + dedup + PII filter |
Default: Pandas + GE schema + Isolation Forest + MinHash dedup.
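The missing-value rows of the table can be read as a simple threshold rule. The sketch below encodes only what the table states; it makes no recommendation above 30% missing because the table gives none, and it assumes the 5% boundary belongs to the KNN / MICE band. `missing_fraction`, `choose_imputation`, and the sample `column` are hypothetical names.

```python
NO_GUIDANCE = "no recommendation in the table; review the column manually"

def missing_fraction(values):
    """Share of None entries in a column."""
    if not values:
        raise ValueError("empty column")
    return sum(v is None for v in values) / len(values)

def choose_imputation(fraction):
    """Map a missing-value fraction to the approach listed in the table."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if fraction < 0.05:
        return "drop or simple fill"
    if fraction <= 0.30:
        return "KNN / MICE"
    return NO_GUIDANCE

# Hypothetical column with 1 of 10 values missing
column = [1, None, 3, 4, 5, 6, 7, 8, 9, 10]
print(choose_imputation(missing_fraction(column)))
```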
🔗 Graph
- Parent: MLOps · Data-Engineering · Statistics
- Variants: Imputation · Outlier-Detection · Deduplication · Standardization
- Applications: Great-Expectations · DVC
- Adjacent: Concept-Drift · Algorithmic Fairness · Bias-Correction-Algorithm · CV_Synthesis
🤖 LLM usage
When to use: ETL pipelines, ML data preparation, curating LLM training data. When not to use: sources that are already clean or already validated.
❌ Anti-patterns
- No schema: silent drift.
- Always dropping missing values: missingness can be informative.
- Single-method outlier detection: false positives.
- No versioning: no reproducibility.
- PII leak: GDPR violation.
- Raw → train (no cleaning): the model learns garbage.
🧪 Verification / duplicates
- Verified (Pandas docs, scikit-learn, Great Expectations, Llama-3 paper data curation).
- Trust level: A.
- Related: Concept-Drift · Algorithmic Fairness · Bias-Correction-Algorithm · Algorithmic-Biology.
🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: tasks + KNN / MICE / Isolation Forest / MinHash / GE / PII code |