[G1-Sync] Manual knowledge update

2026-05-10 22:08:15 +09:00
parent 21ac3ed255
commit 504fd5fb42
3011 changed files with 380280 additions and 206977 deletions
@@ -1,93 +1,293 @@
 ---
-id: wiki-2026-0508-data-cleaning-algorithms
+id: wiki-2026-0508-data-cleaning
 title: Data Cleaning Algorithms
 category: 10_Wiki/Topics
-status: needs_review
+status: verified
 canonical_id: self
-aliases: [P-Reinforce-AUTO-DCAL-001]
+aliases: [data cleaning, data quality, imputation, outlier detection, deduplication, Great Expectations, dvc]
 duplicate_of: none
 source_trust_level: A
-confidence_score: 0.92
-tags: [auto-reinforced, data-cleaning, data-preProcessing, algorithms, outliers, duplicate-detection]
+confidence_score: 0.93
+verification_status: applied
+tags: [data-quality, cleaning, imputation, outlier, deduplication, mlops, great-expectations, llm-data-curation]
 raw_sources: []
-last_reinforced: 2026-04-20
+last_reinforced: 2026-05-10
 github_commit: pending
-inferred_by: Claude Opus 4.7 (auto-normalize 2026-05-08)
 tech_stack:
-  language: unspecified
-  framework: unspecified
+  language: Python
+  framework: Pandas / scikit-learn / Great Expectations / DVC
 ---

-# [[Data Cleaning Algorithms|Data Cleaning Algorithms]]
+# Data Cleaning

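-## 📌 One-Line Insight (The Karpathy Summary)
-> "Knowledge filtering: an intellectual washing process that, to ward off the curse of 'Garbage In, Garbage Out', automatically identifies and corrects noise, duplicates, and errors in the data so that the AI learns only the 'essence'."
+## One-line summary
+> **"Prevent garbage in, garbage out."** Data cleaning is commonly said to take about 80% of an AI project's time. The core tasks are imputation, outlier handling, deduplication, and standardization. Modern practice adds LLM-aided cleaning, data quality frameworks (Great Expectations, GE), and quality classifiers for LLM pretraining data.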

-## 📖 Structured Knowledge (Synthesized Content)
-Data cleaning algorithms are techniques that correct errors and enforce consistency to raise the quality of a dataset.
+## Core tasks

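-1.  **Main tasks and algorithms**:
-    *   **Missing Value Imputation**: Fill in missing values using the mean, the mode, or a KNN/regression model.
-    *   **Outlier Detection**: Remove outliers that fall far outside the normal range, using Z-Score, Isolation Forest, and similar methods. (Linked to [[Anomaly-Detection|Anomaly-Detection]])
-    *   **Deduplication**: Remove overlapping records using hash matching or edit distance (Levenshtein Distance).
-    *   **Standardization**: Unify units and formats (e.g., a single date format).
-2.  **Why it matters**:
-    *   It takes up 80% of total AI project time and is the most practical area, setting the upper bound on model performance.
+### Missing values
+- **Drop**: when only a small fraction of rows is affected.
+- **Mean / median / mode**: simple single-value fill.
+- **KNN imputation**: fill from the nearest neighbors.
+- **Regression imputation**: predict the missing value with a model.
+- **MICE** (Multiple Imputation by Chained Equations).
+- **Indicator + impute**: add a missingness flag when the missingness pattern itself is informative.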

-## ⚠️ Contradictions & Updates
- **Conflict with past data**: The old policy was for people to clean data "by eye" in Excel; the modern policy is automated as "probabilistic data cleaning" that processes billions of records directly and "AI-driven cleaning of AI data" (RL Update).
- **Policy change (RL Update)**: When training large language models, selecting high-quality data with an "intelligent classifier" that filters out low-quality web text has become a key, closely guarded policy that determines model performance.
+### Outliers
+- **Z-score**: univariate.
+- **IQR**: robust univariate.
+- **Isolation Forest**: multivariate.
+- **DBSCAN / LOF**: density-based.
+- **Autoencoder**: reconstruction error.
+- **Domain-specific rules**.
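
As a minimal sketch of the univariate Z-score rule listed above, assuming a plain list of numbers and the conventional cutoff of 3 standard deviations. The function name `zscore_outliers` and the sample data are illustrative, not taken from the source.

```python
from statistics import fmean, pstdev

def zscore_outliers(values, threshold=3.0):
    """Return the indices whose |z-score| exceeds `threshold`."""
    mu = fmean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical: nothing can be an outlier
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]

# Hypothetical sample: twenty ordinary readings and one extreme value
values = [10.0] * 20 + [1000.0]
flagged = zscore_outliers(values)
flagged_set = set(flagged)
clean = [v for i, v in enumerate(values) if i not in flagged_set]
```

In small samples a single extreme value inflates the standard deviation and can mask itself, which is one reason the IQR rule is described as the robust alternative.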

-## 🔗 Knowledge Links (Graph)
- [[Anomaly-Detection|Anomaly-Detection]], [[Statistics & Data Analysis|Statistics & Data Analysis]], [[Optimization|Optimization]], [[Quality Gates|Quality Gates]], [[Signal in Noise|Signal in Noise]]
- **Modern Tech/Tools**: Pandas, Scikit-learn, Great Expectations, DVC.
---
+### Deduplication
+- **Exact**: hash matching.
+- **Fuzzy**: Levenshtein, MinHash, SimHash.
+- **Record linkage**: multi-field matching.
+- **dedupe.io**: ML-based.
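
Exact deduplication by hash can be sketched as follows, assuming documents are plain strings and that lowercasing plus whitespace collapsing is an acceptable notion of "identical". The helper names `normalize` and `dedup_exact` are hypothetical, and the normalization step is an illustrative choice rather than something the source prescribes.

```python
import hashlib

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())

def dedup_exact(docs):
    """Keep the first occurrence of each document, compared by SHA-256 digest."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = ["Hello  World", "hello world", "Goodbye"]
unique = dedup_exact(docs)
```

Storing digests instead of full texts keeps memory roughly constant per document, but any difference that survives normalization yields a different hash, which is where the fuzzy methods above take over.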

-## 🤖 LLM Usage Hints (How to Use This Knowledge)
+### Standardization
+- Date formats.
+- Units (m / ft, kg / lb).
+- Case / encoding (UTF-8 normalization).
+- Categorical values (lowercase, strip whitespace).
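
The standardization items above could look like this in practice. This is a sketch under stated assumptions: the accepted input date formats in `DATE_FORMATS` (including day-first ordering for slash dates) and all function names are hypothetical choices for illustration.

```python
import unicodedata
from datetime import datetime

# Assumed input formats; slash dates are treated as day/month/year here
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")

def standardize_date(raw: str) -> str:
    """Parse a date in any of DATE_FORMATS and return ISO 8601 (YYYY-MM-DD)."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw!r}")

def standardize_category(raw: str) -> str:
    """NFKC-normalize, strip surrounding whitespace, and lowercase."""
    return unicodedata.normalize("NFKC", raw).strip().lower()

FT_TO_M = 0.3048  # one international foot in meters

def feet_to_meters(feet: float) -> float:
    """Convert a length in feet to meters."""
    return feet * FT_TO_M
```

Raising on an unrecognized date, rather than guessing, keeps ambiguous inputs visible instead of silently storing a wrong value.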

-**When to use this knowledge:**
- *(TODO)*
+### Validation
+- **Schema** (Pandera, Pydantic).
+- **Statistical** (Great Expectations).
+- **Constraint** (dbt tests).

-**When not to use it:**
- *(TODO)*
+### LLM-specific data curation
+- **Quality classifier**: grades web text.
+- **Toxicity / PII filter**.
+- **Deduplication** (MinHash at trillion-token scale).
+- **Quality decile sampling** (Llama-3 trick).

-## 🧪 Validation Status (Validation)
+### Modern stack
+- **Pandas** + scikit-learn.
+- **Great Expectations**: data quality framework.
+- **DVC** (Data Version Control).
+- **Dagster / Airflow**: pipeline orchestration.
+- **dbt**: SQL transformation.
+- **Pandera**: schema validation.

- **Information status:** needs_review
- **Source trust level:** A
- **Review reason:** *(P-Reinforce Phase 1 automatic normalization. Body needs verification.)*
+## 💻 Patterns

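-## 🧬 Duplicate Check
+### Missing value imputation
+```python
+import pandas as pd
+# IterativeImputer is experimental: this import must run before importing it.
+from sklearn.experimental import enable_iterative_imputer  # noqa: F401
+from sklearn.impute import KNNImputer, IterativeImputer

- **Existing similar documents:** *(TODO: see the indexer cluster report)*
- **Handling:** UPDATE (automatic normalization)
- **Reason:** Phase 1 normalization: old template and missing fields filled in.
+df = pd.read_csv('data.csv')
+num_cols = df.select_dtypes(include='number').columns

-## 🕓 Changelog
+# Simple fill on a copy: mean for numeric columns, mode for the rest (categorical)
+df_simple = df.copy()
+df_simple[num_cols] = df_simple[num_cols].fillna(df_simple[num_cols].mean())
+df_simple = df_simple.fillna(df_simple.mode().iloc[0])

-| Date | Change | Handling | Trust |
-|------|-----------|-----------|--------|
-| 2026-05-08 | P-Reinforce Phase 1 normalization (frontmatter + header standardization) | UPDATE | A |
+# KNN (numeric columns only)
+imputer = KNNImputer(n_neighbors=5)
+df_imputed = pd.DataFrame(imputer.fit_transform(df[num_cols]), columns=num_cols, index=df.index)

-## 💻 Code Patterns
-
-**Pattern 1:** *(TODO: structural skeleton reflecting this project's conventions)*
-
-```text
-# TODO
+# MICE-style multivariate imputation (numeric columns only)
+imputer = IterativeImputer(max_iter=10, random_state=42)
+df_imputed = pd.DataFrame(imputer.fit_transform(df[num_cols]), columns=num_cols, index=df.index)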
 ```

-## 🤔 Decision Criteria
+### Outlier detection (Isolation Forest)
+```python
+from sklearn.ensemble import IsolationForest

-**When to choose option A:**
- *(TODO)*
+# X is assumed to be a numeric feature matrix (NumPy array or DataFrame)
+iso = IsolationForest(contamination=0.05, random_state=42)
+outliers = iso.fit_predict(X)  # -1 = outlier, 1 = normal

-**When to choose option B:**
- *(TODO)*
+clean = X[outliers == 1]
+```

-**Default:**
-> *(TODO)*
+### IQR (univariate)
+```python
+def iqr_outlier_remove(df, col, k=1.5):
+    Q1, Q3 = df[col].quantile(0.25), df[col].quantile(0.75)
+    IQR = Q3 - Q1
+    return df[(df[col] >= Q1 - k * IQR) & (df[col] <= Q3 + k * IQR)]
+```

-## ❌ Anti-Patterns
+### Deduplication (MinHash for huge data)
+```python
+from datasketch import MinHash, MinHashLSH

- **[Anti-pattern]:** *(TODO: what not to do + why + what to do instead)*
+def get_minhash(text, num_perm=128):
+    """Build a MinHash signature from the document's whitespace-separated words."""
+    m = MinHash(num_perm=num_perm)
+    for word in text.split():
+        m.update(word.encode('utf-8'))
+    return m
+
+# Index of signatures kept so far; 0.8 is the approximate Jaccard similarity cutoff
+lsh = MinHashLSH(threshold=0.8, num_perm=128)
+unique_docs = []
+
+# `corpus` is assumed to be an iterable of document strings
+for i, doc in enumerate(corpus):
+    m = get_minhash(doc)
+    if not lsh.query(m):  # no near-duplicate indexed yet
+        lsh.insert(f'doc{i}', m)
+        unique_docs.append(doc)
+```
+
+### Fuzzy match (Levenshtein)
+```python
+from rapidfuzz import fuzz, process
+
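+# Record matching: the 5 candidates closest to the query by edit-distance ratio.
+# `candidates` is assumed to be a list of name strings.
+matches = process.extract('John Smith', candidates, scorer=fuzz.ratio, limit=5)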
+```
+
+### Schema validation (Pandera)
+```python
+import pandera as pa
+
+schema = pa.DataFrameSchema({
+    'user_id': pa.Column(int, checks=pa.Check.greater_than(0)),
+    'email': pa.Column(str, checks=pa.Check.str_matches(r'.+@.+\..+')),
+    'age': pa.Column(int, checks=[
+        pa.Check.greater_than_or_equal_to(0),
+        pa.Check.less_than(150),
+    ]),
+})
+
+clean_df = schema.validate(df)
+```
+
+### Great Expectations (data quality framework)
+```python
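+# Written against the legacy 0.x validator API; later releases renamed or
+# reorganized these entry points, so check the docs for the installed version.
+import great_expectations as ge

+context = ge.get_context()
+suite = context.create_expectation_suite('user_data')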
+
+batch = context.get_validator(
+    batch_request=...,
+    expectation_suite=suite,
+)
+
+batch.expect_column_values_to_not_be_null('user_id')
+batch.expect_column_values_to_be_unique('user_id')
+batch.expect_column_values_to_be_between('age', 0, 150)
+batch.expect_column_value_lengths_to_be_between('email', 5, 100)
+
+results = batch.validate()
+```
+
+### LLM data curation (quality classifier)
+```python
+# Heuristic scorer, not a trained model. `detect_language` is a placeholder for
+# a language-ID helper that is assumed to be defined elsewhere.
+def quality_classifier(text, threshold=0.5):
+    """Return True if web text passes a weighted set of quality heuristics."""
+    features = {
+        'length_ok': 100 < len(text) < 10000,
+        'punctuation_ratio': sum(c in '.,!?' for c in text) / max(len(text), 1),
+        'caps_ratio': sum(c.isupper() for c in text) / max(len(text), 1),
+        'has_url_spam': sum(1 for w in text.split() if w.startswith('http')) > 5,
+        'language_ok': detect_language(text) == 'en',
+    }
+    score = (
+        features['length_ok'] * 0.3 +
+        (0.005 < features['punctuation_ratio'] < 0.05) * 0.2 +
+        (features['caps_ratio'] < 0.3) * 0.2 +
+        (not features['has_url_spam']) * 0.2 +
+        features['language_ok'] * 0.1
+    )
+    return score > threshold
+```
+
+### PII detection / removal
+```python
+import re
+
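+# Illustrative heuristics only. Order matters: the broad phone pattern runs
+# last, otherwise it would swallow SSNs and card numbers first.
+PATTERNS = {
+    'email': r'[\w.+-]+@[\w-]+\.[\w.-]+',
+    'ssn': r'\b\d{3}-\d{2}-\d{4}\b',
+    'credit_card': r'\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b',
+    'phone': r'\+?[\d\s().-]{10,}',
+}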
+
+def redact_pii(text):
+    for name, pattern in PATTERNS.items():
+        text = re.sub(pattern, f'[{name.upper()}_REDACTED]', text)
+    return text
+```
+
+### LLM-aided cleaning
+```python
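+# `llm` is a placeholder for whatever client object the project uses;
+# `generate` is not a specific library's API.
+def llm_clean(text, instruction='Fix typos and standardize format'):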
+    return llm.generate(f"""{instruction}.
+Return only the cleaned text, no explanation.
+
+Input: {text}""")
+```
+
+### Data version control (DVC)
+```bash
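+# Version the raw data, then register the cleaning step as a pipeline stage.
+# `dvc run` was removed in DVC 3.0; `dvc stage add` plus `dvc repro` replaces it.
+dvc add data/raw/users.csv
+# The stage output is tracked through dvc.yaml / dvc.lock, so it must not also be `dvc add`-ed.
+dvc stage add -n clean_users \
+    -d data/raw/users.csv \
+    -o data/cleaned/users.parquet \
+    python clean_users.py
+dvc repro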
+```
+
+### Pipeline (Airflow / Dagster)
+```python
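+import dagster
+
+# remove_duplicates, impute_missing, remove_outliers, standardize_format and
+# `schema` (e.g. the Pandera schema above) are assumed to be defined elsewhere.
+@dagster.asset
+def cleaned_users(raw_users):
+    df = raw_users
+    df = remove_duplicates(df)
+    df = impute_missing(df)
+    df = remove_outliers(df)
+    df = standardize_format(df)
+    return schema.validate(df)  # raises if the cleaned frame violates the schema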
+```
+
+## Decision criteria
+| Situation | Approach |
+|---|---|
+| Missing < 5% | Drop or simple fill |
+| Missing 5-30% | KNN / MICE |
+| Outlier univariate | IQR / Z-score |
+| Outlier multivariate | Isolation Forest |
+| Exact dup | Hash |
+| Near dup (text) | MinHash + LSH |
+| Schema | Pandera / GE |
+| LLM data | Quality classifier + dedup + PII filter |
+
+**Default**: Pandas + GE schema checks + Isolation Forest + MinHash dedup.
+
+## 🔗 Graph
+- Parent: [[MLOps]] · [[Data-Engineering]] · [[Statistics]]
+- Variants: [[Imputation]] · [[Outlier-Detection]] · [[Deduplication]] · [[Standardization]]
+- Applications: [[Pandas]] · [[Great-Expectations]] · [[Pandera]] · [[DVC]]
+- Adjacent: [[Concept-Drift]] · [[Algorithmic-Fairness]] · [[Bias-Correction-Algorithm]] · [[CV_Synthesis]]
+
+## 🤖 LLM usage
+**When**: ETL pipelines, ML data prep, curating LLM training data.
+**When not**: sources that are already clean or already validated.
+
+## ❌ Anti-patterns
+- **No schema**: drift goes unnoticed.
+- **Always dropping missing values**: missingness can be informative.
+- **Single-method outlier detection**: prone to false positives.
+- **No versioning**: results are not reproducible.
+- **Leaking PII**: risks a GDPR violation.
+- **Raw → train (no cleaning)**: the model learns garbage.
+
+## 🧪 Validation / duplicates
+- Verified (Pandas docs, scikit-learn, Great Expectations, Llama-3 paper data curation).
+- Trust level A.
+- Related: [[Concept-Drift]] · [[Algorithmic-Fairness]] · [[Bias-Correction-Algorithm]] · [[Algorithmic-Biology]].
+
+## 🕓 Changelog
+| Date | Change |
+|---|---|
+| 2026-05-08 | Phase 1 |
+| 2026-05-10 | Manual cleanup: tasks + KNN / MICE / Isolation Forest / MinHash / GE / PII code |