---
id: wiki-2026-0508-pre-processing-data-for-ai
title: Pre-processing Data for AI
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Data Preprocessing, Feature Engineering, Data Cleaning]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [data-preprocessing, feature-engineering, ml, sklearn, pipeline]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: scikit-learn, pandas, polars
---

# Pre-processing Data for AI

## In One Line

> **"Transform raw data into a model-consumable form: clean, scale, encode, impute."** A commonly quoted rule of thumb is that about 80% of the time in an ML pipeline is spent here. sklearn `Pipeline` + `ColumnTransformer` is the standard tooling; a modern stack pairs polars with sklearn, or applies transforms inside a PyTorch `Dataset`.

## Core

### Stages

1. **Cleaning**: remove duplicates, fix types, handle outliers.
2. **Missing imputation**: mean/median/mode, KNN, MICE, model-based.
3. **Encoding**: categorical → numeric (one-hot, target, ordinal, embedding).
4. **Scaling**: normalize numeric ranges (standard, min-max, robust).
5. **Feature engineering**: domain features, interactions, polynomial terms, time lags.
6. **Splitting**: train/val/test. Preventing leakage is the key concern, so the split must happen before any transformer is fitted.

### Leakage Prevention Principles

- Fit transforms on **train only**, then apply them to val/test.
- Put every step inside an sklearn `Pipeline`, so cross-validation refits the transforms on each fold's training rows.
- For time series, use a chronological split.

### Applications

1. Input preparation for tabular ML (XGBoost / LightGBM / CatBoost).
2. NLP: tokenization, truncation, padding.
3. Vision augmentation (flip, crop, mixup, RandAugment).
4. Time-series lag features and rolling statistics.

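The leakage rules in this section can be sketched in a few lines. The example below is a minimal illustration, not a reference implementation: the data are synthetic, and the names `model`, `scores`, and `folds` are hypothetical. It shows a scaler placed inside a pipeline so that cross-validation refits it per fold, and a chronological splitter whose training indices always precede its test indices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, purely illustrative (assumption: rows are in time order).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# The scaler lives inside the pipeline, so each CV fold refits it on that
# fold's training rows only; validation rows never influence its statistics.
model = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(model, X, y, cv=5)

# Chronological split: in every fold, all training indices come before
# all test indices, so the model never trains on the future.
tscv = TimeSeriesSplit(n_splits=4)
folds = [(train_idx, test_idx) for train_idx, test_idx in tscv.split(X)]
```

Passing `cv=tscv` to `cross_val_score` would combine both ideas for time-ordered data.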
## 💻 Patterns

### sklearn Pipeline + ColumnTransformer (canonical)

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingClassifier

# X_train is assumed to be a pandas DataFrame containing these columns.
num = ['age', 'income']
cat = ['city', 'plan']

# Numeric columns: median imputation, then standardization.
num_pipe = Pipeline([
    ('imp', SimpleImputer(strategy='median')),
    ('sc', StandardScaler()),
])
# Categorical columns: mode imputation, then one-hot encoding;
# categories unseen during fit are encoded as all zeros.
cat_pipe = Pipeline([
    ('imp', SimpleImputer(strategy='most_frequent')),
    ('oh', OneHotEncoder(handle_unknown='ignore')),
])

pre = ColumnTransformer([('num', num_pipe, num), ('cat', cat_pipe, cat)])
clf = Pipeline([('pre', pre), ('m', GradientBoostingClassifier())])
clf.fit(X_train, y_train)
```

### Train/val/test split (no leak)

```python
from sklearn.model_selection import train_test_split

# First hold out 20% as the test set, then take 20% of the remainder as
# validation: roughly 64% train / 16% val / 20% test overall.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X_tr, y_tr, test_size=0.2, stratify=y_tr, random_state=42)
```

### Target encoding (high-cardinality categorical)

```python
from category_encoders import TargetEncoder

te = TargetEncoder(cols=['zip_code'])
# Fit on train only: the encoding is derived from y, so fitting on all
# rows would leak test labels into the feature.
X_train_enc = te.fit_transform(X_train, y_train)
X_test_enc = te.transform(X_test)
```

### Outlier handling (winsorize + RobustScaler)

```python
from sklearn.preprocessing import RobustScaler

# Winsorize by clipping to the 1st/99th percentiles learned from train only.
lo, hi = X_train['amount'].quantile([0.01, 0.99])
X_train['amount'] = X_train['amount'].clip(lo, hi)
X_test['amount'] = X_test['amount'].clip(lo, hi)

# RobustScaler centers on the median and scales by the IQR; fit on train only.
rs = RobustScaler()
X_train[['amount']] = rs.fit_transform(X_train[['amount']])
X_test[['amount']] = rs.transform(X_test[['amount']])
```

### Time-series lag / rolling features (polars)

```python
import polars as pl

df = (
    df.sort('ts')  # lags are only meaningful in chronological order
    .with_columns([
        pl.col('y').shift(1).alias('y_lag1'),
        pl.col('y').shift(7).alias('y_lag7'),
        # Shift before rolling so the window covers past values only;
        # a window that includes the current row leaks y when y is the target.
        pl.col('y').shift(1).rolling_mean(window_size=7).alias('y_ma7'),
    ])
)
```

### Image augmentation (torchvision v2)

```python
from torchvision.transforms import v2
import torch

train_tf = v2.Compose([
    v2.ToImage(),  # convert PIL/ndarray input to a tensor image first
    v2.RandomResizedCrop(224, antialias=True),
    v2.RandomHorizontalFlip(),
    v2.RandAugment(),
    v2.ToDtype(torch.float32, scale=True),  # uint8 [0, 255] -> float [0, 1]
    # ImageNet channel statistics, as used by most pretrained backbones.
    v2.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```

### KNN imputer (correlated missing)

```python
from sklearn.impute import KNNImputer

imp = KNNImputer(n_neighbors=5)
# Fit on train only, then reuse the fitted imputer on test.
X_train_imp = imp.fit_transform(X_train)
X_test_imp = imp.transform(X_test)
```

### Iterative (MICE) imputer

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to enable the import below)
from sklearn.impute import IterativeImputer

imp = IterativeImputer(max_iter=10, random_state=0)
# Fit on train only, then reuse the fitted imputer on test.
X_train_imp = imp.fit_transform(X_train)
X_test_imp = imp.transform(X_test)
```

## Decision Criteria

| Situation | Approach |
|---|---|
| Numeric, Gaussian-like | StandardScaler |
| Numeric with outliers | RobustScaler or winsorize |
| Numeric, bounded [0, 1] range required | MinMaxScaler |
| Low-cardinality categorical | OneHotEncoder |
| High-cardinality categorical | Target encoding or embedding |
| Tree-based (XGBoost) | No scaling needed; ordinal or native categorical encoding is fine |
| Time-series | Lag/rolling features, chronological split |
| Image | torchvision/timm augmentation |

**Defaults**: sklearn Pipeline + ColumnTransformer for tabular data, torchvision v2 for vision.

## 🔗 Graph

- Parent: [[Machine_Learning]] · [[Data_Engineering]]
- Variants: [[Feature_Engineering]] · [[Feature_Scaling]]
- Adjacent: [[Data_Cleaning]] · [[Imbalanced_Data]]

## 🤖 LLM Usage

**When**: input preparation for tabular ML, time-series feature generation, designing image/text augmentation pipelines.

**When not**: modern end-to-end deep learning on raw input, where the CNN/Transformer learns the representation itself. With a pretrained model, apply only the normalization it was trained with.

## ❌ Anti-patterns

- **Fit scaler on full data before split**: leakage. Val/test statistics leak into the scaler used for training.
- **One-hot 1000+ category**: the feature matrix explodes into sparse columns. Use target encoding or an embedding.
- **Drop all NaN rows**: can discard a large share of the data. Use imputation or a missing indicator.
- **Standardize tree-based input**: pointless, because tree splits are invariant to feature scale.
- **Same augmentation on val**: augment train only; keep val/test transforms deterministic.

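As a small illustration of the "Drop all NaN rows" anti-pattern above, the sketch below contrasts row dropping with median imputation plus missing-indicator columns. The toy array is made up for this example, and `add_indicator=True` is one way (not the only one) to keep the "was missing" signal.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (hypothetical): two features, each with one missing value.
X = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
    [5.0, 40.0],
])

# Anti-pattern: dropping every row that has a NaN keeps only 2 of 4 rows.
kept = X[~np.isnan(X).any(axis=1)]

# Alternative: impute with the column median and append one binary indicator
# column per feature that had missing values during fit. All 4 rows survive,
# and the model can still see which values were originally missing.
imp = SimpleImputer(strategy="median", add_indicator=True)
X_imp = imp.fit_transform(X)
```

In a real pipeline the imputer would be fitted on the training split only, in line with the leakage rules earlier in this note.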
## 🧪 Verification / Duplicates

- Verified (sklearn user guide preprocessing, pandas docs, polars docs, torchvision v2 transforms).
- Trust level A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: preprocessing stages + sklearn Pipeline + leakage rules |