id: wiki-2026-0508-pre-processing-data-for-ai
title: Pre-processing Data for AI
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: Data Preprocessing, Feature Engineering, Data Cleaning
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: data-preprocessing, feature-engineering, ml, sklearn, pipeline
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language = python; framework = scikit-learn, pandas, polars
Pre-processing Data for AI
One-line summary
"Convert raw data into a model-consumable form: clean, scale, encode, impute." Preprocessing is commonly said to take around 80% of the time spent on an ML pipeline. For tabular data, sklearn Pipeline + ColumnTransformer is the standard; a modern stack pairs polars with sklearn, or applies transforms inside a PyTorch Dataset.
Core concepts
Stages
Cleaning: remove duplicates, fix types, handle outliers.
Missing-value imputation: mean/median/mode, KNN, MICE, model-based.
Encoding: categorical → numeric (one-hot, target, ordinal, embedding).
Scaling: normalize numeric ranges (standard, min-max, robust).
Feature engineering: domain features, interactions, polynomial terms, time lags.
Splitting: train/val/test; preventing leakage is the key concern.
Leakage prevention principles
Fit transforms on train only, then apply them to val/test.
Keep every step inside an sklearn Pipeline so that cross-validation refits it on each fold's training data.
For time series, use a chronological split.
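The leakage rules above can be sketched as follows. This is a minimal illustration on synthetic data, assuming a Ridge model and five folds (both are arbitrary choices, not taken from the note): the scaler sits inside the pipeline, and TimeSeriesSplit keeps training rows strictly earlier than validation rows.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # synthetic features, rows assumed to be in time order
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# The scaler is a pipeline step, so each fold refits it on that fold's training rows only.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))

# TimeSeriesSplit never lets a training index come after a validation index.
cv = TimeSeriesSplit(n_splits=5)
folds = list(cv.split(X))
scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
```

Passing the whole pipeline to cross_val_score, rather than pre-scaled data, is what keeps validation statistics out of the fitted scaler.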
Applications
Input preparation for tabular ML (XGBoost / LightGBM / CatBoost).
NLP tokenization + truncation + padding.
Vision augmentation (flip, crop, mixup, RandAugment).
Time-series lag features and rolling statistics.
💻 Patterns
sklearn Pipeline + ColumnTransformer (canonical)
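A minimal sketch of this pattern, assuming a toy frame whose column names (age, income, city, churn) are purely illustrative and a LogisticRegression as a placeholder model:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 51, 46, 38, 29, 60],
    "income": [40e3, 52e3, 61e3, np.nan, 88e3, 57e3, 43e3, 95e3],
    "city": ["seoul", "busan", "seoul", np.nan, "daegu", "busan", "seoul", "daegu"],
    "churn": [0, 0, 1, 1, 0, 1, 0, 1],
})
num_cols, cat_cols = ["age", "income"], ["city"]

# Numeric branch: median imputation, then standardization.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical branch: mode imputation, then one-hot; unseen categories encode as all zeros.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric, num_cols),
    ("cat", categorical, cat_cols),
])
clf = Pipeline([("prep", preprocess), ("model", LogisticRegression(max_iter=1000))])

X, y = df[num_cols + cat_cols], df["churn"]
clf.fit(X, y)
Xt = clf.named_steps["prep"].transform(X)  # 2 scaled numeric columns + 3 one-hot columns
```

Because imputation, scaling, and encoding are all steps of one estimator, the same object can be passed to cross-validation or grid search without any preprocessing being fitted outside the fold.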
Train/val/test split (no leak)
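One way to sketch a leak-free split, assuming synthetic data and a 700/150/150 partition (the sizes are an arbitrary choice for illustration): split first, then fit the scaler on the training portion only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(loc=10.0, scale=3.0, size=(1000, 4))
y = rng.integers(0, 2, size=1000)

# Hold out the test set first, then carve validation out of the remainder.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=150, stratify=y, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=150, stratify=y_tmp, random_state=0
)

# Mean and std come from the training rows only; val/test are merely transformed.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)
X_test_s = scaler.transform(X_test)
```

The validation and test means will not be exactly zero after scaling, which is expected: they are measured against training statistics.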
Target encoding (high-cardinality categorical)
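A hand-rolled sketch of out-of-fold target encoding with additive smoothing. The helper `target_encode_oof` and its parameters are hypothetical names introduced here, not a library API; recent scikit-learn versions also ship a built-in `TargetEncoder` that could be used instead.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def target_encode_oof(train_cat, train_y, test_cat, n_splits=5, smoothing=10.0, seed=0):
    """Out-of-fold smoothed mean encoding (illustrative helper, not a library function)."""
    train_cat = pd.Series(train_cat).reset_index(drop=True)
    train_y = pd.Series(train_y, dtype=float).reset_index(drop=True)
    prior = train_y.mean()

    def fit_map(cat, y):
        # Per-category mean, shrunk toward the global mean of the rows it was fitted on.
        stats = y.groupby(cat).agg(["sum", "count"])
        return (stats["sum"] + smoothing * y.mean()) / (stats["count"] + smoothing)

    # Each training row is encoded by a mapping fitted on the other folds only.
    oof = pd.Series(np.nan, index=train_cat.index)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in folds.split(train_cat):
        mapping = fit_map(train_cat.iloc[fit_idx], train_y.iloc[fit_idx])
        oof.iloc[enc_idx] = train_cat.iloc[enc_idx].map(mapping).to_numpy()
    oof = oof.fillna(prior)  # categories missing from the fitting folds fall back to the prior

    # Test rows use a mapping fitted on all training rows.
    test_enc = pd.Series(test_cat).map(fit_map(train_cat, train_y)).fillna(prior)
    return oof.to_numpy(), test_enc.to_numpy()

rng = np.random.default_rng(0)
cats = rng.choice([f"c{i}" for i in range(50)], size=500)
y = rng.integers(0, 2, size=500)
train_enc, test_enc = target_encode_oof(cats[:400], y[:400], cats[400:])
```

Encoding training rows out of fold matters because a row's own label would otherwise leak into its feature value.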
Outlier handling (winsorize + RobustScaler)
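A small sketch of this pattern on synthetic data, assuming 1st/99th percentile clipping limits (the percentiles are an illustrative choice). Winsorizing is done here with `np.clip` so that the limits learned on train can be reused on test.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(1)
X_train = rng.normal(size=(500, 2))
X_train[:3] = 1000.0  # inject a few extreme outliers
X_test = rng.normal(size=(100, 2))
X_test[0] = -1000.0

# Winsorize: clip each column to percentile limits learned from train only.
lo, hi = np.percentile(X_train, [1, 99], axis=0)
X_train_w = np.clip(X_train, lo, hi)
X_test_w = np.clip(X_test, lo, hi)

# RobustScaler centers on the median and scales by the interquartile range.
scaler = RobustScaler().fit(X_train_w)
X_train_s = scaler.transform(X_train_w)
X_test_s = scaler.transform(X_test_w)
```

Either step can also be used alone: RobustScaler by itself leaves outliers in place but stops them from distorting the scale of the remaining values.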
Time-series lag / rolling features (polars)
Image augmentation (torchvision v2)
KNN imputer (correlated missing)
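A minimal sketch with scikit-learn's `KNNImputer` on a tiny array; `n_neighbors=2` is an illustrative setting. Each missing entry is filled with the mean of that feature over the nearest rows, with distances computed on the features both rows have observed.

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Fill each NaN with the average of that column over the 2 nearest rows.
imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_filled = imputer.fit_transform(X)
```

As with scalers, the imputer should be fitted on training data only, for example as a Pipeline step.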
Iterative (MICE) imputer
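A sketch using scikit-learn's `IterativeImputer`, which models each feature with missing values as a function of the others in round-robin fashion. It is still marked experimental, so the explicit enabling import is required. The synthetic data (a second column that is roughly twice the first) and the 20% missing rate are assumptions for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (unlocks the import below)
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
X_full = np.column_stack([x1, 2.0 * x1 + rng.normal(scale=0.1, size=300)])

# Knock out roughly 20% of the second column.
X_missing = X_full.copy()
mask = rng.random(300) < 0.2
X_missing[mask, 1] = np.nan

# The default estimator regresses each incomplete feature on the others.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X_missing)
```

Because the two columns are strongly correlated, the imputed values should land close to the held-out originals; on weakly correlated features the gain over mean imputation would be smaller.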
Decision criteria
| Situation | Approach |
| --- | --- |
| Numeric, gaussian-like | StandardScaler |
| Numeric with outliers | RobustScaler or winsorize |
| Numeric, must be bounded to [0,1] | MinMaxScaler |
| Low-cardinality categorical | OneHotEncoder |
| High-cardinality categorical | Target encoding or embedding |
| Tree-based (XGBoost) | No scaling needed; ordinal or native categorical encoding is fine |
| Time-series | Lag/rolling features, chronological split |
| Image | torchvision/timm augmentation |
Default: sklearn Pipeline + ColumnTransformer for tabular data, torchvision v2 for vision.
🔗 Graph
🤖 LLM usage
When to use: input prep for tabular ML, time-series feature generation, designing image/text augmentation pipelines.
When not to use: modern deep learning that goes end-to-end from raw input (the CNN/Transformer learns the representation itself); with a pretrained model, apply only the normalization it expects.
❌ Anti-patterns
Fit scaler on full data before split: leakage; val/test statistics end up in the train scaler.
One-hot encoding 1000+ categories: the sparse feature space explodes. Use target encoding or an embedding instead.
Drop all NaN rows: can discard a large share of the data. Prefer imputation or a missing indicator.
Standardize tree-based input: pointless, since tree splits are invariant to feature scale.
Same augmentation on val: augment train only; keep val/test deterministic.
🧪 Verification / duplicates
Verified (sklearn user guide preprocessing, pandas docs, polars docs, torchvision v2 transforms).
Trust level: A.
🕓 Changelog
| Date | Change |
| --- | --- |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: preprocessing stages + sklearn Pipeline + leakage rules |