2nd/10_Wiki/Topics/AI_and_ML/Pre-processing-Data-for-AI.md
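koriweb d8a80f6272 chore(wiki): canonical normalization of dangling links (768 files / 1,200 links)
Replaced [[wikilinks]] that differed only in name (spelling variants) with the canonical title of the target document,
reconnecting 1,200 broken links. Only normalized title/filename matches were applied; alias matching was
excluded because of the over-merge risk (ambiguity guard). Originals are backed up in _link_reconcile_backup/.
Tool: Datacollect/scripts/link_reconcile_apply.mjs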

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 12:24:15 +09:00


id: wiki-2026-0508-pre-processing-data-for-ai
title: Pre-processing Data for AI
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: Data Preprocessing, Feature Engineering, Data Cleaning
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: data-preprocessing, feature-engineering, ml, sklearn, pipeline
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language: python; framework: scikit-learn, pandas, polars

Pre-processing Data for AI

One-line summary

"Convert raw data into a model-consumable form: clean, scale, encode, impute." A commonly cited rule of thumb puts about 80% of the time in an ML pipeline at this stage. sklearn Pipeline + ColumnTransformer is the standard; a modern stack is polars + sklearn, or transforms inside a PyTorch Dataset.

Key points

Stages

  1. Cleaning: remove duplicates, correct types, handle outliers.
  2. Missing imputation: mean/median/mode, KNN, MICE, model-based.
  3. Encoding: categorical → numeric (one-hot, target, ordinal, embedding).
  4. Scaling: normalize numeric ranges (standard, minmax, robust).
  5. Feature engineering: domain features, interactions, polynomial terms, time lags.
  6. Splitting: train/val/test. Preventing leakage is the key concern.
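
The cleaning stage has no dedicated pattern in this note, so here is a minimal pandas sketch of it. The table, the column names, and the income cap are all made up for illustration; a real cap should come from domain knowledge.

```python
import pandas as pd

# Hypothetical raw table: one exact duplicate row, numbers stored as strings,
# and one unparseable value.
raw = pd.DataFrame({
    "user_id": [1, 2, 2, 3, 4],
    "age": ["34", "27", "27", "n/a", "41"],
    "income": ["52000", "61000", "61000", "48000", "9900000"],
})

# Duplicate removal: drop exact duplicate rows.
clean = raw.drop_duplicates().reset_index(drop=True)

# Type correction: coerce strings to numbers; unparseable values become NaN
# so that the imputation stage can deal with them later.
clean["age"] = pd.to_numeric(clean["age"], errors="coerce")
clean["income"] = pd.to_numeric(clean["income"], errors="coerce")

# Outlier handling: cap income at an assumed domain limit (illustrative only).
INCOME_CAP = 500_000
clean["income"] = clean["income"].clip(upper=INCOME_CAP)
```

Coercing to NaN rather than dropping the row keeps the decision about missing values in the imputation stage, where it can be fit on the training split only.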

Leakage prevention principles

  • Fit transforms on train only, then apply them to val/test.
  • Put every step inside an sklearn Pipeline so that cross-validation stays safe.
  • For time series, use a chronological split.
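
The first two rules can be sketched as follows. The data is synthetic and LogisticRegression is only a stand-in model; the point is where `fit` is called, not the scores.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y = (X[:, 0] + rng.normal(scale=5.0, size=200) > 50.0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Rule 1: scaling statistics come from the training split only.
scaler = StandardScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)  # reuses the train mean/std

# Rule 2: inside a Pipeline the scaler is refit on each CV training fold,
# so the held-out fold never influences the scaling statistics.
pipe = Pipeline([("sc", StandardScaler()), ("m", LogisticRegression())])
scores = cross_val_score(pipe, X_tr, y_tr, cv=5)
```

Passing pre-scaled data to `cross_val_score` instead of the Pipeline would reintroduce the leak, because the scaler would already have seen every fold.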

Applications

  1. Input preparation for tabular ML (XGBoost / LightGBM / CatBoost).
  2. NLP tokenization + truncation + padding.
  3. Vision augmentation (flip, crop, mixup, RandAugment).
  4. Time-series lag features and rolling statistics.
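
The NLP item (tokenization + truncation + padding) has no pattern in this note. The sketch below shows the mechanics in plain Python with a whitespace tokenizer. The helper names `build_vocab` and `encode` and the special-token ids are hypothetical; a real project would typically use a subword tokenizer from its model's library instead.

```python
from typing import Dict, List, Tuple

PAD_ID, UNK_ID = 0, 1  # assumed special-token ids for this sketch


def build_vocab(texts: List[str]) -> Dict[str, int]:
    """Map each whitespace token to an integer id, reserving 0 and 1."""
    vocab: Dict[str, int] = {}
    for text in texts:
        for tok in text.lower().split():
            vocab.setdefault(tok, len(vocab) + 2)
    return vocab


def encode(text: str, vocab: Dict[str, int], max_len: int) -> Tuple[List[int], List[int]]:
    """Tokenize, truncate to max_len, right-pad, and return (ids, attention mask)."""
    ids = [vocab.get(tok, UNK_ID) for tok in text.lower().split()][:max_len]
    mask = [1] * len(ids)
    pad = max_len - len(ids)
    return ids + [PAD_ID] * pad, mask + [0] * pad


# The vocabulary is built from training text only, in line with the leakage rules.
vocab = build_vocab(["the cat sat", "the dog sat down quickly"])
short_ids, short_mask = encode("the cat", vocab, max_len=4)
long_ids, long_mask = encode("the dog sat down quickly", vocab, max_len=4)
unk_ids, _ = encode("the bird", vocab, max_len=4)
```

Every sequence comes out with length `max_len`, and the mask marks which positions hold real tokens so a model can ignore the padding.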

💻 Patterns

sklearn Pipeline + ColumnTransformer (canonical)

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingClassifier

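num = ['age', 'income']   # numeric columns
cat = ['city', 'plan']    # categorical columns

# Numeric branch: median imputation, then standardization.
num_pipe = Pipeline([
    ('imp', SimpleImputer(strategy='median')),
    ('sc',  StandardScaler()),
])
# Categorical branch: mode imputation, then one-hot (unseen categories encode as all zeros).
cat_pipe = Pipeline([
    ('imp', SimpleImputer(strategy='most_frequent')),
    ('oh',  OneHotEncoder(handle_unknown='ignore')),
])
pre = ColumnTransformer([('num', num_pipe, num), ('cat', cat_pipe, cat)])

# Preprocessing and model are fit together, on the training split only.
clf = Pipeline([('pre', pre), ('m', GradientBoostingClassifier())])
clf.fit(X_train, y_train)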

Train/val/test split (no leak)

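from sklearn.model_selection import train_test_split
# First split: hold out 20% as the test set.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=42)
# Second split: 20% of the remainder as validation (64% train / 16% val / 20% test overall).
X_tr, X_val, y_tr, y_val = train_test_split(X_tr, y_tr, test_size=0.2,
                                            stratify=y_tr, random_state=42)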
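
The random stratified split above shuffles rows, which does not suit time-ordered data; the leakage rules call for a chronological split there. A minimal sketch, assuming the rows are already sorted by timestamp (the arrays are placeholders):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical series: 100 observations already sorted by timestamp.
n = 100
X = np.arange(n).reshape(-1, 1)
y = np.arange(n)

# Simple holdout: oldest 60% train, next 20% validation, newest 20% test.
train_end, val_end = int(n * 0.6), int(n * 0.8)
X_tr, X_val, X_te = X[:train_end], X[train_end:val_end], X[val_end:]

# Expanding-window cross-validation: each fold trains on the past only.
tscv = TimeSeriesSplit(n_splits=4)
folds = [(tr_idx, va_idx) for tr_idx, va_idx in tscv.split(X)]
```

In every fold the largest training index is smaller than the smallest validation index, so the model is never evaluated on data older than what it was trained on.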

Target encoding (high-cardinality categorical)

from category_encoders import TargetEncoder
te = TargetEncoder(cols=['zip_code'])
# Fit on train only (the encoder uses y_train); test rows are transformed without labels.
X_train_enc = te.fit_transform(X_train, y_train)
X_test_enc  = te.transform(X_test)
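
Target encoding uses the label, so encoding a training row with a mean that includes its own label can leak. One common mitigation is out-of-fold encoding. The sketch below does it by hand with pandas and KFold on made-up data; it applies no smoothing and is not meant to reproduce what `category_encoders` does internally.

```python
import pandas as pd
from sklearn.model_selection import KFold

# Made-up training data for illustration.
train = pd.DataFrame({
    "zip_code": ["a", "a", "a", "b", "b", "b", "c", "c"],
    "y":        [1,   0,   1,   0,   0,   1,   1,   1],
})
global_mean = train["y"].mean()

# Out-of-fold encoding: each row is encoded with category means computed
# from the other folds, so a row never sees its own label.
train["zip_te"] = global_mean
kf = KFold(n_splits=4, shuffle=True, random_state=0)
for fit_idx, enc_idx in kf.split(train):
    fold_means = train.iloc[fit_idx].groupby("zip_code")["y"].mean()
    encoded = train.iloc[enc_idx]["zip_code"].map(fold_means).fillna(global_mean)
    train.loc[train.index[enc_idx], "zip_te"] = encoded.to_numpy()

# Validation/test rows use means from the full training set;
# unseen categories fall back to the global mean.
full_means = train.groupby("zip_code")["y"].mean()
test = pd.DataFrame({"zip_code": ["a", "c", "z"]})
test["zip_te"] = test["zip_code"].map(full_means).fillna(global_mean)
```

The fallback to the global mean covers both categories absent from a fold and categories never seen in training.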

Outlier handling (winsorize + RobustScaler)

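from scipy.stats import mstats
import numpy as np
from sklearn.preprocessing import RobustScaler

# Winsorize: clip the lowest and highest 1% of training values to the 1st/99th percentile.
X_train['amount'] = np.asarray(
    mstats.winsorize(X_train['amount'].to_numpy(), limits=[0.01, 0.01]))

# RobustScaler uses median and IQR; fit on train only, then reuse on test (no leak).
rs = RobustScaler()
X_train[['amount']] = rs.fit_transform(X_train[['amount']])
X_test[['amount']] = rs.transform(X_test[['amount']])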

Time-series lag / rolling features (polars)

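import polars as pl
df = (
    df.sort('ts')  # lag/rolling features assume chronological order
      .with_columns([
          pl.col('y').shift(1).alias('y_lag1'),   # value one step back
          pl.col('y').shift(7).alias('y_lag7'),   # value seven steps back
          # Shift before rolling so the window covers past values only;
          # a window that includes the current y leaks the target.
          pl.col('y').shift(1).rolling_mean(window_size=7).alias('y_ma7'),
      ])
)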

Image augmentation (torchvision v2)

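from torchvision.transforms import v2
import torch
train_tf = v2.Compose([
    v2.ToImage(),  # PIL/ndarray -> tensor image; Normalize does not accept PIL input
    v2.RandomResizedCrop(224, antialias=True),
    v2.RandomHorizontalFlip(),
    v2.RandAugment(),
    v2.ToDtype(torch.float32, scale=True),  # uint8 [0, 255] -> float32 [0, 1]
    # ImageNet channel statistics, matching common pretrained backbones.
    v2.Normalize(mean=[0.485,0.456,0.406], std=[0.229,0.224,0.225]),
])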

KNN imputer (correlated missing)

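from sklearn.impute import KNNImputer
imp = KNNImputer(n_neighbors=5)
# Neighbors are searched in the training split only; test rows are filled from them.
X_train_imp = imp.fit_transform(X_train)
X_test_imp = imp.transform(X_test)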

Iterative (MICE) imputer

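from sklearn.experimental import enable_iterative_imputer  # noqa (required to expose IterativeImputer)
from sklearn.impute import IterativeImputer
imp = IterativeImputer(max_iter=10, random_state=0)
# Each feature is modeled from the others; fit on train only, then apply to test.
X_train_imp = imp.fit_transform(X_train)
X_test_imp = imp.transform(X_test)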

Decision criteria

| Situation | Approach |
|---|---|
| Numeric, Gaussian-like | StandardScaler |
| Numeric with outliers | RobustScaler or winsorize |
| Numeric, must be bounded to [0,1] | MinMaxScaler |
| Low-cardinality categorical | OneHotEncoder |
| High-cardinality categorical | Target encoding or embedding |
| Tree-based (XGBoost) | Scaling unnecessary; ordinal or native categorical encoding is fine |
| Time-series | Lag/rolling features, chronological split |
| Image | torchvision/timm augmentation |

Default: sklearn Pipeline + ColumnTransformer for tabular data, torchvision v2 for vision.
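
The tree-based entry can be checked empirically. A decision tree splits on thresholds, so a per-feature affine rescaling should leave its predictions essentially unchanged. The sketch below uses synthetic data and a single sklearn DecisionTreeClassifier as a stand-in for boosted trees; that substitution is an assumption of the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic features on very different scales, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4)) * np.array([1.0, 100.0, 1000.0, 10.0])
y = (X[:, 0] + X[:, 1] / 100.0 > 0).astype(int)

X_train, X_test = X[:200], X[200:]
y_train = y[:200]

# Tree fit on raw features.
raw_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Same tree fit on standardized features (scaler fit on train only).
scaler = StandardScaler().fit(X_train)
scaled_tree = DecisionTreeClassifier(random_state=0).fit(
    scaler.transform(X_train), y_train
)

pred_raw = raw_tree.predict(X_test)
pred_scaled = scaled_tree.predict(scaler.transform(X_test))

# Fraction of test rows where both trees predict the same class.
agreement = float((pred_raw == pred_scaled).mean())
```

If `agreement` is at or near 1.0, the scaler adds pipeline complexity without changing the model, which is the reason the table lists scaling as unnecessary for trees.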

🔗 Graph

🤖 LLM usage

When to use: input prep for tabular ML, time-series feature generation, designing image/text augmentation pipelines.
When not to use: modern deep learning that trains end-to-end from raw input (the CNN/Transformer learns the representation itself). With a pretrained model, only normalization is needed.

Anti-patterns

  • Fitting the scaler on the full data before splitting: leakage. Val/test information leaks into the train scaler.
  • One-hot encoding 1000+ categories: the sparse matrix explodes. Use target encoding or an embedding.
  • Dropping all rows with NaN: large data loss. Use imputation or a missing indicator.
  • Standardizing tree-based input: pointless, since trees are scale-invariant.
  • Same augmentation on val: augment train only; keep val/test deterministic.

🧪 Verification / Duplicates

  • Verified (sklearn user guide preprocessing, pandas docs, polars docs, torchvision v2 transforms).
  • Trust level: A.

🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: preprocessing stages + sklearn Pipeline + leakage rules |