---
id: wiki-2026-0508-feature-engineering
title: Feature Engineering
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [FE, feature engineering, target encoding, feature crossing, feature store]
duplicate_of: none
source_trust_level: A
confidence_score: 0.98
verification_status: applied
tags: [machine-learning, feature-engineering, preprocessing, target-encoding, feature-store]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: pandas / scikit-learn / Featuretools / Feast
---

# Feature Engineering

## One-liner

> **"Transforming raw data into model-ready features."** Covers numerical scaling, categorical encoding, datetime, text, and interaction features. Deep learning learns representations automatically, but feature engineering is still critical for tabular data. In production, serve features from a feature store (Feast).

## Core

### Numerical

- **Scaling**: standard, minmax, robust.
- **Power**: log, Box-Cox, Yeo-Johnson.
- **Bin / discretize**: equal-width, quantile.
- **Polynomial / interaction**.

### Categorical

- **One-hot**: low cardinality.
- **Label / ordinal**: ordered categories.
- **Target encoding** (mean): high cardinality; guard against leakage.
- **Hashing trick**: fixed output dimension.
- **Embedding**: for neural networks.

### Datetime

- **Cyclic** (sin/cos for hour/day).
- **Lag features** (time series).
- **Rolling stats**.
- **Holiday / weekend**.

### Text

- **Bag of words / TF-IDF**.
- **N-grams**.
- **Embeddings** (BERT, sentence-transformers).
- **LLM features**.

### Applications

1. **Tabular ML**: critical.
2. **Time series**: lag / rolling features.
3. **NLP**: embeddings.
4. **Graph**: graph features (node degree, ...).
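The power and binning transforms listed under numerical features can be sketched as follows. This is a minimal illustration on synthetic data: the arrays `x_train` / `x_test` and the choice of five bins are assumptions made for the example, not taken from any real dataset.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
# Hypothetical right-skewed feature (e.g. purchase amounts); synthetic data.
x_train = rng.lognormal(mean=3.0, sigma=1.0, size=(1000, 1))
x_test = rng.lognormal(mean=3.0, sigma=1.0, size=(200, 1))

# log1p: simple transform for non-negative, right-skewed values.
x_train_log = np.log1p(x_train)

# Yeo-Johnson: lambda is estimated on the training split only,
# then the same fitted transform is applied to the test split.
pt = PowerTransformer(method="yeo-johnson", standardize=True).fit(x_train)
x_train_yj = pt.transform(x_train)
x_test_yj = pt.transform(x_test)

# Quantile binning: 4 inner edges learned on train give 5 bins coded 0..4.
edges = np.quantile(x_train.ravel(), [0.2, 0.4, 0.6, 0.8])
train_bins = np.digitize(x_train.ravel(), edges)
test_bins = np.digitize(x_test.ravel(), edges)
```

As with scaling, the transform parameters (lambda, bin edges) come from the training split and are reused unchanged on the test split.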
## 💻 Patterns

### Standard scale

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train)      # fit on the training split only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)    # reuse the train statistics
```

### Cyclic datetime

```python
import numpy as np

def encode_cyclic(value, max_value):
    """Map a periodic value onto the unit circle so that max_value wraps to 0."""
    angle = 2 * np.pi * value / max_value
    return np.sin(angle), np.cos(angle)

df['hour_sin'], df['hour_cos'] = encode_cyclic(df.hour, 24)
df['dow_sin'], df['dow_cos'] = encode_cyclic(df.day_of_week, 7)
```

### Target encoding (with smoothing)

```python
def target_encode(train, test, col, target, smoothing=10):
    global_mean = train[target].mean()
    agg = train.groupby(col)[target].agg(['mean', 'count'])
    # Shrink rare categories toward the global mean
    smoothed = (agg['count'] * agg['mean'] + smoothing * global_mean) / (agg['count'] + smoothing)
    # Categories unseen in train fall back to the global mean
    return test[col].map(smoothed).fillna(global_mean)
```

### Out-of-fold target encoding (no leakage)

```python
import numpy as np
from sklearn.model_selection import KFold

def oof_target_encode(X, y, col, n_folds=5):
    enc = np.zeros(len(X))
    for tr_idx, val_idx in KFold(n_folds, shuffle=True).split(X):
        y_tr = y.iloc[tr_idx]
        # Category means from the training folds only (positional, index-safe)
        means = y_tr.groupby(X[col].iloc[tr_idx].to_numpy()).mean()
        # Encode the held-out fold; unseen categories get the training-fold mean
        enc[val_idx] = X[col].iloc[val_idx].map(means).fillna(y_tr.mean()).to_numpy()
    return enc
```

### Lag features (time series)

```python
def lag_features(df, target_col, lags=(1, 7, 30)):
    # Assumes df is already sorted by time; modifies df in place
    for lag in lags:
        df[f'{target_col}_lag{lag}'] = df[target_col].shift(lag)
    return df
```

### Rolling stats

```python
# Windows include the current row; add .shift(1) first if 'amount' is the target
df['amt_roll_mean_7'] = df.groupby('user_id')['amount'].transform(
    lambda s: s.rolling(7, min_periods=1).mean()
)
# std is NaN for the first row of each user (a single observation)
df['amt_roll_std_7'] = df.groupby('user_id')['amount'].transform(
    lambda s: s.rolling(7, min_periods=1).std()
)
```

### Aggregation per group

```python
agg = (
    df.groupby('user_id')['amount']
    .agg(['mean', 'std', 'max', 'count'])
    .add_prefix('amount_')   # amount_mean, amount_std, ...
    .reset_index()
)
df = df.merge(agg, on='user_id', how='left')
```

### Interaction (cross)

```python
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_inter = poly.fit_transform(X[['age', 'income']])
```

### Hashing trick (high-card)

```python
from sklearn.feature_extraction import FeatureHasher

h = FeatureHasher(n_features=2**18, input_type='string')
# input_type='string' expects one iterable of string tokens per sample
X_hashed = h.transform([[str(uid)] for uid in df['user_id']])
```

### Featuretools (automated)

```python
import featuretools as ft

es = ft.EntitySet(id='shop')
es.add_dataframe(dataframe_name='orders', dataframe=orders_df,
                 index='order_id', time_index='date')
es.add_dataframe(dataframe_name='users', dataframe=users_df, index='user_id')
# Parent (users.user_id) -> child (orders.user_id)
es.add_relationship('users', 'user_id', 'orders', 'user_id')
features, defs = ft.dfs(entityset=es, target_dataframe_name='users',
                        agg_primitives=['mean', 'sum', 'count'])
```

### Feast feature store (production)

```python
from datetime import timedelta

from feast import BigQuerySource, Entity, FeatureStore, FeatureView, Field
from feast.types import Float32, Int64

# Definitions normally live in the feature repo and are registered with `feast apply`.
# Exact signatures vary across Feast versions; check the docs for the one you run.
user = Entity(name='user', join_keys=['user_id'])
user_features = FeatureView(
    name='user_features',
    entities=[user],
    ttl=timedelta(days=1),
    schema=[Field(name='ltv', dtype=Float32), Field(name='tenure', dtype=Int64)],
    source=BigQuerySource(table='proj.user_features'),
)

store = FeatureStore(repo_path='.')
features = store.get_online_features(
    features=['user_features:ltv'],
    entity_rows=[{'user_id': 1}],   # keyed by the entity's join key
).to_dict()
```

### LLM-as-feature

```python
def llm_sentiment(text, llm):
    # `llm` is a placeholder client; `classify` stands in for whatever call it exposes
    return llm.classify(text, ['positive', 'neutral', 'negative'])

df['llm_sentiment'] = df['review'].apply(lambda t: llm_sentiment(t, llm))
```

### Embedding feature

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
# One embedding vector per row, stored as an object column
df['title_emb'] = list(model.encode(df['title'].tolist()))
```

### Pipeline (sklearn)

```python
import xgboost as xgb
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), num_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
])
pipe = Pipeline([('prep', preprocessor), ('model', xgb.XGBClassifier())])
pipe.fit(X_train, y_train)
```

### Anti-leakage
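Under cross-validation the same rule applies per fold. One way to get that, sketched here on synthetic data, is to wrap the scaler and model in a pipeline so that the scaler is refit on each training fold. The data, the `LogisticRegression` model and the five-fold setting are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic features with very different scales, and a synthetic binary target.
X = rng.normal(size=(300, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])
y = (X[:, 0] + X[:, 1] / 10.0 > 0).astype(int)

# cross_val_score clones and refits the whole pipeline per fold, so the
# scaler's mean/std never see the rows of the fold being scored.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

A plain hold-out split follows the same rule: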
```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def split_then_fit(X, y):
    """Always split first, then fit the transformer on the training split only."""
    X_tr, X_val, y_tr, y_val = train_test_split(X, y)
    scaler = StandardScaler().fit(X_tr)   # fit on train only
    X_tr_s = scaler.transform(X_tr)
    X_val_s = scaler.transform(X_val)     # validation is transformed, never fitted
    return X_tr_s, X_val_s, y_tr, y_val
```

## Decision criteria

| Situation | Approach |
|---|---|
| Numerical + tree | Often raw OK |
| Numerical + linear | Scale (Standard) |
| Cardinality < 50 | One-hot |
| Cardinality > 50 | Target encode (OOF) or hash |
| Time series | Lag + rolling |
| Production | Feature store (Feast) |
| Auto | Featuretools / tsfresh |

**Default**: manual features + OOF target encoding + cyclic datetime + leakage prevention; in production, a feature store.

## 🔗 Graph

- Parent: [[Machine-Learning]] · [[Data-Preprocessing]]
- Variants: [[Target-Encoding]] · [[Embeddings]]
- Applications: [[Feature-Store]] · [[Feast]]
- Adjacent: [[Exploratory-Data-Analysis]]

## 🤖 LLM usage

**When**: tabular ML, time series, production systems.

**When not**: end-to-end deep learning (image, text).

## ❌ Anti-patterns

- **Fit on full data**: leakage.
- **Naive target encode**: leakage.
- **No cyclic datetime**: periodicity is left for the model to learn (RNN-only).
- **Skip feature store**: train / production skew.
- **Over-engineer for tree**: little gain.

## 🧪 Verification / Duplicates

- Verified (Kuhn Feature Engineering, Kaggle competitions, Feast docs).
- Trust level: A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-04-20 | Auto-reinforced |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: categorical / time / interaction + OOF / Featuretools / Feast code |