Antigravity Agent f8b21af4be Wiki cleanup: error-doc removal, dedup merge, link normalization
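Large-scale cleanup of 10_Wiki/Topics:
- Removed 227 error-capture / unfinished stub documents
- Merged 43 cross-folder duplicate clusters (63 files → redirect)
- Link-name normalization: fixed broken links, pointed links directly at redirect targets, mapped concepts (~2,400 changes)
- Created 6 new category MOCs
- Removed 10,058 unresolved related-keyword links from Graph sections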

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 23:52:15 +09:00

---
id: wiki-2026-0508-one-hot-encoding
title: One-Hot Encoding
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [One-Hot, OHE, Indicator-Encoding, Dummy-Encoding]
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: [feature-engineering, categorical, preprocessing, sklearn, pandas]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: sklearn-pandas
---
# One-Hot Encoding
## In one line
> **"Each categorical value → an orthogonal binary vector."** One-hot encoding expands K categories into K 0/1 columns; it is the simplest categorical → numeric transformation. It is a common default for linear and tree-based models, but for high-cardinality features it is usually replaced by target or hash encoding.
## Core concepts
### Definition
- category set `{A, B, C}` → vectors `(1,0,0), (0,1,0), (0,0,1)`.
- Unlike ordinal encoding (0,1,2), **no order is assumed**.
- Does not violate the numeric-distance assumption of linear / kernel models: every pair of distinct categories is equally far apart.
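
The definition above can be sketched in a few lines of NumPy. The `one_hot` helper is illustrative only (it is not a library function); the checks show that distinct categories map to orthogonal vectors that are all the same distance apart.

```python
import numpy as np

def one_hot(values, categories):
    """Map each value to a 0/1 row vector with a single 1 (illustrative helper)."""
    index = {c: i for i, c in enumerate(categories)}
    out = np.zeros((len(values), len(categories)), dtype=int)
    for row, v in enumerate(values):
        out[row, index[v]] = 1
    return out

cats = ["A", "B", "C"]
vecs = one_hot(cats, cats)  # rows: (1,0,0), (0,1,0), (0,0,1)

# Distinct categories are orthogonal: off-diagonal dot products are 0.
gram = vecs @ vecs.T

# Every pair is the same Euclidean distance apart (sqrt(2)), so no category
# is "closer" to another, unlike the ordinal codes 0, 1, 2.
dist_ab = np.linalg.norm(vecs[0] - vecs[1])
dist_ac = np.linalg.norm(vecs[0] - vecs[2])
```
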
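### Dummy variable trap
- K columns → 1 is redundant (the columns sum to 1, which is collinear with an intercept).
- Unregularized linear regression → drop one column (`drop_first=True` in pandas, `drop="first"` in sklearn).
- Tree / regularized models (Lasso, Ridge) → all K columns can be kept.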
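
The collinearity can be checked numerically. This is a minimal sketch on a made-up 6-row design matrix: with an intercept plus all K one-hot columns the matrix loses rank, and dropping one column restores full column rank.

```python
import numpy as np

# Toy data: 6 rows, 3 categories encoded as integer codes (illustrative values).
codes = np.array([0, 1, 2, 0, 1, 2])
full = np.eye(3)[codes]                 # full one-hot, K = 3 columns
intercept = np.ones((len(codes), 1))

X_full = np.hstack([intercept, full])         # intercept + K columns
X_drop = np.hstack([intercept, full[:, 1:]])  # intercept + K-1 columns

# The K one-hot columns sum to the intercept column, so X_full is rank-deficient.
rank_full = np.linalg.matrix_rank(X_full)  # 3, although X_full has 4 columns
rank_drop = np.linalg.matrix_rank(X_drop)  # 3, equal to its column count
```
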
### Cardinality problems
- High cardinality (>50): the column count explodes (a very wide sparse matrix), with leakage risk.
- Alternatives: target / mean encoding, hashing trick, embedding.
### Applications
1. Categorical preprocessing for tabular ML.
2. NLP token → vocabulary vector (sparse).
3. Discrete encoding of RL action / state spaces.
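
For the NLP case, it may help to see why explicit one-hot token vectors are rarely materialized: multiplying a one-hot row by a weight matrix selects a single row of that matrix, which is what an embedding lookup does. The vocabulary and weights below are made up for illustration.

```python
import numpy as np

vocab = ["<unk>", "cat", "dog", "fish"]   # hypothetical toy vocabulary
token_to_id = {tok: i for i, tok in enumerate(vocab)}

rng = np.random.default_rng(0)
W = rng.normal(size=(len(vocab), 2))      # one weight row per token

ids = np.array([token_to_id[t] for t in ["dog", "cat", "dog"]])
one_hot = np.eye(len(vocab))[ids]         # shape (3, 4), a single 1 per row

via_matmul = one_hot @ W                  # explicit one-hot times weights
via_lookup = W[ids]                       # plain row indexing (embedding lookup)
```

Both paths give the same result, but the lookup never builds the vocabulary-sized vectors.
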
## 💻 Patterns
### sklearn OneHotEncoder
```python
from sklearn.preprocessing import OneHotEncoder
import numpy as np
X = np.array([["red"], ["blue"], ["green"], ["red"]])
enc = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
enc.fit(X)
print(enc.transform([["red"], ["yellow"]]))
# [[0. 0. 1.]
# [0. 0. 0.]] <- unknown -> all zeros
print(enc.get_feature_names_out()) # ['x0_blue' 'x0_green' 'x0_red']
```
### pandas get_dummies
```python
import pandas as pd
df = pd.DataFrame({"color": ["red", "blue", "green", "red"]})
ohe = pd.get_dummies(df, columns=["color"], drop_first=True, dtype=int)
#    color_green  color_red
# 0            0          1
# 1            0          0
# 2            1          0
# 3            0          1
```
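
One caveat with `get_dummies` is that it has no fit/transform split, so train and test frames can end up with different columns. A minimal sketch of one way to align them, using made-up frames: reindex the test frame onto the training columns.

```python
import pandas as pd

train = pd.DataFrame({"color": ["red", "blue", "green"]})
test = pd.DataFrame({"color": ["red", "yellow"]})  # "yellow" never seen in training

train_ohe = pd.get_dummies(train, columns=["color"], dtype=int)
test_ohe = pd.get_dummies(test, columns=["color"], dtype=int)

# Force the test frame onto the training columns: columns missing from the
# test frame are filled with 0, and columns unknown to training
# (color_yellow) are dropped.
test_aligned = test_ohe.reindex(columns=train_ohe.columns, fill_value=0)
```

The unseen "yellow" row becomes all zeros, which matches the behavior of `OneHotEncoder(handle_unknown="ignore")`.
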
### ColumnTransformer (production pipeline)
```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
# Assumes X_train is a DataFrame with these four columns and y_train holds the labels.
pre = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city", "plan"]),
])
pipe = Pipeline([("pre", pre), ("clf", LogisticRegression(max_iter=1000))])
pipe.fit(X_train, y_train)
```
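
To try the pipeline end to end, here is a self-contained sketch with a tiny made-up dataset. All values are hypothetical; only the column names follow the pattern above. It also shows that a city unseen at fit time does not raise an error at prediction time.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data using the column names from the pattern above.
X_train = pd.DataFrame({
    "age": [25, 40, 33, 58, 47, 29],
    "income": [30_000, 82_000, 51_000, 95_000, 67_000, 38_000],
    "city": ["seoul", "busan", "seoul", "daegu", "busan", "seoul"],
    "plan": ["free", "pro", "free", "pro", "pro", "free"],
})
y_train = [0, 1, 0, 1, 1, 0]

pre = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city", "plan"]),
])
pipe = Pipeline([("pre", pre), ("clf", LogisticRegression(max_iter=1000))])
pipe.fit(X_train, y_train)

# "incheon" was never seen during fit; its city columns become all zeros
# instead of raising an error.
X_new = pd.DataFrame({"age": [36], "income": [60_000],
                      "city": ["incheon"], "plan": ["pro"]})
pred = pipe.predict(X_new)

# 2 scaled numeric columns + 3 city columns + 2 plan columns.
n_features = pipe.named_steps["pre"].transform(X_new).shape[1]
```
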
### Sparse output for high-cardinality features
```python
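from sklearn.preprocessing import OneHotEncoder
# Assumes df is a DataFrame with a high-cardinality "zip_code" column.
enc = OneHotEncoder(sparse_output=True, handle_unknown="ignore")
X_sparse = enc.fit_transform(df[["zip_code"]])  # e.g. ~40k sparse columns
# Result is a SciPy CSR sparse matrix: only non-zero entries are stored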
```
### vs label encoding (decision)
```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
# DON'T: feed LabelEncoder output to a linear model as a feature
# (LabelEncoder is meant for target labels, not input features)
le = LabelEncoder()
codes = le.fit_transform(["red", "blue", "green"])  # [2, 0, 1]: alphabetical codes imply a fake order
# DO: OrdinalEncoder with explicit categories when the order is real
oe = OrdinalEncoder(categories=[["low", "med", "high"]])
```
### Frequency / target encoding (high-cardinality alternative)
```python
import category_encoders as ce
te = ce.TargetEncoder(cols=["city"], smoothing=10)
X_tr = te.fit_transform(X_train, y_train)
X_te = te.transform(X_test)
```
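
Frequency encoding needs no extra library. A minimal pandas sketch with made-up data: each category is replaced by its relative frequency in the training split, and mapping unseen categories to 0.0 is an assumption of this sketch, not a fixed rule.

```python
import pandas as pd

train = pd.DataFrame({"city": ["seoul", "seoul", "busan", "seoul", "daegu", "busan"]})
test = pd.DataFrame({"city": ["busan", "incheon"]})  # "incheon" is unseen

# Relative frequency of each category, learned from the training split only.
freq = train["city"].value_counts(normalize=True)

train_freq = train["city"].map(freq)

# Unseen categories have no entry in `freq`; map() yields NaN for them,
# which is replaced by 0.0 here.
test_freq = test["city"].map(freq).fillna(0.0)
```
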
### Hashing trick (constant memory)
```python
from sklearn.feature_extraction import FeatureHasher
h = FeatureHasher(n_features=256, input_type="string")
X_h = h.transform([["zip=" + z] for z in df["zip_code"]])
```
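
Two properties of the hashing trick are easy to verify: the output width is fixed regardless of how many distinct values appear, and the transform is stateless, so unseen values still encode without a fit step. The zip codes below are made up.

```python
from sklearn.feature_extraction import FeatureHasher

h = FeatureHasher(n_features=256, input_type="string")

# Hypothetical zip codes; no fit step is needed because the hasher is stateless.
seen = h.transform([["zip=10001"], ["zip=94105"]])
unseen = h.transform([["zip=99999"]])   # a never-seen value still encodes
again = h.transform([["zip=10001"]])    # same input, same hashed row
```

The trade-off is that different values can collide in the same column, and the mapping cannot be inverted back to category names.
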
## Decision criteria
| cardinality | model | encoding |
|---|---|---|
| <10 | any | one-hot |
| 10–50 | linear / NN | one-hot or embedding |
| 50–1000 | tree | target / frequency |
| >1000 | any | hashing / embedding |
| ordinal | any | OrdinalEncoder |

**Default**: `OneHotEncoder(handle_unknown="ignore")` in ColumnTransformer.
## 🔗 Graph
- Parent: [[Feature Engineering|Feature-Engineering]]
- Variant: [[Target-Encoding]]
- Application: [[Logistic-Regression-Foundations|Logistic-Regression]]
- Adjacent: [[Sparse-Matrix]] · [[Curse-of-Dimensionality]]
## 🤖 LLM usage
**When to use**: quick prototypes, low-cardinality categoricals, linear / tree baselines.
**When not to use**: high cardinality (>1000), text tokens (use embeddings), online learning with new categories.
## ❌ Anti-patterns
- **Train-only fit**: crashes on categories in the test set that were unseen during fit → `handle_unknown="ignore"`.
- **Drop-first with regularized model**: unnecessary information loss.
- **OHE on high-cardinality without sparse**: memory blowup.
- **LabelEncoder for features**: forces a fake ordinal relationship and breaks linear models.
- **Leak via target encoding without fold**: K-fold (out-of-fold) fitting is required when using target encoding.
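
The last anti-pattern can be avoided with out-of-fold encoding. The sketch below uses plain pandas plus `KFold` on made-up data, with no smoothing, and is one possible way to do it rather than a reference implementation: each row's encoded value comes from category means computed on the other folds only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical toy data: category "c" appears only once (row 6).
df = pd.DataFrame({
    "city": ["a", "a", "b", "b", "a", "b", "c", "a"],
    "y":    [1,   0,   1,   1,   1,   0,   1,   0],
})

oof = np.full(len(df), np.nan)
kf = KFold(n_splits=4, shuffle=True, random_state=0)

for fit_idx, enc_idx in kf.split(df):
    fit_part = df.iloc[fit_idx]
    # Category means come only from the other folds, so a row's own
    # target never contributes to its encoded value.
    means = fit_part.groupby("city")["y"].mean()
    # Categories absent from the fitting folds fall back to the fold's mean target.
    fallback = fit_part["y"].mean()
    oof[enc_idx] = df["city"].iloc[enc_idx].map(means).fillna(fallback).to_numpy()

df["city_te"] = oof
```

Row 6 is the only "c" row with target 1; a naive in-sample mean would encode it as exactly 1.0, while the out-of-fold value is the fallback mean of the other folds.
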
## 🧪 Verification / duplicates
- Verified (sklearn 1.4 docs, pandas 2.2 docs).
- Trust level A.
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: sklearn/pandas patterns + cardinality decision matrix |