---
id: wiki-2026-0508-principle-component-analysis
title: Principal Component Analysis
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [PCA, Karhunen-Loeve Transform, Principle Component Analysis]
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: [linear-algebra, dimensionality-reduction, unsupervised, statistics]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: scikit-learn / NumPy / PyTorch
---

# Principal Component Analysis

## One-line summary

> **"Orthogonal axes of maximum variance: the eigendecomposition of the covariance matrix, equivalent to the SVD of the centered data."** The statistical foundation comes from Pearson (1901) and Hotelling (1933); as of 2026 PCA is still the default linear dimensionality-reduction baseline, even though t-SNE and UMAP are preferred for visualization. Note: the name is spelled **Principal** (not "Principle"); the misspelling is kept as an alias for findability.

## Core concepts

### Mathematical definition

- Center the data: X_c = X - mean(X), with the mean taken per feature (column).
- Covariance: C = X_c^T X_c / (n-1).
- Eigendecompose C = V Λ V^T; the columns of V, sorted by decreasing eigenvalue, are the principal axes.
- Project: Z = X_c V_k, where V_k holds the top k columns of V.
- Equivalent: the SVD X_c = U Σ V^T yields the same V, with singular values σ_i = sqrt((n-1) λ_i).
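
The equivalence between the two routes can be checked numerically. The following is a minimal sketch on synthetic data (the data, seed, and variable names are assumptions made for illustration): it computes the principal axes once from the covariance eigendecomposition and once from the SVD, then compares them.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data (an assumption for illustration): 200 samples, 5 correlated features.
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
n = X.shape[0]

# Route 1: eigendecomposition of the covariance matrix.
Xc = X - X.mean(axis=0)
C = Xc.T @ Xc / (n - 1)
eigvals, V = np.linalg.eigh(C)        # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]     # reorder to descending variance
eigvals, V = eigvals[order], V[:, order]

# Route 2: SVD of the centered data.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Singular values and eigenvalues are linked by sigma_i = sqrt((n - 1) * lambda_i).
sigma_from_eig = np.sqrt((n - 1) * eigvals)

# The principal axes agree up to a sign flip per component.
axes_match = np.allclose(np.abs(V.T @ Vt.T), np.eye(5), atol=1e-6)
```

The absolute value in the last line is needed because each axis is only defined up to sign; the two routes may return opposite directions for the same component.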

### Properties

- **Orthogonal**: the principal axes are mutually orthogonal, and the projected scores are uncorrelated.
- **Variance-ordered**: the first PC explains the most variance, the second the most of what remains, and so on.
- **Linear**: cannot capture curved manifolds (use kernel PCA / UMAP).
- **Rotation-invariant**: rotating or relabeling the input axes rotates the principal axes with them, so the projected scores are the same up to sign.
- **Scale-sensitive**: standardize features first if their scales differ.
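
Two of these properties are easy to see on a toy example. The sketch below uses hypothetical data (two independent features recorded in very different units) and a small helper, `principal_axes`, that is not part of any library; it illustrates scale sensitivity and the uncorrelated scores.

```python
import numpy as np

def principal_axes(X):
    """Return (axes as rows, variances) via SVD of the centered data."""
    Xc = X - X.mean(axis=0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt, s**2 / (X.shape[0] - 1)

rng = np.random.default_rng(1)
# Hypothetical data: two independent features carrying similar information
# but recorded in very different units (say metres vs. millimetres).
a = rng.normal(size=500)
b = rng.normal(size=500)
X = np.column_stack([a, 1000.0 * b])

axes_raw, var_raw = principal_axes(X)
# Without scaling, PC1 points almost exactly along the large-unit feature.
pc1_weight_on_big_feature = abs(axes_raw[0, 1])

# After standardizing, neither feature dominates the variance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
axes_std, var_std = principal_axes(X_std)

# Scores are uncorrelated: their covariance matrix is diagonal.
Z = (X - X.mean(axis=0)) @ axes_raw.T
score_cov = np.cov(Z, rowvar=False)
```

On the raw data `var_raw` is dominated by a single component purely because of the unit choice, which is the reason for standardizing first when scales differ.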

### Variants

- **Kernel PCA**: nonlinear PCA via the kernel trick (RBF, polynomial).
- **Sparse PCA**: L1-regularized loadings for interpretability.
- **Robust PCA**: low-rank + sparse decomposition, with outliers absorbed by the sparse term.
- **Probabilistic PCA**: latent Gaussian model; gives a maximum-likelihood (MLE) objective.
- **Incremental / online PCA**: fits on streaming data in mini-batches.
- **Randomized SVD**: roughly O(n d k) instead of O(n d^2) for the top k components.
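
To make the probabilistic variant concrete, here is a minimal sketch of its closed-form maximum-likelihood fit (the solution usually attributed to Tipping and Bishop). The synthetic data, the dimensions `n`, `d`, `q`, and all variable names are assumptions for illustration; the loadings are only identified up to a rotation.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical data: 3 latent factors embedded in 6 dimensions plus isotropic noise.
n, d, q = 2000, 6, 3
W_true = rng.normal(size=(d, q))
X = rng.normal(size=(n, q)) @ W_true.T + 0.1 * rng.normal(size=(n, d))

Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / n                        # maximum-likelihood covariance (1/n)
eigvals, U = np.linalg.eigh(S)
eigvals, U = eigvals[::-1], U[:, ::-1]   # descending order

# Closed-form maximum-likelihood estimates:
sigma2 = eigvals[q:].mean()              # noise variance = mean discarded eigenvalue
W = U[:, :q] * np.sqrt(eigvals[:q] - sigma2)   # loadings, up to a rotation

# Covariance implied by the fitted model.
C_model = W @ W.T + sigma2 * np.eye(d)
```

The fitted model keeps the top `q` eigenvalues of the sample covariance and replaces the rest with their average, which is how it separates signal directions from isotropic noise.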

### Modern usage (2026)

- **Embeddings analysis**: PCA on LLM hidden states (e.g. Claude / GPT-5) for mechanistic interpretability.
- **Whitening**: preconditioning before clustering, ICA, or neural-net training.
- **Compression**: still used in image / signal pipelines.
- **Data viz**: PCA → 50D, then UMAP/t-SNE → 2D (a common combination).

## 💻 Patterns

### scikit-learn PCA

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np

# X: array of shape (n_samples, n_features), assumed to be defined
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)  # keep 95% variance
Z = pca.fit_transform(X_std)
print(f"#components for 95% var: {pca.n_components_}")
print(f"explained variance ratio: {pca.explained_variance_ratio_}")
```

### Manual PCA via SVD (numerically most stable)

```python
import numpy as np

def pca(X, k):
    Xc = X - X.mean(0)                                 # center each feature
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                                # top-k principal axes (rows)
    explained_var = (s[:k] ** 2) / (X.shape[0] - 1)    # eigenvalues of the covariance
    Z = Xc @ components.T                              # projected scores
    return Z, components, explained_var
```

### Randomized SVD (fast for huge matrices)

```python
from sklearn.utils.extmath import randomized_svd

X_centered = X - X.mean(axis=0)  # randomized_svd does not center the data itself
U, s, Vt = randomized_svd(X_centered, n_components=50, random_state=42)
# typically much faster than a full SVD when k << min(n, d); the speedup depends on the data
```

### Kernel PCA (nonlinear)

```python
from sklearn.decomposition import KernelPCA
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.1)
Z = kpca.fit_transform(X)
```
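
### Incremental PCA (streaming)

```python
from sklearn.decomposition import IncrementalPCA

# batch_size only applies to fit(); partial_fit() uses whatever batch it is given
ipca = IncrementalPCA(n_components=50, batch_size=1024)
for batch in stream:  # `stream` is a placeholder iterable of (n_batch, n_features) arrays
    ipca.partial_fit(batch)  # each batch needs at least n_components rows
Z = ipca.transform(X_test)
```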

### Whitening before downstream model

```python
from sklearn.decomposition import PCA

pca = PCA(whiten=True).fit(X_train)  # fit on the training split only
X_train_w = pca.transform(X_train)
X_test_w = pca.transform(X_test)
# the PCA scores now have unit variance and zero correlation on the training set
```
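
What whitening does can be spelled out without scikit-learn. This is a minimal NumPy sketch, assuming synthetic correlated data and a hypothetical helper `whiten`; it is not the library's implementation, only the same idea written by hand.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical correlated training data and a held-out split.
mix = rng.normal(size=(4, 4))
X_train = rng.normal(size=(1000, 4)) @ mix
X_test = rng.normal(size=(200, 4)) @ mix

# Fit on the training split only: mean, axes, and per-component spread.
mean = X_train.mean(axis=0)
_, s, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
std = s / np.sqrt(X_train.shape[0] - 1)   # standard deviation along each principal axis

def whiten(X):
    """Project onto the principal axes and rescale each score to unit variance."""
    return (X - mean) @ Vt.T / std

X_train_w = whiten(X_train)
X_test_w = whiten(X_test)

# On the training split the covariance is the identity (up to rounding);
# on held-out data it is only approximately so.
train_cov = np.cov(X_train_w, rowvar=False)
```

Reusing the training mean and axes for the test split is the point of the pattern: refitting on the test data would leak information into the evaluation.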

### PCA for interpreting transformer hidden states

```python
import torch
from sklearn.decomposition import PCA

# `model` and `prompts` are placeholders for an encoder and a batch of inputs
hidden = model.encode(prompts)  # (B, D=4096)
pca = PCA(n_components=8)
Z = pca.fit_transform(hidden.cpu().numpy())
# The top component often correlates with sentiment / topic / refusal; check before interpreting.
```

### Reconstruction error (anomaly detection)

```python
pca = PCA(n_components=10).fit(X_train)
recon = pca.inverse_transform(pca.transform(X))  # project to 10D and back
err = ((X - recon) ** 2).sum(axis=1)             # squared reconstruction error per sample
anomalies = err > np.percentile(err, 99)         # flag the worst-reconstructed 1%
```

### Choosing k via scree plot / elbow

```python
import matplotlib.pyplot as plt

pca_full = PCA().fit(X_std)
plt.plot(np.cumsum(pca_full.explained_variance_ratio_))
plt.axhline(0.95, ls="--")
plt.xlabel("# components")
plt.ylabel("cumulative var")
```
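
The same threshold can be read off numerically instead of from the plot. A minimal NumPy sketch follows; the function name `components_for_variance` and the toy data with fixed variance shares are assumptions for illustration.

```python
import numpy as np

def components_for_variance(X, target=0.95):
    """Smallest k whose cumulative explained-variance ratio reaches `target`."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    ratio = s**2 / np.sum(s**2)
    cumulative = np.cumsum(ratio)
    # First index where the cumulative ratio is >= target, converted to a count.
    k = int(np.searchsorted(cumulative, target) + 1)
    return min(k, cumulative.size), cumulative

rng = np.random.default_rng(4)
G = rng.normal(size=(300, 4))
Q, _ = np.linalg.qr(G - G.mean(axis=0))     # orthonormal, zero-mean columns
X = Q * np.sqrt([90.0, 6.0, 3.0, 1.0])      # variance shares of 90%, 6%, 3%, 1%
k, cumulative = components_for_variance(X, target=0.95)
```

The `min` guard covers the case where rounding leaves the final cumulative value just below the target.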

## Decision criteria

| Situation | Approach |
|---|---|
| Linear dim-reduction baseline | PCA |
| Visualization to 2D | PCA → 50D, then UMAP → 2D |
| Nonlinear manifold | Kernel PCA / UMAP / autoencoder |
| Streaming / huge data | IncrementalPCA / randomized SVD |
| Need interpretable loadings | Sparse PCA |
| Outliers in data | Robust PCA |
| Probabilistic / missing data | Probabilistic PCA / EM-PCA |

**Default**: StandardScaler → PCA(n_components=0.95) → downstream model.
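
One way to wire up this default is a scikit-learn pipeline, which keeps the scaler and PCA fitted on the training split only. The sketch below is illustrative: the synthetic data and the choice of `LogisticRegression` as the downstream model are assumptions, not part of the recommendation.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
# Hypothetical data: 20 features, with a label that depends on the first two.
X = rng.normal(size=(400, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The scaler and PCA are fitted inside the pipeline, on the training split only.
model = make_pipeline(StandardScaler(), PCA(n_components=0.95), LogisticRegression())
model.fit(X_train, y_train)

n_kept = model.named_steps["pca"].n_components_   # components kept for 95% variance
test_accuracy = model.score(X_test, y_test)
```

Wrapping the steps in one estimator also means cross-validation refits the scaler and PCA per fold, which avoids the leakage described under the anti-patterns.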

## 🔗 Graph

- Parent: [[Linear-Algebra-Foundations|Linear-Algebra]] · [[Dimensionality-Reduction]]
- Applications: [[Feature Engineering|Feature-Engineering]] · [[Anomaly-Detection]] · [[Mechanistic-Interpretability]]
- Adjacent: [[SVD]] · [[ICA]] · [[Factor-Analysis]] · [[Autoencoder]] · [[UMAP]]

## 🤖 LLM usage

**When to use**: linear dim-reduction, whitening, denoising, hidden-state analysis, baseline before an ML model.
**When not to use**: nonlinear manifolds (use UMAP/autoencoder), categorical-only data (use MCA), cases where the original features must stay interpretable (use feature selection).

## ❌ Anti-patterns

- **No standardization**: features with a large scale dominate the components.
- **PCA on data that includes the labels**: leaks the target when the result feeds a supervised pipeline.
- **Reading PC1 as "the cause"**: components are statistical summaries, not causal factors.
- **PCA → tree models**: GBDT rarely benefits from the rotation, and it costs interpretability.
- **Forgetting sign ambiguity**: V and -V are both valid; the direction of each component is arbitrary.
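
The sign ambiguity in the last item can be handled by fixing a convention before comparing components across runs. This is a minimal sketch of one possible convention (make the largest-magnitude loading of each component positive); the helper `fix_signs` is hypothetical and not a library function.

```python
import numpy as np

def fix_signs(components):
    """Flip each component (row) so its largest-magnitude loading is positive."""
    idx = np.argmax(np.abs(components), axis=1)
    signs = np.sign(components[np.arange(components.shape[0]), idx])
    signs[signs == 0] = 1.0
    return components * signs[:, None]

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 3))
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

# Flipping any component's sign gives an equally valid solution ...
Vt_flipped = Vt * np.array([[-1.0], [1.0], [-1.0]])
# ... but both map to the same representative once the convention is applied.
same = np.allclose(fix_signs(Vt), fix_signs(Vt_flipped))
```

If the scores Z are used downstream, they need the same per-component flips so that Z and the components stay consistent.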

## 🧪 Verification / duplicates

- Verified against Pearson 1901, Hotelling 1933, the Jolliffe 2002 textbook, and the sklearn docs.
- Trust level: A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: canonical PCA reference + 2026 mech interp use |