Antigravity Agent f8b21af4be Wiki cleanup: error-doc removal, dedup merge, link normalization
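Large-scale cleanup of 10_Wiki/Topics:
- Removed 227 error-capture / unfinished stub documents
- Merged 43 cross-folder duplicate clusters (63 files → redirect)
- Link-name normalization: fixed broken links, pointed links directly at redirect targets, and mapped concepts (~2,400 changes)
- Created 6 new category MOCs
- Removed 10,058 unresolved related-keyword links from Graph sections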

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 23:52:15 +09:00

---
id: wiki-2026-0508-principle-component-analysis
title: Principal Component Analysis
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [PCA, Karhunen-Loeve Transform, Principle Component Analysis]
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: [linear-algebra, dimensionality-reduction, unsupervised, statistics]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: scikit-learn / NumPy / PyTorch
---
# Principal Component Analysis
## One-line summary
> **"Orthogonal axes of maximum variance: the eigendecomposition of the covariance matrix, equivalent to the SVD of the centered data."** The statistical foundations come from Pearson (1901) and Hotelling (1933); as of 2026 PCA is still the default linear dimensionality-reduction baseline, even though t-SNE/UMAP are preferred for visualization. Note: the correct spelling is **Principal** (not "Principle"); the misspelled alias is kept for findability.
## Core concepts
### Mathematical definition
- Center data: X_c = X - mean(X).
- Covariance: C = X_c^T X_c / (n-1).
- Eigendecompose C = V Λ V^T; columns of V are principal axes.
- Project: Z = X_c V_k (top k components).
- Equivalent: SVD X_c = U Σ V^T → V same; singular values σ_i = sqrt((n-1) λ_i).
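
The equivalence between the two routes above can be checked numerically. The following is a minimal sketch on synthetic data (the data shape, the random seed, and all variable names are assumptions for illustration), comparing the eigendecomposition of the covariance matrix with the SVD of the centered data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data (an assumption for illustration): 200 samples, 5 correlated features.
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
n = X.shape[0]

# Route 1: eigendecomposition of the covariance matrix.
Xc = X - X.mean(axis=0)
C = Xc.T @ Xc / (n - 1)
eigvals, V = np.linalg.eigh(C)        # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]     # reorder to descending variance
eigvals, V = eigvals[order], V[:, order]

# Route 2: SVD of the centered data.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Singular values and eigenvalues: sigma_i = sqrt((n - 1) * lambda_i).
sv_match = np.allclose(s, np.sqrt((n - 1) * eigvals))

# Principal axes agree up to a per-component sign flip, so compare magnitudes.
axes_match = np.allclose(np.abs(Vt), np.abs(V.T), atol=1e-6)

print(sv_match, axes_match)
```

The axes are compared by absolute value because each component's sign is arbitrary; the SVD route avoids forming `C` explicitly, which is why it is usually preferred numerically.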
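### Properties
- **Orthogonal**: the principal axes are mutually orthogonal, so the projected scores are uncorrelated.
- **Variance-ordered**: PC1 explains the most variance, and each later component explains the most of what remains.
- **Linear**: cannot capture curved manifolds (use kernel PCA / UMAP).
- **Rotation-invariant**: rotating or relabeling the input axes leaves the explained variances and projected scores unchanged (up to sign); only the component coordinates rotate along.
- **Scale-sensitive**: features with larger numeric ranges dominate the components, so standardize first if scales differ.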
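
To make the scale sensitivity concrete, here is a small sketch on two independent synthetic features whose units differ by a factor of 1000 (the data and the unit choice are hypothetical). It compares the explained variance ratio with and without `StandardScaler`:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two independent features on very different scales (hypothetical units).
X = np.column_stack([
    rng.normal(scale=1.0, size=500),     # e.g. a length in metres
    rng.normal(scale=1000.0, size=500),  # e.g. a length in millimetres
])

# Without scaling, the large-scale feature absorbs nearly all of the variance.
raw_ratio = PCA(n_components=2).fit(X).explained_variance_ratio_

# After standardization, both features contribute roughly equally.
X_std = StandardScaler().fit_transform(X)
std_ratio = PCA(n_components=2).fit(X_std).explained_variance_ratio_

print("raw:", raw_ratio)
print("standardized:", std_ratio)
```

On the raw data PC1 is essentially the large-scale feature; after standardization the two components split the variance close to evenly, which is what independent features should give.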
### Variants
- **Kernel PCA**: nonlinear via kernel trick (RBF, polynomial).
- **Sparse PCA**: L1-regularized loadings for interpretability.
- **Robust PCA**: low-rank + sparse decomposition for outliers.
- **Probabilistic PCA**: latent Gaussian model — gives MLE objective.
- **Incremental / online PCA**: streaming data.
- **Randomized SVD**: roughly O(n d k) for the top k components, versus O(n d^2) for a full SVD when n ≥ d.
### Modern usage (2026)
- **Embeddings analysis**: PCA on Claude / GPT-5 hidden states for interpretability (mech interp).
- **Whitening**: precondition before clustering, ICA, neural net training.
- **Compression**: still used in image / signal pipelines.
- **Data viz**: PCA → 50D, then UMAP/t-SNE → 2D (the standard combo).
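
The two-stage visualization combo can be sketched as follows. This example uses t-SNE from scikit-learn rather than UMAP so that it needs no extra package; the bundled digits dataset, the 300-sample subset, and the choice of 30 intermediate dimensions (the data only has 64 features) are assumptions made to keep the run short:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Small bundled dataset, subsampled so the example runs quickly (illustrative choice).
X, y = load_digits(return_X_y=True)
X, y = X[:300], y[:300]

# Step 1: linear reduction. 30 stands in for the usual ~50 because X has only 64 features.
X_pca = PCA(n_components=30, random_state=0).fit_transform(X)

# Step 2: nonlinear embedding down to 2D for plotting.
Z = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_pca)

print(X_pca.shape, Z.shape)
```

The PCA step mainly removes low-variance noise directions and cuts the cost of the neighbor search in the second stage; `Z` can then be passed to a scatter plot colored by `y`.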
## 💻 Patterns
### scikit-learn PCA
```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np
# X: array of shape (n_samples, n_features), assumed to be loaded already
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95) # keep 95% variance
Z = pca.fit_transform(X_std)
print(f"#components for 95% var: {pca.n_components_}")
print(f"explained variance ratio: {pca.explained_variance_ratio_}")
```
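### Manual PCA via SVD (numerically most stable)
```python
import numpy as np

def pca(X, k):
    """Return scores, principal axes, and explained variances for the top k components."""
    Xc = X - X.mean(axis=0)                          # center each feature
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                              # rows are principal axes
    explained_var = (s[:k] ** 2) / (X.shape[0] - 1)  # eigenvalues of the covariance
    Z = Xc @ components.T                            # project onto the top-k axes
    return Z, components, explained_var
```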
### Randomized SVD (fast for huge matrices)
```python
from sklearn.utils.extmath import randomized_svd
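# X_centered: mean-centered data matrix, assumed to be computed beforehand
U, s, Vt = randomized_svd(X_centered, n_components=50, random_state=42)
# Can be much faster than a full SVD when k << min(n, d); the speedup depends on matrix size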
```
### Kernel PCA (nonlinear)
```python
from sklearn.decomposition import KernelPCA
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.1)
Z = kpca.fit_transform(X)
```
### Incremental PCA (streaming)
```python
from sklearn.decomposition import IncrementalPCA
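ipca = IncrementalPCA(n_components=50, batch_size=1024)
for batch in stream:  # stream: any iterable of (batch_rows, n_features) arrays
    ipca.partial_fit(batch)  # each batch needs at least n_components rows
Z = ipca.transform(X_test)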
```
### Whitening before downstream model
```python
pca = PCA(whiten=True).fit(X_train)
X_train_w = pca.transform(X_train)
X_test_w = pca.transform(X_test)
# training features now have unit variance and zero correlation; test features only approximately
```
### PCA for interpreting transformer hidden states
```python
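import torch
from sklearn.decomposition import PCA

# model.encode is a placeholder for whatever returns a (B, D) tensor of hidden states
with torch.no_grad():
    hidden = model.encode(prompts)  # (B, D=4096)
pca = PCA(n_components=8)
Z = pca.fit_transform(hidden.float().cpu().numpy())
# The top components may correlate with sentiment / topic / refusal; verify against labels.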
```
### Reconstruction error (anomaly detection)
```python
pca = PCA(n_components=10).fit(X_train)
recon = pca.inverse_transform(pca.transform(X))
err = ((X - recon) ** 2).sum(axis=1)
anomalies = err > np.percentile(err, 99)
```
### Choosing k via scree plot / elbow
```python
import matplotlib.pyplot as plt
pca_full = PCA().fit(X_std)
plt.plot(np.cumsum(pca_full.explained_variance_ratio_))
plt.axhline(0.95, ls="--"); plt.xlabel("# components"); plt.ylabel("cumulative var")
```
## Decision criteria
| Situation | Approach |
|---|---|
| Linear dim-reduction baseline | PCA |
| Visualization to 2D | PCA→50D → UMAP→2D |
| Nonlinear manifold | Kernel PCA / UMAP / autoencoder |
| Streaming / huge data | IncrementalPCA / randomized SVD |
| Need interpretable loadings | Sparse PCA |
| Outliers in data | Robust PCA |
| Probabilistic / missing data | Probabilistic PCA / EM-PCA |
**Default**: StandardScaler → PCA(n_components=0.95) → downstream model.
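
The default chain can be wrapped in a scikit-learn `Pipeline` so that the scaler and PCA are fitted on training data only. This is a sketch under assumptions: the bundled breast-cancer dataset and `LogisticRegression` as the downstream model are illustrative choices, not part of the recommendation itself.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scaling and PCA statistics are learned from X_train only, then reused on X_test.
clf = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),             # keep enough components for 95% of the variance
    LogisticRegression(max_iter=1000),  # illustrative downstream model
)
clf.fit(X_train, y_train)

n_kept = clf.named_steps["pca"].n_components_
accuracy = clf.score(X_test, y_test)
print(f"components kept: {n_kept}, test accuracy: {accuracy:.3f}")
```

Keeping all three steps in one estimator also means cross-validation refits the scaler and PCA inside each fold, which avoids leaking test statistics into the projection.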
## 🔗 Graph
- Parent: [[Linear-Algebra-Foundations|Linear-Algebra]] · [[Dimensionality-Reduction]]
- Applications: [[Feature Engineering|Feature-Engineering]] · [[Anomaly-Detection]] · [[Mechanistic-Interpretability]]
- Adjacent: [[SVD]] · [[ICA]] · [[Factor-Analysis]] · [[Autoencoder]] · [[UMAP]]
## 🤖 LLM usage
**When**: linear dim-reduction, whitening, denoising, hidden-state analysis, baseline before ML model.
**When not**: nonlinear manifold (use UMAP/autoencoder), categorical-only data (use MCA), interpretable original features required (use feature selection).
## ❌ Anti-patterns
- **No standardization**: features with large scale dominate components.
- **PCA on labels-included data**: including the target column when fitting PCA leaks label information into a supervised pipeline.
- **Reading PC1 as "the cause"**: components are statistical, not causal.
- **PCA → tree models**: GBDT rarely benefits from rotated features, and the rotation hurts interpretability.
- **Forgetting sign ambiguity**: V and -V both valid; component direction is arbitrary.
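
One way to handle the sign ambiguity when comparing components across runs or libraries is to apply a fixed sign convention. The helper below is a hypothetical sketch (`fix_signs` is not a library function): it flips each component so that its largest-magnitude loading is positive.

```python
import numpy as np
from sklearn.decomposition import PCA

def fix_signs(components):
    """Flip each row so that its largest-magnitude loading is positive."""
    idx = np.argmax(np.abs(components), axis=1)
    signs = np.sign(components[np.arange(components.shape[0]), idx])
    return components * signs[:, None]

rng = np.random.default_rng(0)
# Synthetic correlated data (an assumption for illustration).
X = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))

comps = PCA(n_components=2).fit(X).components_
flipped = -comps  # an equally valid set of principal axes

# After applying the convention, both versions agree.
same_after_fix = np.allclose(fix_signs(comps), fix_signs(flipped))
print(same_after_fix)
```

scikit-learn applies its own deterministic sign convention internally, so this matters mostly when comparing against manual SVD code or other tools. If the scores `Z` are used downstream, flip the matching columns of `Z` with the same signs.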
## 🧪 Verification / duplicates
- Verified (Pearson 1901, Hotelling 1933, Jolliffe 2002 textbook, sklearn docs).
- Trust level: A.
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — canonical PCA reference + 2026 mech interp use |