---
id: wiki-2026-0508-principle-component-analysis
title: Principal Component Analysis
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [PCA, Karhunen-Loeve Transform, Principle Component Analysis]
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: [linear-algebra, dimensionality-reduction, unsupervised, statistics]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: scikit-learn / NumPy / PyTorch
---

# Principal Component Analysis

## One-line summary

> **"Orthogonal axes of maximum variance: eigendecomposition of the covariance matrix, equivalent to the SVD of the centered data."** The statistical foundation goes back to Pearson (1901) and Hotelling (1933). As of 2026, PCA is still the default linear dimensionality-reduction baseline, even though t-SNE/UMAP are preferred for visualization. Note: the name is spelled **Principal** (not "Principle"); the misspelled alias is kept for findability.

## Key points

### Mathematical definition

- Center the data: X_c = X - mean(X).
- Covariance: C = X_c^T X_c / (n-1).
- Eigendecompose C = V Λ V^T, with eigenvalues λ_i sorted in descending order; the columns of V are the principal axes.
- Project: Z = X_c V_k (top k components).
- Equivalent route: the SVD X_c = U Σ V^T gives the same V; the singular values satisfy σ_i = sqrt((n-1) λ_i).

### Properties

- **Orthogonal**: the principal axes are mutually orthogonal, and the projected scores are uncorrelated.
- **Variance-ordered**: the first PC explains the most variance, the second the most of what remains, and so on.
- **Linear**: cannot capture curved manifolds (use kernel PCA / UMAP).
- **Rotation-invariant**: rotating the input coordinates rotates the principal axes with them; the explained variances and the projected scores are unchanged.
- **Scale-sensitive**: standardize features first if their scales differ.

### Variants

- **Kernel PCA**: nonlinear via the kernel trick (RBF, polynomial).
- **Sparse PCA**: L1-regularized loadings for interpretability.
- **Robust PCA**: low-rank + sparse decomposition to handle outliers.
- **Probabilistic PCA**: latent Gaussian model, which gives a maximum-likelihood objective.
- **Incremental / online PCA**: for streaming data.
- **Randomized SVD**: roughly O(n d k) instead of O(n d^2) (for n ≥ d) when only the top-k components are needed.
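
The mathematical definition above (centering, covariance, eigendecomposition, and the SVD equivalence) can be checked numerically. The following is a minimal sketch on synthetic data; the random matrix, the seed, and the choice k = 2 are illustrative assumptions, not part of the original note.

```python
import numpy as np

rng = np.random.default_rng(0)  # synthetic data, purely illustrative
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
n = X.shape[0]

# Center the data, then form the sample covariance matrix.
Xc = X - X.mean(axis=0)
C = Xc.T @ Xc / (n - 1)

# Route 1: eigendecomposition of C (eigh returns eigenvalues in ascending order).
eigvals, V = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], V[:, order]

# Route 2: SVD of the centered data.
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Singular values and eigenvalues agree: sigma_i = sqrt((n - 1) * lambda_i).
sv_match = np.allclose(s, np.sqrt((n - 1) * eigvals))

# Principal axes agree up to a per-component sign flip.
axes_match = np.allclose(np.abs(Vt), np.abs(V.T), atol=1e-6)

# Project onto the top k axes; the scores are uncorrelated, with variances lambda_1..lambda_k.
k = 2
Z = Xc @ V[:, :k]
scores_uncorrelated = np.allclose(np.cov(Z, rowvar=False), np.diag(eigvals[:k]))
```

Comparing absolute values is a shortcut that sidesteps the sign ambiguity of each axis; it assumes the eigenvalues are distinct, which holds almost surely for random data like this.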
### Modern usage (2026)

- **Embeddings analysis**: PCA on LLM hidden states (e.g., Claude / GPT-5) for interpretability (mechanistic interpretability).
- **Whitening**: preconditioning before clustering, ICA, or neural-net training.
- **Compression**: still used in image / signal pipelines.
- **Data viz**: PCA → 50D, then UMAP/t-SNE → 2D (a common combination).

## 💻 Patterns

### scikit-learn PCA

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np

# X: (n_samples, n_features) array, assumed to be defined by the caller
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)  # keep enough components for 95% of the variance
Z = pca.fit_transform(X_std)
print(f"#components for 95% var: {pca.n_components_}")
print(f"explained variance ratio: {pca.explained_variance_ratio_}")
```

### Manual PCA via SVD (numerically the most stable route)

```python
import numpy as np

def pca(X, k):
    Xc = X - X.mean(axis=0)                           # center each feature
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                               # rows are principal axes
    explained_var = (s[:k] ** 2) / (X.shape[0] - 1)   # lambda_i = sigma_i^2 / (n - 1)
    Z = Xc @ components.T                             # projected scores
    return Z, components, explained_var
```

### Randomized SVD (fast for huge matrices)

```python
from sklearn.utils.extmath import randomized_svd

X_centered = X - X.mean(axis=0)  # randomized_svd does not center the data itself
U, s, Vt = randomized_svd(X_centered, n_components=50, random_state=42)
# Often much faster than a full SVD when k << min(n, d); the speedup depends on the data.
```

### Kernel PCA (nonlinear)

```python
from sklearn.decomposition import KernelPCA

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.1)
Z = kpca.fit_transform(X)
```

### Incremental PCA (streaming)

```python
from sklearn.decomposition import IncrementalPCA

ipca = IncrementalPCA(n_components=50, batch_size=1024)
for batch in stream:         # stream: any iterable of (n_batch, n_features) arrays
    ipca.partial_fit(batch)  # each batch needs at least n_components rows
Z = ipca.transform(X_test)
```

### Whitening before a downstream model

```python
from sklearn.decomposition import PCA

pca = PCA(whiten=True).fit(X_train)  # fit on training data only
X_train_w = pca.transform(X_train)
X_test_w = pca.transform(X_test)
# Training features now have unit variance and zero correlation;
# test features only approximately, since the fit never saw them.
```

### PCA for interpreting transformer hidden states

```python
import torch
from sklearn.decomposition import PCA

# model.encode is a placeholder for whatever returns a (B, D) tensor of hidden states
hidden = model.encode(prompts)  # (B, D=4096)
pca = PCA(n_components=8)
Z = pca.fit_transform(hidden.detach().cpu().numpy())
# The top components can correlate with sentiment / topic / refusal.
```

### Reconstruction error (anomaly detection)

```python
import numpy as np
from sklearn.decomposition import PCA

pca = PCA(n_components=10).fit(X_train)
recon = pca.inverse_transform(pca.transform(X))
err = ((X - recon) ** 2).sum(axis=1)      # squared reconstruction error per sample
anomalies = err > np.percentile(err, 99)  # flag the top 1% as anomalies
```

### Choosing k via cumulative-variance plot / elbow

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

pca_full = PCA().fit(X_std)
plt.plot(np.cumsum(pca_full.explained_variance_ratio_))
plt.axhline(0.95, ls="--")
plt.xlabel("# components")
plt.ylabel("cumulative var")
plt.show()
```

## Decision criteria

| Situation | Approach |
|---|---|
| Linear dim-reduction baseline | PCA |
| Visualization to 2D | PCA→50D → UMAP→2D |
| Nonlinear manifold | Kernel PCA / UMAP / autoencoder |
| Streaming / huge data | IncrementalPCA / randomized SVD |
| Need interpretable loadings | Sparse PCA |
| Outliers in data | Robust PCA |
| Probabilistic / missing data | Probabilistic PCA / EM-PCA |

**Default**: StandardScaler → PCA(n_components=0.95) → downstream model.

## 🔗 Graph

- Parent: [[Linear-Algebra-Foundations|Linear-Algebra]] · [[Dimensionality-Reduction]]
- Applications: [[Feature Engineering|Feature-Engineering]] · [[Anomaly-Detection]] · [[Mechanistic-Interpretability]]
- Adjacent: [[SVD]] · [[ICA]] · [[Factor-Analysis]] · [[Autoencoder]] · [[UMAP]]

## 🤖 LLM usage

**When**: linear dim-reduction, whitening, denoising, hidden-state analysis, baseline before an ML model.

**When not**: nonlinear manifold (use UMAP/autoencoder), categorical-only data (use MCA), interpretable original features required (use feature selection).

## ❌ Anti-patterns

- **No standardization**: features with large scales dominate the components.
- **PCA on data that includes the labels**: causes leakage when the components feed a supervised pipeline; drop the target column and fit PCA on the training split only.
- **Reading PC1 as "the cause"**: components are statistical summaries, not causal factors.
- **PCA → tree models**: GBDT rarely benefits from a rotation of the features; it mostly hurts interpretability.
- **Forgetting sign ambiguity**: V and -V are both valid; the direction of each component is arbitrary.
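
The sign ambiguity in the last anti-pattern can be made concrete. Below is a minimal sketch that pins each component's sign with one possible convention (largest-magnitude entry positive); `fix_signs` is a hypothetical helper introduced for illustration, not a library function, and the synthetic data is an assumption.

```python
import numpy as np

def fix_signs(components):
    """Flip each component (row) so its largest-magnitude entry is positive."""
    components = np.asarray(components, dtype=float)
    idx = np.argmax(np.abs(components), axis=1)
    signs = np.sign(components[np.arange(components.shape[0]), idx])
    signs[signs == 0] = 1.0  # guard against an all-zero row
    return components * signs[:, None]

rng = np.random.default_rng(1)  # synthetic data with distinct feature scales
X = rng.normal(size=(100, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

a = fix_signs(Vt)
b = fix_signs(-Vt)  # the equally valid, fully flipped solution
same = np.allclose(a, b)
```

Both inputs collapse to the same components once a convention is applied. If scores are computed as `Xc @ components.T`, they pick up the same sign, so comparisons across runs or across models should fix the signs before comparing directions.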
## 🧪 Verification / duplicates

- Verified (Pearson 1901, Hotelling 1933, Jolliffe 2002 textbook, sklearn docs).
- Trust level: A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: canonical PCA reference + 2026 mech interp use |