Antigravity Agent f8b21af4be Wiki cleanup: error-doc removal, dedup merge, link normalization
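10_Wiki/Topics large-scale cleanup:
- Removed 227 error-capture / unfinished stub documents
- Merged 43 cross-folder duplicate clusters (63 files → redirect)
- Link-name normalization: fixed broken links, pointed links directly at redirect targets, mapped concepts (~2,400 changes)
- Created 6 new category MOCs
- Removed 10,058 unresolved related-keyword links from Graph sections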

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 23:52:15 +09:00


id: wiki-2026-0508-principle-component-analysis
title: Principal Component Analysis
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: PCA, Karhunen-Loeve Transform, Principle Component Analysis
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: linear-algebra, dimensionality-reduction, unsupervised, statistics
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: scikit-learn / NumPy / PyTorch

Principal Component Analysis

One-line summary

"Orthogonal axes of maximum variance: the eigendecomposition of the covariance matrix, equivalent to the SVD of the centered data." Pearson (1901) and Hotelling (1933) laid the statistical foundation; as of 2026 PCA is still the default linear dimensionality-reduction baseline, even though t-SNE and UMAP are preferred for visualization. Note: the correct spelling is "Principal" (not "Principle"); the misspelling is kept as an alias for findability.

Core

Mathematical definition

  • Center data: X_c = X - mean(X).
  • Covariance: C = X_c^T X_c / (n-1).
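  • Eigendecompose C = V Λ V^T with eigenvalues sorted in decreasing order; the columns of V are the principal axes.
  • Project: Z = X_c V_k (V_k = first k columns of V, i.e. the top k components).
  • Equivalent: SVD X_c = U Σ V^T gives the same V (up to the sign of each column); singular values σ_i = sqrt((n-1) λ_i).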
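
The steps above can be checked numerically. The following is a minimal NumPy sketch, assuming a small synthetic dataset (the shapes, seed, and per-feature scales are illustrative choices, not from the original note). It runs both routes and compares them.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 5, 2
# Synthetic data with a different variance per feature (illustrative only).
X = rng.normal(size=(n, d)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

# Route 1: eigendecomposition of the covariance matrix.
Xc = X - X.mean(axis=0)
C = Xc.T @ Xc / (n - 1)
eigvals, eigvecs = np.linalg.eigh(C)      # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]         # re-sort to descending
eigvals, V = eigvals[order], eigvecs[:, order]
Z_eig = Xc @ V[:, :k]

# Route 2: SVD of the centered data.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
Z_svd = Xc @ Vt[:k].T

# sigma_i = sqrt((n - 1) * lambda_i)
sigma_matches = np.allclose(s, np.sqrt((n - 1) * eigvals))
# Scores agree up to a sign flip per component, so compare absolute values.
scores_match = np.allclose(np.abs(Z_eig), np.abs(Z_svd))
print(sigma_matches, scores_match)
```

In practice the SVD route is usually preferred because it never forms X_c^T X_c explicitly, which would square the condition number of the problem.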

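Properties

  • Orthogonal: the principal axes are mutually orthogonal, and the projected scores are uncorrelated.
  • Variance-ordered: the first PC explains the most variance; each later PC explains the most of what remains.
  • Linear: cannot capture curved manifolds (use kernel PCA / UMAP).
  • Rotation-equivariant: rotating the input coordinate system rotates the principal axes with it and leaves the scores unchanged (up to sign).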
  • Scale-sensitive: standardize features first if scales differ.
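
A small sketch of two of these properties, uncorrelated scores and scale sensitivity, assuming two correlated synthetic features (the helper name `pca_axes` and the factor 1000 are illustrative assumptions):

```python
import numpy as np

def pca_axes(X):
    """Return (explained variances, principal axes as rows) via SVD."""
    Xc = X - X.mean(axis=0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return s**2 / (X.shape[0] - 1), Vt

rng = np.random.default_rng(1)
n = 500
a = rng.normal(size=n)
# Two correlated features on a similar scale (illustrative data).
X = np.column_stack([a + 0.3 * rng.normal(size=n),
                     a + 0.3 * rng.normal(size=n)])

var, Vt = pca_axes(X)
Z = (X - X.mean(axis=0)) @ Vt.T
score_cov = np.cov(Z, rowvar=False)       # off-diagonal entries are ~0

# Rescale the second feature (think metres -> millimetres): PC1 swings onto that axis.
X_scaled = X * np.array([1.0, 1000.0])
_, Vt_scaled = pca_axes(X_scaled)

# Standardizing removes the dependence on units.
X_std = (X_scaled - X_scaled.mean(axis=0)) / X_scaled.std(axis=0)
_, Vt_std = pca_axes(X_std)

print("PC1 original    :", np.round(Vt[0], 3))
print("PC1 rescaled    :", np.round(Vt_scaled[0], 3))
print("PC1 standardized:", np.round(Vt_std[0], 3))
```

After rescaling, PC1 is almost entirely the second feature; after standardizing, both features carry equal weight again.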

Variants

  • Kernel PCA: nonlinear via kernel trick (RBF, polynomial).
  • Sparse PCA: L1-regularized loadings for interpretability.
  • Robust PCA: low-rank + sparse decomposition for outliers.
  • Probabilistic PCA: latent Gaussian model, which gives PCA a maximum-likelihood objective.
  • Incremental / online PCA: streaming data.
  • Randomized SVD: roughly O(n d k) for the top k components, versus O(n d^2) for a full SVD (assuming n ≥ d).
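
For the probabilistic variant, the maximum-likelihood solution (usually attributed to Tipping and Bishop) can be written in the notation used above. This is a reference sketch under two stated assumptions: λ_j are the eigenvalues of the sample covariance with 1/n normalization (the MLE convention, not the 1/(n-1) used earlier), and R is an arbitrary k × k orthogonal matrix.

```latex
\begin{aligned}
x &= W z + \mu + \epsilon, \qquad z \sim \mathcal{N}(0, I_k), \quad \epsilon \sim \mathcal{N}(0, \sigma^2 I_d) \\
\sigma^2_{\mathrm{ML}} &= \frac{1}{d-k} \sum_{j=k+1}^{d} \lambda_j \\
W_{\mathrm{ML}} &= V_k \left( \Lambda_k - \sigma^2_{\mathrm{ML}} I_k \right)^{1/2} R
\end{aligned}
```

The noise variance is the average of the discarded eigenvalues, and as σ² → 0 the columns of W span the same subspace as the top k principal axes.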

Modern usage (2026)

  • Embeddings analysis: PCA on Claude / GPT-5 hidden states for interpretability (mech interp).
  • Whitening: precondition before clustering, ICA, neural net training.
  • Compression: still used in image / signal pipelines.
  • Data viz: PCA → 50D, then UMAP/t-SNE → 2D (the standard combo).

💻 Patterns

scikit-learn PCA

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np

X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)  # keep 95% variance
Z = pca.fit_transform(X_std)
print(f"#components for 95% var: {pca.n_components_}")
print(f"explained variance ratio: {pca.explained_variance_ratio_}")

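Manual PCA via SVD (numerically the most stable route)

import numpy as np

def pca_svd(X, k):
    """Return (scores, components, explained variance) for the top k PCs."""
    Xc = X - X.mean(axis=0)                            # center each feature
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                                # rows are principal axes
    explained_var = (s[:k] ** 2) / (X.shape[0] - 1)    # eigenvalues of the covariance
    Z = Xc @ components.T                              # project onto the top k axes
    return Z, components, explained_var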

Randomized SVD (fast for huge matrices)

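from sklearn.utils.extmath import randomized_svd

# X_centered: data with column means already subtracted (randomized_svd does not center)
U, s, Vt = randomized_svd(X_centered, n_components=50, random_state=42)
# Typically much faster than a full SVD when k << min(n, d); the actual speedup depends on the matrix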
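
To show where the speedup comes from, here is a minimal NumPy sketch of the randomized range-finder idea behind such routines. It is an illustration under stated assumptions, not scikit-learn's implementation: the function name `randomized_svd_sketch` and its default parameters are hypothetical.

```python
import numpy as np

def randomized_svd_sketch(A, k, n_oversamples=10, n_iter=4, seed=0):
    """Approximate top-k SVD of A via random projection (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    ell = min(k + n_oversamples, min(n, d))
    # 1. Sample the range of A with a random Gaussian test matrix.
    Q = np.linalg.qr(A @ rng.normal(size=(d, ell)))[0]
    # 2. Power iterations sharpen the spectrum; re-orthonormalize each time for stability.
    for _ in range(n_iter):
        Q = np.linalg.qr(A.T @ Q)[0]
        Q = np.linalg.qr(A @ Q)[0]
    # 3. Exact SVD of the small (ell x d) projected matrix.
    Ub, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]

# Synthetic rank-8 matrix plus small noise, then centered (illustrative data).
rng = np.random.default_rng(0)
A = rng.normal(size=(300, 8)) @ rng.normal(size=(8, 60)) + 0.01 * rng.normal(size=(300, 60))
A -= A.mean(axis=0)

U, s, Vt = randomized_svd_sketch(A, k=8)
s_exact = np.linalg.svd(A, compute_uv=False)[:8]
rel_err = np.max(np.abs(s - s_exact) / s_exact)
print("max relative error in singular values:", rel_err)
```

The expensive operations are products of A with thin matrices of width k + n_oversamples, which is the source of the roughly O(n d k) cost.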

Kernel PCA (nonlinear)

from sklearn.decomposition import KernelPCA
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.1)
Z = kpca.fit_transform(X)

Incremental PCA (streaming)

from sklearn.decomposition import IncrementalPCA
ipca = IncrementalPCA(n_components=50, batch_size=1024)
# stream: any iterable of (batch, d) arrays; each batch needs at least n_components rows
for batch in stream:
    ipca.partial_fit(batch)
Z = ipca.transform(X_test)

Whitening before downstream model

pca = PCA(whiten=True).fit(X_train)   # fit on training data only
X_train_w = pca.transform(X_train)
X_test_w  = pca.transform(X_test)
# training features now have unit variance and zero correlation; test features only approximately
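
What whitening does can be seen by writing it out by hand: project onto the principal axes, then divide each score by the square root of its eigenvalue. The sketch below assumes synthetic correlated data. The helper `fit_whitener` and the small `eps` guard are illustrative choices, not part of scikit-learn.

```python
import numpy as np

def fit_whitener(X_train, eps=1e-12):
    """Learn a PCA whitening transform from training data (illustrative sketch)."""
    mean = X_train.mean(axis=0)
    Xc = X_train - mean
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Per-component standard deviation; eps guards against division by zero.
    scale = np.sqrt(s**2 / (X_train.shape[0] - 1) + eps)

    def transform(X):
        return (X - mean) @ Vt.T / scale

    return transform

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 4))                  # mixing matrix that correlates the features
X_train = rng.normal(size=(1000, 4)) @ M
X_test = rng.normal(size=(200, 4)) @ M

whiten = fit_whitener(X_train)
cov_train = np.cov(whiten(X_train), rowvar=False)   # identity up to rounding
cov_test = np.cov(whiten(X_test), rowvar=False)     # only approximately identity
print(np.round(cov_train, 3))
```

The transform is learned once from the training split and reused on the test split, which is why the test covariance is close to the identity but not exactly equal to it.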

PCA for interpreting transformer hidden states

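import torch
from sklearn.decomposition import PCA

# model.encode is a placeholder for whatever returns one hidden-state vector per prompt
with torch.no_grad():
    hidden = model.encode(prompts)  # (B, D=4096); needs B >= 8 for 8 components
pca = PCA(n_components=8)
Z = pca.fit_transform(hidden.float().cpu().numpy())  # .float(): NumPy has no bfloat16
# Top components may correlate with sentiment / topic / refusal; check against labels before interpreting.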

Reconstruction error (anomaly detection)

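pca = PCA(n_components=10).fit(X_train)

def recon_error(A):
    # squared distance between each row and its rank-10 reconstruction
    return ((A - pca.inverse_transform(pca.transform(A))) ** 2).sum(axis=1)

threshold = np.percentile(recon_error(X_train), 99)  # calibrate on training data
anomalies = recon_error(X) > threshold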

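Choosing k via cumulative explained variance (scree-style elbow)

import matplotlib.pyplot as plt
pca_full = PCA().fit(X_std)
cum_var = np.cumsum(pca_full.explained_variance_ratio_)
plt.plot(np.arange(1, len(cum_var) + 1), cum_var)  # x axis starts at 1 component
plt.axhline(0.95, ls="--")
plt.xlabel("# components")
plt.ylabel("cumulative explained variance")
plt.show()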

Decision criteria

| Situation | Approach |
| --- | --- |
| Linear dim-reduction baseline | PCA |
| Visualization to 2D | PCA→50D → UMAP→2D |
| Nonlinear manifold | Kernel PCA / UMAP / autoencoder |
| Streaming / huge data | IncrementalPCA / randomized SVD |
| Need interpretable loadings | Sparse PCA |
| Outliers in data | Robust PCA |
| Probabilistic / missing data | Probabilistic PCA / EM-PCA |

Default: StandardScaler → PCA(n_components=0.95) → downstream model.
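
One way to wire up that default is a scikit-learn pipeline, so the scaler and PCA are refit inside every cross-validation fold. The sketch below assumes a synthetic classification task and a logistic-regression downstream model. Both are illustrative stand-ins, not choices made in the original note.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a real feature matrix (assumption).
X, y = make_classification(n_samples=400, n_features=30, n_informative=5,
                           random_state=0)

pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),             # keep enough components for 95% of the variance
    LogisticRegression(max_iter=1000),  # placeholder downstream model
)

# Scaler and PCA are fit on the training part of each fold only.
scores = cross_val_score(pipe, X, y, cv=5)

pipe.fit(X, y)
n_kept = pipe.named_steps["pca"].n_components_
print(f"components kept: {n_kept}, mean CV accuracy: {scores.mean():.3f}")
```

Keeping the preprocessing inside the pipeline avoids fitting PCA on held-out rows.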

🔗 Graph

🤖 LLM usage

When to use: linear dim-reduction, whitening, denoising, hidden-state analysis, baseline before an ML model.
When not to use: nonlinear manifold (use UMAP/autoencoder), categorical-only data (use MCA), interpretable original features required (use feature selection).

Anti-patterns

  • No standardization: features with large scale dominate components.
  • Fitting PCA on data that still contains the label column: leaks the target into the features of a supervised pipeline.
  • Reading PC1 as "the cause": components are statistical, not causal.
  • PCA → tree models: axis-aligned splits in GBDT rarely benefit from the rotation, and the rotated features are harder to interpret.
  • Forgetting sign ambiguity: V and -V both valid; component direction is arbitrary.
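
The sign ambiguity in the last item can be removed with a fixed convention. The sketch below uses one possible rule, making the largest-magnitude loading of each component positive. The helper `fix_signs` is a hypothetical name, and other conventions work equally well.

```python
import numpy as np

def fix_signs(components):
    """Flip each component (row) so its largest-magnitude loading is positive."""
    idx = np.argmax(np.abs(components), axis=1)
    signs = np.sign(components[np.arange(components.shape[0]), idx])
    return components * signs[:, None]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4)) * np.array([4.0, 2.0, 1.0, 0.5])  # illustrative data
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

flipped = -Vt   # an equally valid set of principal axes
same_after_fix = np.allclose(fix_signs(Vt), fix_signs(flipped))
print(same_after_fix)
```

If the components are flipped this way, the scores need the same per-component sign applied, otherwise scores and loadings no longer match.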

🧪 Verification / duplicates

  • Verified (Pearson 1901, Hotelling 1933, Jolliffe 2002 textbook, sklearn docs).
  • Trust level: A.

🕓 Changelog

| Date | Change |
| --- | --- |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: canonical PCA reference + 2026 mech interp use |