| Field | Value |
| --- | --- |
| id | wiki-2026-0508-multivariate-analysis |
| title | Multivariate Analysis |
| category | 10_Wiki/Topics |
| status | verified |
| canonical_id | self |
| aliases | MVA, Multivariate Statistics |
| duplicate_of | none |
| source_trust_level | A |
| confidence_score | 0.9 |
| verification_status | applied |
| tags | statistics, dimensionality-reduction, multivariate |
| raw_sources | |
| last_reinforced | 2026-05-10 |
| github_commit | pending |
| tech_stack | language: Python; framework: scikit-learn/statsmodels |

Multivariate Analysis

One-line summary

"Analyzing multiple correlated variables simultaneously." MVA builds on the covariance and correlation matrices to unify PCA, FA, CCA, MANOVA, and discriminant analysis. Even in the ML era of 2026, it remains an indispensable foundation for EDA, feature engineering, biostatistics, and marketing research.

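Core concepts

Covariance matrix Σ

  • Σᵢⱼ = E[(Xᵢ - μᵢ)(Xⱼ - μⱼ)].
  • The eigendecomposition Σ = QΛQᵀ (Q orthogonal, Λ diagonal with the eigenvalues) is the backbone of every multivariate technique listed here.
  • Sample estimate: S = (1/(n-1)) XᶜᵀXᶜ, where Xᶜ is the column-centered data matrix.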
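
The covariance formulas can be checked numerically. The following is a minimal sketch on synthetic data; the mixing matrix `A` and the sample size are arbitrary assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 500 samples of 3 correlated variables.
A = np.array([[2.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.5, 0.3, 0.2]])
X = rng.normal(size=(500, 3)) @ A.T

n = X.shape[0]
Xc = X - X.mean(axis=0)        # column-centered data matrix
S = Xc.T @ Xc / (n - 1)        # sample covariance, matches np.cov(X, rowvar=False)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
eigvals, Q = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]          # reorder to descending (PCA convention)
eigvals, Q = eigvals[order], Q[:, order]

S_rebuilt = Q @ np.diag(eigvals) @ Q.T     # Q Λ Qᵀ reproduces S
explained_ratio = eigvals / eigvals.sum()  # share of total variance per axis
```

The columns of `Q` are the principal axes, and `explained_ratio` is the quantity that PCA reports as explained variance ratio when run on the same centered data.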

Method family

  • PCA: maximum-variance projection (eigenvectors of Σ).
  • FA (Factor Analysis): latent factors plus idiosyncratic noise (X = ΛF + ε, where Λ here is the loading matrix, not the eigenvalue matrix).
  • CCA: maximum correlation between two variable sets.
  • LDA: discriminant axes (between-class vs within-class scatter).
  • MANOVA: multivariate generalization of ANOVA (Wilks' Λ, Pillai's trace).
  • MDS: distance-preserving embedding.
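
MDS is the one family member without a pattern further down, so here is a minimal sketch of classical (Torgerson) MDS in plain NumPy. The helper name `classical_mds` and the synthetic points are hypothetical. scikit-learn's `sklearn.manifold.MDS` provides an iterative alternative.

```python
import numpy as np

def classical_mds(D, k=2):
    """Embed n points in k dimensions from an (n, n) Euclidean distance matrix."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * J @ (D ** 2) @ J               # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    top = np.argsort(eigvals)[::-1][:k]       # indices of the k largest eigenvalues
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale

rng = np.random.default_rng(0)
P = rng.normal(size=(30, 2))                  # hypothetical 2-D points
D = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)

Z = classical_mds(D, k=2)
D_hat = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
```

Because the input distances come from genuinely 2-D points, `D_hat` reproduces `D` up to floating-point error; the coordinates themselves are only recovered up to rotation and reflection.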

Applications

  1. EDA on tabular data (correlation heatmap, biplot).
  2. Feature engineering before tree models or MLP.
  3. Genomics (gene expression PCA / FA).
  4. Marketing segmentation (cluster + biplot).
  5. Psychometrics (factor structure of survey).

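💻 Patterns

PCA: full pipeline

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumes X is an (n_samples, n_features) array and feature_names lists its columns.
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)  # keep enough components to explain 95% of the variance
Z = pca.fit_transform(X_std)
print(pca.explained_variance_ratio_.cumsum())

# Biplot of the first two components (needs at least two retained components)
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
plt.scatter(Z[:, 0], Z[:, 1], alpha=0.3)
for i, name in enumerate(feature_names):
    # The factor 3 only stretches the arrows for readability.
    plt.arrow(0, 0, loadings[i, 0] * 3, loadings[i, 1] * 3, color='r')
    plt.text(loadings[i, 0] * 3.2, loadings[i, 1] * 3.2, name)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()
```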

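Factor Analysis with rotation

```python
from sklearn.decomposition import FactorAnalysis

# Assumes X_std is a standardized (n_samples, n_features) array.
# The rotation parameter needs scikit-learn 0.24 or later.
fa = FactorAnalysis(n_components=3, rotation='varimax')
fa.fit(X_std)
print(fa.components_)  # loadings, shape (n_components, n_features)
```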
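
To see what the rotated loadings look like, here is a minimal sketch on synthetic data with a known two-factor structure. The loading matrix `L_true`, the noise level, and the sample size are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 2000
F = rng.normal(size=(n, 2))                         # two latent factors
L_true = np.array([[0.9, 0.0], [0.8, 0.0], [0.7, 0.0],
                   [0.0, 0.9], [0.0, 0.8], [0.0, 0.7]])
X = F @ L_true.T + 0.4 * rng.normal(size=(n, 6))    # X = ΛF + ε
X_std = StandardScaler().fit_transform(X)

fa = FactorAnalysis(n_components=2, rotation="varimax", random_state=0).fit(X_std)
loadings = fa.components_.T                         # shape (n_features, n_factors)
communalities = (loadings ** 2).sum(axis=1)         # variance explained by the factors
uniqueness = fa.noise_variance_                     # idiosyncratic variance per variable
dominant = np.abs(loadings).argmax(axis=1)          # factor each variable loads on most
```

With this setup, variables 0 to 2 should share one dominant factor and variables 3 to 5 the other, and communality plus uniqueness should be close to 1 for each standardized variable. Factor order and sign are arbitrary.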

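CCA (cross-modal)

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Assumes X_view1 and X_view2 are two feature matrices measured on the same samples.
cca = CCA(n_components=2)
cca.fit(X_view1, X_view2)
U, V = cca.transform(X_view1, X_view2)
# Canonical correlations: correlation between paired score columns of U and V
canon_corr = [np.corrcoef(U[:, k], V[:, k])[0, 1] for k in range(U.shape[1])]
```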

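Linear Discriminant Analysis

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Assumes X_std is the standardized feature matrix and y holds the class labels.
# n_components cannot exceed min(n_classes - 1, n_features), so 2 axes need at least 3 classes.
lda = LinearDiscriminantAnalysis(n_components=2)
Z = lda.fit_transform(X_std, y)  # supervised projection
```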
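
For a runnable version of the LDA pattern, the sketch below uses the iris dataset bundled with scikit-learn. The choice of dataset is an assumption for illustration; it has three classes, which is what permits two discriminant axes.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)                   # 150 samples, 4 features, 3 classes
X_std = StandardScaler().fit_transform(X)

lda = LinearDiscriminantAnalysis(n_components=2)    # 3 classes allow at most 2 axes
Z = lda.fit_transform(X_std, y)                     # supervised 2-D projection

train_acc = lda.score(X_std, y)                     # accuracy on the training data
ratio = lda.explained_variance_ratio_               # between-class variance per axis
```

`train_acc` is an in-sample figure only; a held-out split or cross-validation is needed before reading it as predictive performance.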

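MANOVA via statsmodels

```python
from statsmodels.multivariate.manova import MANOVA

# Assumes df is a pandas DataFrame with response columns y1, y2, y3 and a factor column group.
maov = MANOVA.from_formula('y1 + y2 + y3 ~ group', data=df)
print(maov.mv_test())  # Wilks' lambda, Pillai's trace, Hotelling-Lawley trace, Roy's greatest root
```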
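
For reference, the four MANOVA test statistics can be written in terms of the hypothesis (between-group) and error (within-group) sums-of-squares-and-cross-products matrices. The symbols H, E, and λᵢ (the eigenvalues of E⁻¹H) are notation assumed here, not taken from the note above.

```latex
\begin{aligned}
\text{Wilks' } \Lambda &= \frac{|E|}{|E + H|} = \prod_i \frac{1}{1 + \lambda_i} \\
\text{Pillai's trace } V &= \operatorname{tr}\!\left[ H (H + E)^{-1} \right] = \sum_i \frac{\lambda_i}{1 + \lambda_i} \\
\text{Hotelling-Lawley trace } T &= \operatorname{tr}\!\left[ H E^{-1} \right] = \sum_i \lambda_i \\
\text{Roy's greatest root } \theta &= \lambda_{\max}
\end{aligned}
```

Small values of Wilks' Λ and large values of the other three point toward a group effect. Some texts define Roy's root as λ_max / (1 + λ_max), so check the convention before comparing outputs across tools.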

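Mahalanobis distance (multivariate outliers)

```python
import numpy as np
from scipy.stats import chi2

# Assumes X is an (n_samples, p) array with a non-singular covariance matrix.
mu = X.mean(axis=0)
S_inv = np.linalg.inv(np.cov(X, rowvar=False))

def mahal(x):
    d = x - mu
    return np.sqrt(d @ S_inv @ d)

# The squared distance is approximately chi-square with p degrees of freedom
# under multivariate normality, so compare mahal(x) to the square root of the quantile.
threshold = np.sqrt(chi2.ppf(0.975, df=X.shape[1]))
```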

Decision criteria

| Situation | Approach |
| --- | --- |
| Unsupervised variance compression | PCA |
| Latent structure interpretation | Factor Analysis (with rotation) |
| Two correlated groups of variables | CCA |
| Supervised projection | LDA |
| Group-mean comparison (multivariate) | MANOVA |
| Distance-only data | MDS |
| Multivariate outlier detection | Mahalanobis distance / Minimum Covariance Determinant |

Defaults: PCA plus a correlation heatmap for EDA, LDA for supervised projection, and FA with varimax rotation for latent factors.
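
For the outlier-detection case, the classical covariance estimate is itself distorted by the outliers it is supposed to find. Below is a minimal sketch of the robust alternative using scikit-learn's `MinCovDet`; the planted outliers and the 97.5% cutoff are assumptions for illustration.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(0)
p = 3
inliers = rng.normal(size=(300, p))
outliers = rng.normal(loc=8.0, size=(10, p))   # planted far from the bulk
X = np.vstack([inliers, outliers])

mcd = MinCovDet(random_state=0).fit(X)         # robust location and covariance
d2 = mcd.mahalanobis(X)                        # squared robust Mahalanobis distances
cutoff = chi2.ppf(0.975, df=p)                 # chi-square cutoff for squared distances
is_outlier = d2 > cutoff
```

Note that `mahalanobis` returns squared distances, so the chi-square quantile is applied directly without taking a square root.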

🔗 Graph

🤖 LLM usage

When to use: generating EDA narratives (interpreting a PCA biplot), labeling factors, writing up MANOVA results. When not to use: computing the actual decompositions (use numpy/sklearn for that).

Anti-patterns

  • No standardization: running PCA before scaling lets large-magnitude variables dominate.
  • PCA on nonlinear structure: applying PCA to a swiss-roll-like manifold; use t-SNE, UMAP, or Isomap instead.
  • FA without rotation: unrotated factors are hard to interpret; apply varimax or promax.
  • MANOVA assumptions: skipping the checks for multivariate normality and equal covariance matrices can give wrong p-values.
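
The standardization anti-pattern is easy to reproduce. In this minimal sketch on synthetic data (the scales and correlation are assumptions), an unrelated variable on a huge scale captures the first principal component until the data are standardized.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
small = rng.normal(size=(n, 2))                 # two unit-scale variables
small[:, 1] += 0.8 * small[:, 0]                # make them correlated
big = 1000.0 * rng.normal(size=(n, 1))          # unrelated variable on a huge scale
X = np.hstack([small, big])

# First principal axis without and with standardization.
raw_pc1 = PCA(n_components=1).fit(X).components_[0]
std_pc1 = PCA(n_components=1).fit(StandardScaler().fit_transform(X)).components_[0]
```

`raw_pc1` points almost entirely along the third (large-scale) variable, whereas `std_pc1` picks up the shared direction of the two correlated variables.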

🧪 Verification / Duplicates

  • Verified (Johnson & Wichern "Applied Multivariate", Hardle & Simar).
  • Trust level: A.

🕓 Changelog

| Date | Change |
| --- | --- |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: full MVA toolkit (PCA/FA/CCA/LDA/MANOVA) |