---
id: wiki-2026-0508-multivariate-analysis
title: Multivariate Analysis
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [MVA, Multivariate Statistics]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [statistics, dimensionality-reduction, multivariate]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: scikit-learn/statsmodels
---

# Multivariate Analysis

## In One Line

> **"Multiple correlated variables, analyzed simultaneously."** MVA builds on the covariance and correlation matrices to unify PCA, FA, CCA, MANOVA, and discriminant analysis. Even in the ML era of 2026 it remains an indispensable foundation for EDA, feature engineering, biostatistics, and marketing research.

## Core Concepts

### Covariance matrix Σ

- Σᵢⱼ = E[(Xᵢ - μᵢ)(Xⱼ - μⱼ)].
- The eigendecomposition Σ = QΛQᵀ (Q: orthonormal eigenvectors, Λ: diagonal matrix of eigenvalues) underlies most of the techniques below.
- Sample estimate: S = (1/(n-1)) XᶜᵀXᶜ, where Xᶜ is the column-centered data matrix.

### Method family

- **PCA**: maximum-variance projection (eigenvectors of Σ).
- **FA (Factor Analysis)**: latent factors plus idiosyncratic noise (X = ΛF + ε, where Λ here denotes the loading matrix, not the eigenvalue matrix above).
- **CCA**: maximum correlation between two sets of variables.
- **LDA**: discriminant axes (between-class vs. within-class scatter).
- **MANOVA**: multivariate generalization of ANOVA (Wilks' Λ, Pillai's trace).
- **MDS**: distance-preserving embedding.

### Applications

1. EDA on tabular data (correlation heatmap, biplot).
2. Feature engineering before tree models or an MLP.
3. Genomics (PCA / FA on gene expression).
4. Marketing segmentation (clustering + biplot).
5. Psychometrics (factor structure of surveys).

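The covariance-matrix definitions above can be sketched in NumPy alone. This is a minimal illustration, not a reference implementation: the helper name `sample_covariance_eig` and the synthetic `X_demo` data are assumptions introduced here for demonstration.

```python
import numpy as np


def sample_covariance_eig(X):
    """Return (S, eigvals, eigvecs) for data X of shape (n_samples, n_features).

    Hypothetical helper for illustration; eigenvalues are sorted descending.
    """
    Xc = X - X.mean(axis=0)                # center each column
    S = Xc.T @ Xc / (X.shape[0] - 1)       # S = (1/(n-1)) Xc^T Xc
    eigvals, eigvecs = np.linalg.eigh(S)   # eigh: for symmetric matrices, ascending order
    order = np.argsort(eigvals)[::-1]      # reorder to descending variance
    return S, eigvals[order], eigvecs[:, order]


# Synthetic correlated data (assumption: any numeric 2-D array works the same way)
rng = np.random.default_rng(0)
mixing = rng.normal(size=(4, 4))
X_demo = rng.normal(size=(200, 4)) @ mixing

S, lam, Q = sample_covariance_eig(X_demo)

# Scores on the eigenvector axes; their variances equal the eigenvalues
scores = (X_demo - X_demo.mean(axis=0)) @ Q
score_var = scores.var(axis=0, ddof=1)
```

Here `Q @ np.diag(lam) @ Q.T` reconstructs `S`, and the variances of the projected scores match `lam`. These are the same quantities that scikit-learn's `PCA` reports as `explained_variance_`.
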
## 💻 Patterns

### PCA: full pipeline

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# X: (n_samples, n_features) array; feature_names: list of column labels
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)  # keep enough components to explain 95% of variance
Z = pca.fit_transform(X_std)
print(pca.explained_variance_ratio_.cumsum())

# Biplot: scores as points, loadings as arrows (assumes at least 2 components are kept)
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
plt.scatter(Z[:, 0], Z[:, 1], alpha=0.3)
for i, name in enumerate(feature_names):
    # the factor 3 only stretches the arrows for readability
    plt.arrow(0, 0, loadings[i, 0] * 3, loadings[i, 1] * 3, color='r')
    plt.text(loadings[i, 0] * 3.2, loadings[i, 1] * 3.2, name)
plt.show()
```

### Factor Analysis with rotation

```python
from sklearn.decomposition import FactorAnalysis

fa = FactorAnalysis(n_components=3, rotation='varimax')
fa.fit(X_std)
print(fa.components_)  # loadings, shape (n_components, n_features)
```

### CCA (cross-modal)

```python
from sklearn.cross_decomposition import CCA

cca = CCA(n_components=2)
cca.fit(X_view1, X_view2)
U, V = cca.transform(X_view1, X_view2)
# Canonical correlations: correlation between paired columns U[:, k] and V[:, k]
```

### Linear Discriminant Analysis

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# n_components must be <= min(n_classes - 1, n_features), so 2 needs at least 3 classes
lda = LinearDiscriminantAnalysis(n_components=2)
Z = lda.fit_transform(X_std, y)  # supervised projection
```

### MANOVA via statsmodels

```python
from statsmodels.multivariate.manova import MANOVA

maov = MANOVA.from_formula('y1 + y2 + y3 ~ group', data=df)
print(maov.mv_test())  # Wilks, Pillai, Hotelling, Roy
```

### Mahalanobis distance (multivariate outliers)

```python
import numpy as np
from scipy.stats import chi2

mu = X.mean(axis=0)
S_inv = np.linalg.inv(np.cov(X, rowvar=False))

def mahal(x):
    """Mahalanobis distance of one observation x from the sample mean."""
    d = x - mu
    return np.sqrt(d @ S_inv @ d)

# Under multivariate normality the SQUARED distance is approximately chi-square
# with p = X.shape[1] degrees of freedom, so flag x when
# mahal(x)**2 > chi2.ppf(0.975, df=p)
```

## Decision Criteria

| Situation | Approach |
|---|---|
| Unsupervised variance compression | PCA |
| Latent structure interpretation | Factor Analysis (with rotation) |
| Two correlated groups of variables | CCA |
| Supervised projection | LDA |
| Group-mean comparison (multivariate) | MANOVA |
| Distance-only data | MDS |
| Multivariate outlier detection | Mahalanobis distance / Minimum Covariance Determinant |

**Default**: PCA + correlation heatmap for EDA, LDA for supervised projection, FA + varimax for latent factors.

## 🔗 Graph

- Parent: [[Statistics]] · [[Linear-Algebra-Foundations|Linear-Algebra]]
- Variants: [[PCA]] · [[Factor-Analysis]] · [[LDA]]
- Applications: [[EDA]] · [[Feature Engineering|Feature-Engineering]]
- Adjacent: [[Dimensionality-Reduction]] · [[t-SNE]] · [[UMAP]]

## 🤖 LLM Usage

**When**: EDA narrative generation (interpreting a PCA biplot), factor labeling, MANOVA result write-ups.

**When not**: computing the actual decomposition (use numpy/sklearn).

## ❌ Anti-patterns

- **No standardization**: running PCA before scaling lets large-magnitude variables dominate the components.
- **PCA on nonlinear structure**: PCA cannot unfold a manifold such as a swiss roll; use t-SNE, UMAP, or Isomap instead.
- **FA without rotation**: unrotated factors are hard to interpret; apply varimax or promax.
- **Unchecked MANOVA assumptions**: skipping the checks for multivariate normality and equal covariance matrices across groups makes the p-values unreliable.

## 🧪 Verification / Duplicates

- Verified (Johnson & Wichern "Applied Multivariate", Hardle & Simar).
- Trust level A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: full MVA toolkit (PCA/FA/CCA/LDA/MANOVA) |