---
id: wiki-2026-0508-multivariate-analysis
title: Multivariate Analysis
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [MVA, Multivariate Statistics]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [statistics, dimensionality-reduction, multivariate]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: scikit-learn/statsmodels
---

# Multivariate Analysis

## One-line summary
> **"Analyzing multiple correlated variables simultaneously."** MVA builds on the covariance and correlation matrices and unifies PCA, FA, CCA, MANOVA, and discriminant analysis. Even in the 2026 ML era it remains an indispensable foundation for EDA, feature engineering, biostatistics, and marketing research.

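## Core concepts

### Covariance matrix Σ
- Σᵢⱼ = E[(Xᵢ - μᵢ)(Xⱼ - μⱼ)].
- The eigendecomposition Σ = QΛQᵀ (Q orthogonal, Λ diagonal with non-negative eigenvalues) is the backbone of PCA and underlies many other multivariate techniques.
- Sample estimate: S = (1/(n-1)) XᶜᵀXᶜ, where Xᶜ is the column-centered data matrix with n rows.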
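
The sample covariance and its eigendecomposition can be checked numerically. The following is a minimal sketch on synthetic data; the mixing matrix `A` and the sample size are illustrative assumptions, not part of the original note.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 200 samples, 3 correlated variables (illustrative only).
A = np.array([[2.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.5, 0.3, 0.2]])
X = rng.standard_normal((200, 3)) @ A.T

# Center the columns, then form S = Xc^T Xc / (n - 1).
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / (X.shape[0] - 1)

# eigh is intended for symmetric matrices; eigenvalues come back in ascending order.
eigvals, Q = np.linalg.eigh(S)

# Rebuild S from its eigendecomposition: S = Q diag(eigvals) Q^T.
S_rebuilt = Q @ np.diag(eigvals) @ Q.T
```

`S` should agree with `np.cov(X, rowvar=False)`, and `S_rebuilt` should match `S` up to floating-point error.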

### Family of methods
- **PCA**: max variance projection (eigen of Σ).
- **FA (Factor Analysis)**: latent factors + idiosyncratic noise (X = ΛF + ε, where Λ here denotes the loading matrix rather than the eigenvalue matrix).
- **CCA**: max correlation between two variable sets.
- **LDA**: discriminant axes (between-class vs within-class scatter).
- **MANOVA**: multivariate generalization of ANOVA (Wilks Λ, Pillai trace).
- **MDS**: distance-preserving embedding.
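
Of the methods listed, MDS is the one that works from distances alone. Below is a minimal sketch of classical (Torgerson) MDS in NumPy, assuming the distances are Euclidean; the helper name `classical_mds` and the test points are hypothetical. Iterative variants such as SMACOF take a different route and are not shown.

```python
import numpy as np

def classical_mds(D, k=2):
    """Embed n points in k dimensions from an (n, n) Euclidean distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)     # ascending eigenvalues
    idx = np.argsort(eigvals)[::-1][:k]      # indices of the k largest
    lam = np.clip(eigvals[idx], 0.0, None)   # guard against tiny negative values
    return eigvecs[:, idx] * np.sqrt(lam)

rng = np.random.default_rng(1)
P = rng.standard_normal((30, 2))                            # hypothetical 2-D points
D = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)  # pairwise distances
Y = classical_mds(D, k=2)
D_hat = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
```

Because the points are genuinely two-dimensional, the embedding `Y` should reproduce the pairwise distances (`D_hat` close to `D`), although `Y` itself is only determined up to rotation and reflection.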

### Applications
1. EDA on tabular data (correlation heatmap, biplot).
2. Feature engineering before tree models or MLP.
3. Genomics (gene expression PCA / FA).
4. Marketing segmentation (cluster + biplot).
5. Psychometrics (factor structure of survey).

## 💻 Patterns

### PCA — full pipeline
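```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# X: (n_samples, n_features) array; feature_names: list of column names.
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)  # keep enough components for 95% of the variance
Z = pca.fit_transform(X_std)
print(pca.explained_variance_ratio_.cumsum())

# Biplot of the first two components (needs at least 2 retained components)
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
plt.scatter(Z[:, 0], Z[:, 1], alpha=0.3)
for i, name in enumerate(feature_names):
    # The factor 3 only stretches the arrows for visibility.
    plt.arrow(0, 0, loadings[i, 0] * 3, loadings[i, 1] * 3, color='r')
    plt.text(loadings[i, 0] * 3.2, loadings[i, 1] * 3.2, name)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()
```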

### Factor Analysis with rotation
```python
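from sklearn.decomposition import FactorAnalysis

# X_std: standardized data, as in the PCA pattern; rotation needs scikit-learn >= 0.24
fa = FactorAnalysis(n_components=3, rotation='varimax')
fa.fit(X_std)
print(fa.components_)  # loadings, shape (n_components, n_features)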
```

### CCA (cross-modal)
```python
from sklearn.cross_decomposition import CCA
cca = CCA(n_components=2)
cca.fit(X_view1, X_view2)
U, V = cca.transform(X_view1, X_view2)
# corr(U[:, k], V[:, k]) for k = 0, 1 gives the canonical correlations
```
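
To make "max correlation between two variable sets" concrete, canonical correlations can also be computed directly with linear algebra. This is a minimal NumPy sketch using the QR-then-SVD approach (principal angles between the centered column spaces); it is not how scikit-learn's `CCA` is implemented, and the helper name `canonical_correlations` and the synthetic views are assumptions for illustration.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations between the columns of X and Y (rows are samples)."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Orthonormal bases for the centered column spaces.
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    # Singular values of Qx^T Qy are the cosines of the principal angles,
    # i.e. the canonical correlations, in descending order.
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(s, 0.0, 1.0)

rng = np.random.default_rng(4)
n = 500
shared = rng.standard_normal(n)  # hypothetical latent signal present in both views
X = np.column_stack([shared + 0.3 * rng.standard_normal(n),
                     rng.standard_normal(n)])
Y = np.column_stack([shared + 0.3 * rng.standard_normal(n),
                     rng.standard_normal(n),
                     rng.standard_normal(n)])
rho = canonical_correlations(X, Y)
print(rho)
```

With one shared signal, the first canonical correlation should be high and the second close to zero; the number of correlations equals the smaller of the two column counts.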

### Linear Discriminant Analysis
```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# n_components must be <= min(n_classes - 1, n_features), so 2 needs at least 3 classes
lda = LinearDiscriminantAnalysis(n_components=2)
Z = lda.fit_transform(X_std, y)  # supervised projection
```

### MANOVA via statsmodels
```python
from statsmodels.multivariate.manova import MANOVA
maov = MANOVA.from_formula('y1 + y2 + y3 ~ group', data=df)
print(maov.mv_test())  # Wilks, Pillai, Hotelling, Roy
```
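
The statistics reported by `mv_test()` are functions of the within-group and between-group sums-of-squares-and-cross-products (SSCP) matrices. The following minimal sketch computes Wilks' Λ and Pillai's trace by hand for a one-way layout; the group shifts and sample sizes are made-up illustrative values, and no p-values are derived.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical data: 3 groups, 3 response variables, 40 samples per group.
shifts = np.array([[0.0, 0.0, 0.0], [1.0, 0.5, 0.0], [0.0, 1.0, 1.0]])
Y = np.vstack([s + rng.standard_normal((40, 3)) for s in shifts])
group = np.repeat(np.arange(3), 40)

grand_mean = Y.mean(axis=0)
W = np.zeros((3, 3))  # within-group (error) SSCP
B = np.zeros((3, 3))  # between-group (hypothesis) SSCP
for g in np.unique(group):
    Yg = Y[group == g]
    mg = Yg.mean(axis=0)
    W += (Yg - mg).T @ (Yg - mg)
    d = (mg - grand_mean)[:, None]
    B += len(Yg) * (d @ d.T)

# Wilks' lambda = det(W) / det(W + B); values near 0 indicate strong group separation.
wilks_lambda = np.linalg.det(W) / np.linalg.det(W + B)
# Pillai's trace = tr(B (W + B)^{-1}).
pillai_trace = np.trace(B @ np.linalg.inv(W + B))
print(f"Wilks' lambda = {wilks_lambda:.4f}, Pillai's trace = {pillai_trace:.4f}")
```

Both statistics depend only on the eigenvalues of W⁻¹B, which is why the different MANOVA tests often, though not always, agree.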

### Mahalanobis distance (multivariate outliers)
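```python
import numpy as np
from scipy.stats import chi2

# X: (n_samples, p) array
mu = X.mean(axis=0)
S_inv = np.linalg.inv(np.cov(X, rowvar=False))

def mahal(x):
    """Mahalanobis distance of one observation x from the sample mean."""
    d = x - mu
    return np.sqrt(d @ S_inv @ d)

# Under multivariate normality the SQUARED distance is approximately chi2(p),
# so flag x when mahal(x)**2 > threshold.
threshold = chi2.ppf(0.975, df=X.shape[1])
```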
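
The classical mean and covariance are themselves pulled toward outliers, which can mask them. A robust alternative is the Minimum Covariance Determinant estimator; the sketch below uses scikit-learn's `MinCovDet` on synthetic data with planted outliers (the data, shift size, and cutoff are illustrative assumptions).

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(5)
X = rng.standard_normal((300, 3))
X[:10] += 8.0  # hypothetical planted outliers in the first 10 rows

# Fit the robust location and covariance estimate.
mcd = MinCovDet(random_state=0).fit(X)

# MinCovDet.mahalanobis returns SQUARED distances, so no square root is needed
# before comparing against the chi-square quantile.
d2 = mcd.mahalanobis(X)
threshold = chi2.ppf(0.975, df=X.shape[1])
outliers = np.flatnonzero(d2 > threshold)
print(outliers)
```

The planted rows should be flagged, along with a small share of ordinary points, as expected from a 97.5% cutoff.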

## Decision criteria
| Situation | Approach |
|---|---|
| Unsupervised variance compression | PCA |
| Latent structure interpretation | Factor Analysis (with rotation) |
| Two correlated groups of variables | CCA |
| Supervised projection | LDA |
| Group-mean comparison (multivariate) | MANOVA |
| Distance-only data | MDS |
| Multivariate outlier detection | Mahalanobis / Minimum Covariance Determinant |

**Defaults**: PCA + correlation heatmap for EDA, LDA for supervised projection, FA + varimax for latent factors.

## 🔗 Graph
- Parent: [[Statistics]] · [[Linear-Algebra-Foundations|Linear-Algebra]]
- Variants: [[PCA]] · [[Factor-Analysis]] · [[LDA]]
- Applications: [[EDA]] · [[Feature Engineering|Feature-Engineering]]
- Adjacent: [[Dimensionality-Reduction]] · [[t-SNE]] · [[UMAP]]

## 🤖 LLM usage
**When**: EDA narrative generation (interpreting a PCA biplot), factor labeling, MANOVA result write-ups.
**When not**: computing the actual decomposition (use numpy/sklearn).

## ❌ Anti-patterns
- **No standardization**: running PCA before scaling lets large-magnitude variables dominate the components.
- **PCA on nonlinear structure**: applying PCA to a swiss-roll-like manifold; use t-SNE, UMAP, or Isomap instead.
- **FA without rotation**: unrotated factors are hard to interpret; apply varimax or promax.
- **Unchecked MANOVA assumptions**: skipping checks for multivariate normality and equal covariance matrices can yield wrong p-values.
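
The standardization anti-pattern is easy to reproduce. The following minimal sketch uses two independent synthetic features on very different scales; the feature names and magnitudes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 500
# Two hypothetical independent features on very different scales.
income = rng.normal(50_000, 15_000, n)  # large magnitude
rating = rng.normal(3.0, 1.0, n)        # small magnitude
X = np.column_stack([income, rating])

def first_pc_and_ratio(M):
    """First principal axis and its explained-variance ratio via eigh of the covariance."""
    vals, vecs = np.linalg.eigh(np.cov(M, rowvar=False))
    return vecs[:, -1], vals[-1] / vals.sum()

pc_raw, ratio_raw = first_pc_and_ratio(X)

# Standardize each column to zero mean and unit variance, then repeat.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
pc_std, ratio_std = first_pc_and_ratio(X_std)

print("raw:", pc_raw, ratio_raw)
print("standardized:", pc_std, ratio_std)
```

On the raw data the first component is essentially the `income` axis and explains almost all of the variance. After standardization, neither feature dominates and the explained-variance ratio falls to roughly one half, as expected for two uncorrelated variables.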

## 🧪 Verification / Duplicates
- Verified against Johnson & Wichern, "Applied Multivariate Statistical Analysis", and Härdle & Simar.
- Trust level: A.

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — full MVA toolkit (PCA/FA/CCA/LDA/MANOVA) |