---
id: wiki-2026-0508-dimensionality-reduction
title: Dimensionality Reduction
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [PCA, t-SNE, UMAP, autoencoder, curse of dimensionality, feature extraction]
duplicate_of: none
source_trust_level: A
confidence_score: 0.93
verification_status: applied
tags: [dimensionality-reduction, pca, tsne, umap, autoencoder, visualization, manifold-learning]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: scikit-learn / umap-learn / PyTorch
---
# Dimensionality Reduction
## In one line
> **"The essence of high-dimensional data, in low dimensions."** PCA (linear) → t-SNE / UMAP (nonlinear, visualization) → Autoencoder / VAE (deep). Modern: embeddings (CLIP, sentence-transformers) act as implicit dimensionality reduction.
## Core methods
### Linear
#### PCA (Principal Component Analysis)
- Finds the directions of maximum variance.
- Axes (principal components) are mutually orthogonal.
- Computed via SVD of the centered data.
- Fast and interpretable.
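
The bullets above can be sketched directly in NumPy. This is a minimal illustration, not a replacement for `sklearn.decomposition.PCA`; the synthetic data, the variable names, and `k = 2` are assumptions made for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 200 samples, 5 correlated features (illustrative only).
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

# 1. Center the data; principal axes are defined relative to the mean.
Xc = X - X.mean(axis=0)

# 2. Thin SVD: the rows of Vt are the orthonormal principal axes.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# 3. Variance along each axis and its share of the total variance.
explained_variance = S**2 / (len(X) - 1)
explained_ratio = explained_variance / explained_variance.sum()

# 4. Project onto the top-k axes.
k = 2
X_reduced = Xc @ Vt[:k].T

# Cross-check against scikit-learn (axis signs may differ).
sk = PCA(n_components=k).fit(X)
print(np.allclose(explained_ratio[:k], sk.explained_variance_ratio_))
```

The sign of each axis is arbitrary, so projections from two implementations should be compared up to sign.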
#### LDA (Linear Discriminant Analysis)
- Maximizes class separation (between-class scatter relative to within-class scatter).
- Supervised: requires labels.
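
A minimal sketch of LDA used as a supervised reducer, assuming scikit-learn and its bundled Iris dataset (the dataset choice is only for illustration). Unlike PCA, `fit_transform` takes the labels, and the number of components is capped at `n_classes - 1`.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# Three classes, so at most two discriminant axes are available.
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)

print(X_lda.shape)                    # (150, 2)
print(lda.explained_variance_ratio_)  # share of between-class variance per axis
```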
#### Factor Analysis
- Explains the shared variance of the features through a few latent factors, with a separate noise term per feature.
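
A small sketch of the factor model, assuming scikit-learn's `FactorAnalysis`. The data is generated from two latent factors plus per-feature noise; all sizes and names here are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n_samples, n_features, n_factors = 500, 6, 2

# Generative model assumed by FA: x = z W + noise, with one noise level per feature.
W = rng.normal(size=(n_factors, n_features))
Z = rng.normal(size=(n_samples, n_factors))
noise_std = np.linspace(0.1, 1.0, n_features)
X = Z @ W + rng.normal(size=(n_samples, n_features)) * noise_std

fa = FactorAnalysis(n_components=n_factors, random_state=0)
Z_hat = fa.fit_transform(X)

print(Z_hat.shape)           # (500, 2)
print(fa.components_.shape)  # (2, 6)
print(fa.noise_variance_)    # one estimated noise variance per feature
```

The per-feature `noise_variance_` is the main difference from PCA, which treats all directions of residual variance alike.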
### Nonlinear (manifold)
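#### t-SNE (van der Maaten & Hinton, 2008)
- Preserves local neighborhoods.
- Strong for visualization.
- Weak at preserving global structure.
- Stochastic: results vary with the random seed.
#### UMAP (McInnes et al., 2018)
- Widely used as the successor to t-SNE.
- Typically faster, and tends to preserve more global structure.
- A common default for high-dimensional visualization.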
#### Isomap
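- Preserves geodesic distances along the manifold (shortest paths on a neighbor graph).
#### LLE (Locally Linear Embedding)
- Preserves the weights that reconstruct each point from its nearest neighbors.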
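
Both Isomap and LLE ship with scikit-learn. A minimal sketch on a synthetic swiss roll; the dataset, `n_neighbors=10`, and the sample count are assumptions chosen for illustration, and results on real data depend heavily on the neighborhood size.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap, LocallyLinearEmbedding

# 3-D points lying on a rolled-up 2-D sheet; t is the position along the roll.
X, t = make_swiss_roll(n_samples=1000, noise=0.05, random_state=0)

# Isomap: shortest-path (geodesic) distances on a k-nearest-neighbor graph.
iso = Isomap(n_neighbors=10, n_components=2)
X_iso = iso.fit_transform(X)

# LLE: keep each point's local linear reconstruction weights.
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2, random_state=0)
X_lle = lle.fit_transform(X)

print(X_iso.shape, X_lle.shape)
```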
### Neural
#### Autoencoder
- A bottleneck layer forces a lower-dimensional representation.
#### VAE (Variational AE)
- Probabilistic: the encoder outputs a distribution over latent codes.
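
A minimal sketch of the probabilistic part, assuming PyTorch. The encoder predicts a mean and log-variance, a latent code is sampled with the reparameterization trick, and the loss adds a KL term to the reconstruction error. Layer sizes, `vae_loss`, and the random batch are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, input_dim, latent_dim=8, hidden_dim=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.mu = nn.Linear(hidden_dim, latent_dim)
        self.logvar = nn.Linear(hidden_dim, latent_dim)
        self.dec = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, input_dim),
        )

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization: z = mu + sigma * eps keeps sampling differentiable.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

def vae_loss(x, x_recon, mu, logvar):
    recon = F.mse_loss(x_recon, x, reduction="sum")
    # Closed-form KL divergence between N(mu, sigma^2) and N(0, I).
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return (recon + kl) / x.shape[0]

torch.manual_seed(0)
x = torch.randn(16, 20)  # placeholder batch, not real data
model = VAE(input_dim=20)
x_recon, mu, logvar = model(x)
loss = vae_loss(x, x_recon, mu, logvar)
loss.backward()
```

For dimensionality reduction, `mu` is usually taken as the low-dimensional representation of each input.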
#### Self-Supervised Embedding
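- CLIP, BERT, sentence-transformers.
- Act as implicit dimensionality reduction.
### PaCMAP / TriMap (recent)
- Related to UMAP rather than variants of it: TriMap uses triplet constraints, PaCMAP uses sampled point pairs.
- Designed to preserve global structure better.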
### Applications
1. **Visualization** (2D / 3D): t-SNE, UMAP.
2. **Speed** (preprocessing): PCA.
3. **Anomaly detection**: autoencoder.
4. **Feature extraction**: embeddings.
5. **Compression**: quantization + embeddings.
6. **Clustering preprocessing**.
7. **RAG** (vector DB): PCA / quantization.
### Curse of dimensionality
- Distances lose meaning: nearest and farthest neighbors become almost equally far.
- Data becomes sparse in high dimensions.
- The number of samples needed grows exponentially with dimension.
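
The first bullet can be checked numerically. The sketch below, under the assumption of points drawn uniformly from the unit hypercube, measures the relative contrast `(max - min) / min` of the distances from one query point; `relative_contrast` is a helper defined here for illustration.

```python
import numpy as np

def relative_contrast(dim, n_points=500, seed=0):
    """(max - min) / min distance from one query to uniform random points."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, dim))
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

The contrast shrinks as the dimension grows, which is why nearest-neighbor methods tend to benefit from reduction first.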
## 💻 Patterns
### PCA (sklearn)
```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)  # keep 95% of the variance
X_reduced = pca.fit_transform(X_scaled)
print(f'Original: {X.shape[1]}, reduced: {pca.n_components_}')
print(f'Explained variance: {pca.explained_variance_ratio_.cumsum()}')
```
### t-SNE
```python
from sklearn.manifold import TSNE
tsne = TSNE(
    n_components=2,
    perplexity=30,
    max_iter=1000,  # named `n_iter` in scikit-learn < 1.5
    random_state=42,
)
X_2d = tsne.fit_transform(X[:5000])  # t-SNE is slow, so embed only the first 5,000 rows
```
### UMAP (modern)
```python
import umap
reducer = umap.UMAP(
    n_components=2,
    n_neighbors=15,
    min_dist=0.1,
    metric='cosine',  # works well for embedding vectors
    random_state=42,
)
X_2d = reducer.fit_transform(X)
```
### Autoencoder (PyTorch)
```python
import torch
import torch.nn as nn

class AE(nn.Module):
    def __init__(self, input_dim, latent_dim=32):
        super().__init__()
        # Encoder compresses the input down to the latent bottleneck.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, latent_dim),
        )
        # Decoder mirrors the encoder to reconstruct the input.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

# Use the latent code as the reduced representation.
model = AE(input_dim=784)
# ... train ...
model.eval()
with torch.no_grad():
    _, latent = model(X_test)  # X_test: float tensor of shape (n, 784)
```
### Visualization combo (UMAP + scatter)
```python
import matplotlib.pyplot as plt
import umap
X_2d = umap.UMAP().fit_transform(X)
plt.figure(figsize=(10, 8))
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, cmap='tab10', alpha=0.5, s=10)
plt.colorbar()
plt.title('UMAP projection')
plt.show()
```
### PCA for speed (vector DB preprocessing)
```python
from sklearn.decomposition import PCA
import faiss
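# 768-dim embeddings (e.g. from a BERT-base-sized encoder) → 256 dims
embeddings = get_embeddings(documents)  # placeholder helper returning an (n, 768) array
pca = PCA(n_components=256)
reduced = pca.fit_transform(embeddings).astype('float32')
# Faiss inner-product index; L2-normalize `reduced` first if cosine similarity is intended
index = faiss.IndexFlatIP(256)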
index.add(reduced)
```
### Quantization (vector DB modern)
```python
import faiss
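dim = 768
nlist, m, nbits = 100, 8, 8  # coarse cells, sub-quantizers, bits per sub-quantizer code
quantizer = faiss.IndexFlatL2(dim)  # IndexIVFPQ defaults to the L2 metric
index = faiss.IndexIVFPQ(quantizer, dim, nlist, m, nbits)  # Faiss's SWIG constructors take positional arguments
# Each 768-dim float32 vector (3,072 bytes) is stored as an 8-byte PQ code:
# about 384× smaller, before ID and codebook overhead.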
index.train(embeddings_np)
index.add(embeddings_np)
```
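
The product-quantization idea can be sketched with NumPy and scikit-learn's `KMeans`. This illustrates the technique only; it is not how Faiss implements it and it omits the IVF coarse quantizer. The sizes (`dim=64`, `m=8`, `ksub=16`) are small assumptions chosen so the example runs quickly.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n, dim, m, ksub = 2000, 64, 8, 16  # vectors, dimensions, sub-vectors, centroids per sub-space
X = rng.normal(size=(n, dim)).astype("float32")
dsub = dim // m  # dimensions per sub-vector

codebooks = []
codes = np.empty((n, m), dtype=np.uint8)
for j in range(m):
    # Learn a separate small codebook for each slice of the vector.
    sub = X[:, j * dsub:(j + 1) * dsub]
    km = KMeans(n_clusters=ksub, n_init=1, random_state=0).fit(sub)
    codebooks.append(km.cluster_centers_)
    codes[:, j] = km.labels_  # index of the nearest centroid

# Decode: concatenate the chosen centroid of every sub-space.
X_hat = np.hstack([codebooks[j][codes[:, j]] for j in range(m)])

raw_bytes = dim * 4          # float32 storage per vector
code_bytes = codes.shape[1]  # one uint8 per sub-vector in this sketch
print(raw_bytes / code_bytes, float(np.mean((X - X_hat) ** 2)))
```

Each vector is stored as `m` small integers instead of `dim` floats, at the cost of the reconstruction error printed at the end.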
### Word2Vec / CLIP-style (implicit reduction)
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')  # 384-dim output
embeddings = model.encode(sentences)
# A sentence of variable length maps to one fixed 384-dim vector.
```
### Reconstruction error (anomaly)
```python
def detect_anomaly(model, X, threshold):
    # Flag samples whose mean squared reconstruction error exceeds the threshold.
    X_recon, _ = model(X)
    error = ((X_recon - X) ** 2).mean(dim=1)
    return error > threshold
```
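
One common way to pick `threshold` is a high percentile of the reconstruction errors on normal training data. The sketch below assumes that approach and, so that it runs without training a network, uses PCA as a linear stand-in for the autoencoder (`transform` as encoder, `inverse_transform` as decoder). The synthetic data and the 99th percentile are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Normal data lies near a 3-dim subspace of a 20-dim space (synthetic).
basis = rng.normal(size=(3, 20))
X_train = rng.normal(size=(1000, 3)) @ basis + 0.05 * rng.normal(size=(1000, 20))
X_normal = rng.normal(size=(200, 3)) @ basis + 0.05 * rng.normal(size=(200, 20))
X_outlier = 3.0 * rng.normal(size=(20, 20))  # points far from the subspace

# PCA stands in for the autoencoder here.
pca = PCA(n_components=3).fit(X_train)

def recon_error(X):
    X_recon = pca.inverse_transform(pca.transform(X))
    return ((X_recon - X) ** 2).mean(axis=1)

# Threshold: 99th percentile of the errors seen on normal training data.
threshold = np.percentile(recon_error(X_train), 99)

false_alarm_rate = (recon_error(X_normal) > threshold).mean()
detection_rate = (recon_error(X_outlier) > threshold).mean()
print(false_alarm_rate, detection_rate)
```

The percentile trades false alarms against missed anomalies and should be tuned on held-out data.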
### Choose dimension (elbow / cumvar)
```python
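import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
pca = PCA().fit(X_scaled)  # X_scaled: standardized features, as in the PCA pattern above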
cumvar = np.cumsum(pca.explained_variance_ratio_)
plt.plot(cumvar)
plt.xlabel('Component')
plt.ylabel('Cumulative variance')
plt.axhline(0.95, color='r', linestyle='--')
plt.show()
n_components = np.argmax(cumvar >= 0.95) + 1
```
### Manifold visualization comparison
```python
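import matplotlib.pyplot as plt
import umap
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def viz_compare(X, labels):
    # Project the same data with three reducers and plot them side by side.
    fig, axes = plt.subplots(1, 3, figsize=(20, 6))
    for ax, (name, reducer) in zip(axes, [
        ('PCA', PCA(n_components=2)),
        ('t-SNE', TSNE(n_components=2, random_state=42)),
        ('UMAP', umap.UMAP(n_components=2, random_state=42)),
    ]):
        proj = reducer.fit_transform(X)
        ax.scatter(proj[:, 0], proj[:, 1], c=labels, cmap='tab10', s=5)
        ax.set_title(name)
    plt.show()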
```
## Decision criteria
| Situation | Method |
|---|---|
| Speed (preprocess) | PCA |
| Visualization | UMAP |
| Cluster preserve | UMAP |
| Variance interpret | PCA |
| Class-aware | LDA |
| Text → embedding | Sentence-transformer |
| Image → embedding | CLIP |
| Vector DB compress | PCA / PQ quantization |
| Anomaly | Autoencoder |
| Generative | VAE |
**Default**: PCA (preprocessing) + UMAP (visualization) + embeddings (semantics).
## 🔗 Graph
- Parent: [[Feature Engineering|Feature-Engineering]]
- Variants: [[PCA]] · [[t-SNE]] · [[UMAP]] · [[Autoencoder]] · [[VAE]]
- Applications: [[CLIP]] · [[Sentence-Transformers]] · [[Faiss]] · [[Anomaly-Detection]]
- Adjacent: [[Auto-Encoding]] · [[Bag of Words (BoW)]] · [[Bias-vs-Variance]]
## 🤖 LLM usage
**When**: visualization, vector DBs, clustering preprocessing, anomaly detection.
**When not**: data that is already low-dimensional, or when a lossless representation is required.
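## ❌ Anti-patterns
- **PCA without standardization**: features with large scales dominate, giving misleading principal components.
- **Interpreting cluster sizes in t-SNE**: sizes are not preserved.
- **Reading UMAP distances as absolute**: only local structure is meaningful.
- **Overly aggressive reduction**: information loss.
- **Forgetting the train/test split**: fitting PCA on the full dataset leaks test information.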
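
Two of the anti-patterns above (no standardization, leakage across the split) can be avoided together by wrapping the scaler and PCA in a scikit-learn pipeline, so both are fitted on the training split only. A minimal sketch; the digits dataset and the logistic-regression classifier are assumptions chosen for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scaler and PCA statistics come from X_train only; X_test is just transformed.
clf = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),
    LogisticRegression(max_iter=1000),
)
clf.fit(X_train, y_train)

test_accuracy = clf.score(X_test, y_test)
n_kept = clf.named_steps["pca"].n_components_
print(n_kept, round(test_accuracy, 3))
```

The same pipeline object can be passed to `cross_val_score`, which refits the scaler and PCA inside every fold.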
## 🧪 Verification / Duplicates
- Verified (Jolliffe PCA, van der Maaten t-SNE, McInnes UMAP).
- Trust level: A.
- Related: [[Auto-Encoding]] · [[Bag of Words (BoW)]] · [[CLIP]] · [[Sentence-Transformers]] · [[Bias-vs-Variance]].
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: methods + PCA / t-SNE / UMAP / AE / Faiss / quantization code |