---
id: wiki-2026-0508-dimensionality-reduction
title: Dimensionality Reduction
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [PCA, t-SNE, UMAP, autoencoder, curse of dimensionality, feature extraction]
duplicate_of: none
source_trust_level: A
confidence_score: 0.93
verification_status: applied
tags: [dimensionality-reduction, pca, tsne, umap, autoencoder, visualization, manifold-learning]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: scikit-learn / umap-learn / PyTorch
---
# Dimensionality Reduction
## In one line
> **"Capture the essence of high-dimensional data in a low-dimensional representation."** PCA (linear) → t-SNE / UMAP (nonlinear, mainly for visualization) → Autoencoder / VAE (deep). Modern practice: embeddings (CLIP, sentence-transformers) act as implicit dimensionality reduction.
## Core methods
### Linear
#### PCA (Principal Component Analysis)
- Finds the directions of maximum variance.
- The resulting axes (principal components) are mutually orthogonal.
- Computed via SVD of the centered data.
- Fast and interpretable.
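
The bullets above can be sketched in a few lines of NumPy. This is a minimal illustration, not a replacement for `sklearn.decomposition.PCA`; the helper name `pca_svd` and the synthetic data are assumptions made for the example.

```python
import numpy as np

def pca_svd(X, n_components):
    """Project X (n_samples, n_features) onto its top principal components."""
    Xc = X - X.mean(axis=0)  # PCA operates on centered data
    # Thin SVD: Xc = U @ diag(S) @ Vt; the rows of Vt are the principal axes
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    explained_variance = S ** 2 / (X.shape[0] - 1)
    ratio = explained_variance / explained_variance.sum()
    return Xc @ components.T, components, ratio[:n_components]

rng = np.random.default_rng(0)
# Synthetic data (assumption): three features with very different spreads
X = rng.normal(size=(500, 3)) * np.array([5.0, 1.0, 0.1])
Z, components, ratio = pca_svd(X, n_components=2)
print(Z.shape, ratio)
```

The rows of `components` are orthonormal, and `ratio` is sorted in decreasing order because singular values are returned in decreasing order.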
#### LDA (Linear Discriminant Analysis)
- Maximizes class separation.
- Supervised (requires class labels).
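
A short sketch of LDA used as a supervised reducer with scikit-learn. The Iris dataset is chosen here only for illustration; note that LDA can produce at most `n_classes - 1` components.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features, 3 classes

# With 3 classes, LDA yields at most 2 discriminant axes
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)  # labels are required, unlike PCA
print(X_lda.shape)
```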
#### Factor Analysis
- Models the observed variables as driven by a small number of latent factors that explain their shared variance.
### Nonlinear (manifold)
#### t-SNE (van der Maaten & Hinton 2008)
- Preserves local neighborhoods.
- Strong for visualization.
- Weak at preserving global structure.
- Stochastic: results vary between runs unless the seed is fixed.
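
One way to check the "preserves local neighborhoods" claim on a given dataset is scikit-learn's `trustworthiness` score, which is close to 1.0 when each point's nearest neighbors in the embedding were also neighbors in the original space. The sketch below assumes the small Iris dataset so that t-SNE runs quickly; the choice of `n_neighbors=10` is arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE, trustworthiness

X, _ = load_iris(return_X_y=True)

# Fixing random_state makes the otherwise stochastic embedding repeatable
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

# Score in [0, 1]; higher means local neighborhoods are better preserved
score = trustworthiness(X, X_2d, n_neighbors=10)
print(f"trustworthiness: {score:.3f}")
```

The score says nothing about global structure, so a high value does not make distances between clusters meaningful.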
#### UMAP (McInnes et al. 2018)
- Widely used as a successor to t-SNE.
- Faster, and tends to preserve global structure better.
- A common default for high-dimensional visualization.
#### Isomap
- Preserves geodesic distances along the manifold.
#### LLE (Locally Linear Embedding)
### Neural
#### Autoencoder
- Reduces dimensionality through a bottleneck layer.
#### VAE (Variational AE)
- Probabilistic: the latent code is a distribution rather than a point.
#### Self-Supervised Embedding
- CLIP, BERT, sentence-transformers.
- Acts as implicit dimensionality reduction.
### PaCMAP / TriMap (recent)
- Neighbor-based embedding methods in the same family as UMAP.
- Aim for better preservation of global structure.
### Applications
1. **Visualization** (2D / 3D): t-SNE, UMAP.
2. **Speed** (preprocessing): PCA.
3. **Anomaly detection**: autoencoder.
4. **Feature extraction**: embeddings.
5. **Compression**: quantization + embeddings.
6. **Clustering preprocessing**.
7. **RAG** (vector DB): PCA / quantization.
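### Curse of dimensionality
- Distances lose meaning: nearest and farthest neighbors end up almost equally far away.
- Data becomes sparse in high dimensions.
- The number of samples needed to cover the space grows exponentially with dimension.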
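
The first point can be observed numerically. The sketch below assumes uniformly random points in the unit hypercube and measures the relative contrast `(d_max - d_min) / d_min` between the farthest and nearest point from a random query; the function name and sample sizes are illustrative choices.

```python
import numpy as np

def relative_contrast(dim, n_points=1000, seed=0):
    """(farthest - nearest) / nearest distance from a random query point."""
    rng = np.random.default_rng(seed)
    X = rng.random((n_points, dim))  # uniform points in [0, 1]^dim
    q = rng.random(dim)              # random query point
    d = np.linalg.norm(X - q, axis=1)
    return (d.max() - d.min()) / d.min()

contrast = {dim: relative_contrast(dim) for dim in (2, 10, 100, 1000)}
for dim, value in contrast.items():
    print(f"dim={dim:5d}  relative contrast={value:.3f}")
```

The contrast shrinks as the dimension grows, which is why nearest-neighbor search on raw high-dimensional features tends to be less informative.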
## 💻 Patterns
### PCA (sklearn)
```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)  # keep enough components to explain 95% of the variance
X_reduced = pca.fit_transform(X_scaled)
print(f'Original: {X.shape[1]}, reduced: {pca.n_components_}')
print(f'Explained variance: {pca.explained_variance_ratio_.cumsum()}')
```
### t-SNE
```python
from sklearn.manifold import TSNE
tsne = TSNE(
    n_components=2,
    perplexity=30,
    max_iter=1000,  # this parameter was called n_iter before scikit-learn 1.5
    random_state=42,
)
X_2d = tsne.fit_transform(X[:5000])  # t-SNE is slow, so embed a subsample
```
### UMAP (modern)
```python
import umap
reducer = umap.UMAP(
    n_components=2,
    n_neighbors=15,
    min_dist=0.1,
    metric='cosine',  # works well for embedding vectors
    random_state=42,
)
X_2d = reducer.fit_transform(X)
```
### Autoencoder (PyTorch)
```python
import torch
import torch.nn as nn


class AE(nn.Module):
    def __init__(self, input_dim, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)        # bottleneck (latent) representation
        return self.decoder(z), z  # (reconstruction, latent)


# Use the latent code as the reduced representation
model = AE(input_dim=784)
# ... train ...
with torch.no_grad():
    _, latent = model(X_test)  # X_test: float tensor of shape (n, 784)
```
### Visualization combo (UMAP + scatter)
```python
import matplotlib.pyplot as plt
import umap
# `labels`: one integer class label per row of X
X_2d = umap.UMAP().fit_transform(X)
plt.figure(figsize=(10, 8))
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, cmap='tab10', alpha=0.5, s=10)
plt.colorbar()
plt.title('UMAP projection')
plt.show()
```
### PCA for speed (vector DB preprocessing)
```python
from sklearn.decomposition import PCA
import faiss
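# Reduce 768-dim embeddings (e.g. from OpenAI) to 256 dims
embeddings = get_embeddings(documents)  # placeholder helper: returns an (n, 768) array
pca = PCA(n_components=256)
reduced = pca.fit_transform(embeddings).astype('float32')
# Faiss inner-product index; L2-normalize `reduced` first if cosine similarity is intended
index = faiss.IndexFlatIP(256)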
index.add(reduced)
```
### Quantization (vector DB modern)
```python
import faiss
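dim = 768
nlist, m, nbits = 100, 8, 8  # coarse clusters, PQ sub-vectors, bits per sub-vector code
# IndexIVFPQ defaults to the L2 metric, so the coarse quantizer uses L2 as well
quantizer = faiss.IndexFlatL2(dim)
# Faiss's Python bindings take these arguments positionally
index = faiss.IndexIVFPQ(quantizer, dim, nlist, m, nbits)
# Each 768-dim float32 vector (3072 bytes) is stored as an 8-byte PQ code: about 384x compression.
index.train(embeddings_np)  # float32 array of shape (n, 768)
index.add(embeddings_np)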
```
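
The storage arithmetic behind product quantization can be worked through directly for the parameters above (`dim=768`, `m=8`, `nbits=8`). The helper `pq_compression` is a hypothetical name for this example, and the figure ignores per-vector IDs and the codebooks themselves.

```python
def pq_compression(dim, m, nbits, bytes_per_float=4):
    """Return (raw bytes, PQ code bytes, compression ratio) per vector."""
    if dim % m != 0:
        raise ValueError("dim must be divisible by m")
    raw_bytes = dim * bytes_per_float  # uncompressed float32 vector
    code_bytes = m * nbits / 8         # one nbits-wide code per sub-vector
    return raw_bytes, code_bytes, raw_bytes / code_bytes

raw, code, ratio = pq_compression(dim=768, m=8, nbits=8)
print(raw, code, ratio)  # 3072 8.0 384.0
```

Each of the 8 sub-vectors covers 96 dimensions and is replaced by a 1-byte index into a 256-entry codebook, which is where the 8 bytes per vector come from.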
### Word2Vec / CLIP-style (implicit reduction)
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')  # 384-dim output
embeddings = model.encode(sentences)
# A sentence of arbitrary length is mapped to a fixed 384-dim vector.
```
### Reconstruction error (anomaly)
```python
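import torch

def detect_anomaly(model, X, threshold):
    """Flag rows of X whose reconstruction error exceeds `threshold`."""
    with torch.no_grad():
        X_recon, _ = model(X)  # model returns (reconstruction, latent)
    error = ((X_recon - X) ** 2).mean(dim=1)  # per-sample MSE
    return error > threshold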
```
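
The `threshold` above has to come from somewhere. One common heuristic, sketched here as an assumption rather than a fixed rule, is to take a high quantile of the reconstruction errors measured on data believed to be normal. The gamma-distributed errors below are synthetic stand-ins for real per-sample errors.

```python
import numpy as np

def fit_threshold(train_errors, quantile=0.99):
    """Pick the error level exceeded by only (1 - quantile) of normal samples."""
    return float(np.quantile(train_errors, quantile))

rng = np.random.default_rng(0)
# Synthetic stand-in for per-sample reconstruction errors on normal data
train_errors = rng.gamma(shape=2.0, scale=0.01, size=10_000)
threshold = fit_threshold(train_errors)

new_errors = np.array([0.01, 0.02, 0.5])  # the last value is far outside the normal range
flags = new_errors > threshold
print(threshold, flags)
```

The quantile directly sets the expected false-positive rate on normal data, so 0.99 means roughly 1% of normal samples will be flagged.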
### Choose dimension (elbow / cumvar)
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
pca = PCA().fit(X_scaled)
cumvar = np.cumsum(pca.explained_variance_ratio_)
plt.plot(cumvar)
plt.xlabel('Component')
plt.ylabel('Cumulative variance')
plt.axhline(0.95, color='r', linestyle='--')
plt.show()
n_components = np.argmax(cumvar >= 0.95) + 1
```
### Manifold visualization comparison
```python
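import matplotlib.pyplot as plt
import umap
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def viz_compare(X, labels):
    """Plot PCA, t-SNE, and UMAP projections of X side by side."""
    fig, axes = plt.subplots(1, 3, figsize=(20, 6))
    reducers = [
        ('PCA', PCA(n_components=2)),
        ('t-SNE', TSNE(n_components=2, random_state=42)),
        ('UMAP', umap.UMAP(n_components=2, random_state=42)),
    ]
    for ax, (name, reducer) in zip(axes, reducers):
        proj = reducer.fit_transform(X)
        ax.scatter(proj[:, 0], proj[:, 1], c=labels, cmap='tab10', s=5)
        ax.set_title(name)
    plt.show()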
```
## Decision criteria
| Situation | Method |
|---|---|
| Speed (preprocess) | PCA |
| Visualization | UMAP |
| Cluster preserve | UMAP |
| Variance interpret | PCA |
| Class-aware | LDA |
| Text → embedding | Sentence-transformer |
| Image → embedding | CLIP |
| Vector DB compress | PCA / PQ quantization |
| Anomaly | Autoencoder |
| Generative | VAE |
**Default**: PCA (preprocessing) + UMAP (visualization) + embeddings (semantics).
## 🔗 Graph
- Parent: [[Feature Engineering|Feature-Engineering]]
- Variants: [[PCA]] · [[t-SNE]] · [[UMAP]] · [[Autoencoder]] · [[VAE]]
- Applications: [[CLIP]] · [[Sentence-Transformers]] · [[Faiss]] · [[Anomaly-Detection]]
- Adjacent: [[Auto-Encoding]] · [[Bag of Words (BoW)]] · [[Bias vs Variance Trade-off]]
## 🤖 LLM usage
**When to use**: visualization, vector DBs, clustering preprocessing, anomaly detection.
**When not to use**: data that is already low-dimensional; cases that require a lossless representation.
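## ❌ Anti-patterns
- **PCA without standardization**: features with larger scales dominate, giving misleading principal components.
- **Interpreting cluster sizes in t-SNE**: cluster sizes are not preserved.
- **Interpreting UMAP distances as absolute**: only local structure is meaningful.
- **Too aggressive reduction**: information loss.
- **Forgetting the train/test split**: fitting PCA on the full dataset leaks test information; fit on the training set only.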
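
The first and last items can be avoided together by putting scaling and PCA in one scikit-learn pipeline that is fit on the training split only. This is a minimal sketch; the digits dataset and `n_components=20` are illustrative assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)  # 1797 samples, 64 features
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scaler statistics and principal axes are learned from the training split only
reducer = make_pipeline(StandardScaler(), PCA(n_components=20))
Z_train = reducer.fit_transform(X_train)
Z_test = reducer.transform(X_test)  # test data is only transformed, never fit
print(Z_train.shape, Z_test.shape)
```

Calling `fit_transform` on the full dataset before splitting would let test-set statistics influence the projection, which is the leakage the last bullet warns about.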
## 🧪 Verification / duplicates
- Verified (Jolliffe PCA, van der Maaten t-SNE, McInnes UMAP).
- Trust level: A.
- Related: [[Auto-Encoding]] · [[Bag of Words (BoW)]] · [[CLIP]] · [[Sentence-Transformers]] · [[Bias vs Variance Trade-off]].
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: methods + PCA / t-SNE / UMAP / AE / Faiss / quantization code |