---
id: wiki-2026-0508-dimensionality-reduction
title: Dimensionality Reduction
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [PCA, t-SNE, UMAP, autoencoder, curse of dimensionality, feature extraction]
duplicate_of: none
source_trust_level: A
confidence_score: 0.93
verification_status: applied
tags: [dimensionality-reduction, pca, tsne, umap, autoencoder, visualization, manifold-learning]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: scikit-learn / umap-learn / PyTorch
---

# Dimensionality Reduction

## In One Line

> **"The essence of high-dimensional data, in low dimensions."** PCA (linear) → t-SNE / UMAP (nonlinear, for visualization) → Autoencoder / VAE (deep). Modern practice: learned embeddings (CLIP, sentence-transformers) act as implicit dimensionality reduction.

## Core methods

### Linear

#### PCA (Principal Component Analysis)
- Finds the directions of maximum variance.
- Components are mutually orthogonal axes.
- Computed via SVD of the centered data.
- Fast and interpretable.

#### LDA (Linear Discriminant Analysis)
- Maximizes class separation.
- Supervised (requires labels).

#### Factor Analysis
- Latent factors explain the shared variance among the observed variables.

### Nonlinear (manifold)

#### t-SNE (van der Maaten & Hinton, 2008)
- Preserves local neighborhoods.
- Strong for visualization.
- Weak at preserving global structure.
- Stochastic: results vary with the random seed.

#### UMAP (McInnes et al., 2018)
- Widely used as a successor to t-SNE.
- Usually faster, and tends to preserve global structure better.
- A common default for high-dimensional visualization.

#### Isomap
- Preserves geodesic distances along the manifold.

#### LLE (Locally Linear Embedding)
- Preserves each point's linear reconstruction from its nearest neighbors.

### Neural

#### Autoencoder
- Reduces dimension through a bottleneck layer.

#### VAE (Variational AE)
- Probabilistic: the encoder outputs a distribution over the latent space.

#### Self-Supervised Embedding
- CLIP, BERT, sentence-transformers.
- Implicit dimensionality reduction.

### PaCMAP / TriMap (recent)
- Neighbor-embedding methods in the same family as UMAP.
- Designed to preserve global structure better.

### Applications

1. **Visualization** (2D / 3D): t-SNE, UMAP.
2. **Speed** (preprocessing): PCA.
3. **Anomaly detection**: autoencoder.
4. **Feature extraction**: embeddings.
5. **Compression**: quantization + embeddings.
6. **Clustering preprocessing**.
7.
   **RAG** (vector DB): PCA / quantization.

### Curse of dimensionality
- Distances become less meaningful: nearest and farthest neighbors end up almost equally far away.
- Data becomes sparse in high dimensions.
- The number of samples needed grows exponentially with the dimension.

## 💻 Patterns

### PCA (sklearn)

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# X: array of shape (n_samples, n_features)
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=0.95)  # keep enough components for 95% of the variance
X_reduced = pca.fit_transform(X_scaled)

print(f'Original: {X.shape[1]}, reduced: {pca.n_components_}')
print(f'Explained variance: {pca.explained_variance_ratio_.cumsum()}')
```

### t-SNE

```python
from sklearn.manifold import TSNE

tsne = TSNE(
    n_components=2,
    perplexity=30,
    max_iter=1000,  # named n_iter in scikit-learn < 1.5
    random_state=42,
)
X_2d = tsne.fit_transform(X[:5000])  # t-SNE is slow, so embed only the first 5000 rows
```

### UMAP (modern)

```python
import umap

reducer = umap.UMAP(
    n_components=2,
    n_neighbors=15,
    min_dist=0.1,
    metric='cosine',  # works well for embeddings
    random_state=42,
)
X_2d = reducer.fit_transform(X)
```

### Autoencoder (PyTorch)

```python
import torch.nn as nn

class AE(nn.Module):
    def __init__(self, input_dim, latent_dim=32):
        super().__init__()
        # Encoder: input_dim → 128 → 64 → latent_dim (the bottleneck)
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, latent_dim),
        )
        # Decoder mirrors the encoder back to input_dim
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z  # (reconstruction, latent code)

# Use the latent code as the reduced representation
model = AE(input_dim=784)
# ... train ...
_, latent = model(X_test)  # X_test: float tensor of shape (N, 784)
```

### Visualization combo (UMAP + scatter)

```python
import matplotlib.pyplot as plt
import umap

# labels: integer class labels, used only to color the points
X_2d = umap.UMAP().fit_transform(X)

plt.figure(figsize=(10, 8))
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, cmap='tab10', alpha=0.5, s=10)
plt.colorbar()
plt.title('UMAP projection')
plt.show()
```

### PCA for speed (vector DB preprocessing)

```python
import faiss
import numpy as np
from sklearn.decomposition import PCA

# 768-dim embeddings (e.g. from a BERT-base-sized encoder) → 256 dims
embeddings = get_embeddings(documents)  # placeholder: returns an (n, 768) array
pca = PCA(n_components=256)
reduced = np.ascontiguousarray(pca.fit_transform(embeddings), dtype='float32')

# Faiss inner-product index; normalize so inner product behaves as cosine similarity
faiss.normalize_L2(reduced)
index = faiss.IndexFlatIP(256)
index.add(reduced)
```

### Quantization (vector DB modern)

```python
import faiss

dim = 768
nlist, m, nbits = 100, 8, 8  # 100 coarse cells, 8 sub-quantizers, 8 bits each

quantizer = faiss.IndexFlatL2(dim)  # coarse quantizer; IndexIVFPQ uses L2 by default
index = faiss.IndexIVFPQ(quantizer, dim, nlist, m, nbits)  # pass arguments positionally
# Each 768-dim float32 vector (3072 bytes) is stored as an 8-byte PQ code: roughly 384× smaller.

index.train(embeddings_np)  # embeddings_np: float32 array of shape (n, 768)
index.add(embeddings_np)
```

### Sentence-transformers / CLIP-style (implicit reduction)

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')  # 384-dim output
embeddings = model.encode(sentences)
# Each variable-length sentence → one fixed 384-dim vector.
```

### Reconstruction error (anomaly)

```python
import torch

def detect_anomaly(model, X, threshold):
    """Flag rows whose mean squared reconstruction error exceeds threshold."""
    model.eval()
    with torch.no_grad():
        X_recon, _ = model(X)
    error = ((X_recon - X) ** 2).mean(dim=1)
    return error > threshold
```

### Choose dimension (elbow / cumvar)

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

pca = PCA().fit(X_scaled)
cumvar = np.cumsum(pca.explained_variance_ratio_)

plt.plot(cumvar)
plt.xlabel('Component')
plt.ylabel('Cumulative variance')
plt.axhline(0.95, color='r', linestyle='--')
plt.show()

# Smallest number of components whose cumulative variance reaches 95%
n_components = np.argmax(cumvar >= 0.95) + 1
```

### Manifold visualization comparison

```python
import matplotlib.pyplot as plt
import umap
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def viz_compare(X, labels):
    """Plot PCA, t-SNE and UMAP 2D projections of X side by side."""
    fig, axes = plt.subplots(1, 3, figsize=(20, 6))
    for ax, (name, reducer) in zip(axes, [
        ('PCA', PCA(n_components=2)),
        ('t-SNE', TSNE(n_components=2, random_state=42)),
        ('UMAP', umap.UMAP(n_components=2, random_state=42)),
    ]):
        proj = reducer.fit_transform(X)
        ax.scatter(proj[:, 0], proj[:, 1], c=labels, cmap='tab10', s=5)
        ax.set_title(name)
    plt.show()
```

## Decision criteria

| Situation | Method |
|---|---|
| Speed (preprocessing) | PCA |
| Visualization | UMAP |
| Preserve cluster structure | UMAP |
| Interpret variance | PCA |
| Class-aware | LDA |
| Text → embedding | Sentence-transformer |
| Image → embedding | CLIP |
| Vector DB compression | PCA / PQ quantization |
| Anomaly detection | Autoencoder |
| Generative | VAE |

**Default**: PCA (preprocessing) + UMAP (visualization) + embeddings (semantics).

## 🔗 Graph

- Parent: [[Feature Engineering|Feature-Engineering]]
- Variants: [[PCA]] · [[t-SNE]] · [[UMAP]] · [[Autoencoder]] · [[VAE]]
- Applications: [[CLIP]] · [[Sentence-Transformers]] · [[Faiss]] · [[Anomaly-Detection]]
- Adjacent: [[Auto-Encoding]] · [[Bag of Words (BoW)]] · [[Bias-vs-Variance]]

## 🤖 LLM usage

**When**: visualization, vector DBs, clustering preprocessing, anomaly detection.

**When not**: the data is already low-dimensional, or a lossless representation is required.

## ❌ Anti-patterns

- **PCA without standardization**: features with large scales dominate, giving misleading principal components.
- **Interpreting t-SNE cluster sizes**: cluster sizes are not preserved.
- **Interpreting UMAP distances as absolute**: only local structure is reliable.
- **Too aggressive reduction**: information is lost.
- **Forgetting the train-test split**: fitting PCA on the full dataset leaks test information; fit on the training set only.

## 🧪 Verification / Duplicates

- Verified (Jolliffe PCA, van der Maaten t-SNE, McInnes UMAP).
- Trust level: A.
- Related: [[Auto-Encoding]] · [[Bag of Words (BoW)]] · [[CLIP]] · [[Sentence-Transformers]] · [[Bias-vs-Variance]].

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: methods + PCA / t-SNE / UMAP / AE / Faiss / quantization code |
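
As a companion to the train-test split anti-pattern listed above, here is a minimal sketch of one way to avoid leakage: split first, then fit the scaler and PCA together on the training rows only. The synthetic data, the 5-component target and the variable names are assumptions made for illustration, not part of the original note.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 200 samples, 20 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))

# Split first, so the test rows never influence the fitted statistics.
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# Scaling and PCA are fitted together, on the training rows only.
reducer = make_pipeline(StandardScaler(), PCA(n_components=5))
Z_train = reducer.fit_transform(X_train)

# The test rows are only transformed with the already-fitted parameters.
Z_test = reducer.transform(X_test)

print(Z_train.shape, Z_test.shape)
```

Wrapping both steps in one pipeline also means cross-validation utilities refit them per fold, which is usually the simplest way to keep the reduction step leak-free.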