---
id: wiki-2026-0508-k-means-clustering-foundations
title: K-Means Clustering
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [k-means, clustering, k-means++, mini-batch, elbow method, silhouette]
duplicate_of: none
source_trust_level: A
confidence_score: 0.97
verification_status: applied
tags: [machine-learning, clustering, k-means, unsupervised, lloyd]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: scikit-learn / FAISS
---

# K-Means Clustering

## In one line

> **"K centroids that minimize within-cluster variance."** Lloyd 1957. Simple, fast, scalable. Limitations: assumes roughly spherical clusters, K must be specified up front, and it converges only to a local optimum. Modern practice: k-means++, mini-batch, and FAISS-based k-means for billion-scale data.

## Core

### Algorithm (Lloyd)

1. Initialize K centroids.
2. **Assign**: each point goes to its closest centroid.
3. **Update**: each centroid becomes the mean of the points assigned to it.
4. Repeat steps 2-3 until convergence.

### Initialization

- **Random**: the worst option.
- **k-means++** (Arthur & Vassilvitskii 2007): spreads the initial centroids apart.
- **Forgy**: picks K random data points as centroids.

### Choosing K

- **Elbow** method.
- **Silhouette** score.
- **Gap statistic**.
- **BIC / AIC** (Gaussian Mixture).

### Applications

1. **Customer segmentation**.
2. **Image quantization** (color palette).
3. **Anomaly detection** (distance from centroid).
4. **Document clustering**.
5. **Vector index** (FAISS IVF).
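The vector-index application in the list above uses k-means as a coarse quantizer: vectors are bucketed by their nearest centroid, and a query scans only the closest buckets. Below is a minimal pure-Python sketch of that idea, not FAISS's actual implementation. The names `nearest`, `build_ivf`, and `ivf_search` are hypothetical, `nprobe` is borrowed here only as an illustrative parameter name, and the centroids are assumed to come from an earlier k-means run.

```python
import math

def nearest(centroids, v, n=1):
    """Return the indices of the n centroids closest to v (Euclidean)."""
    order = sorted(range(len(centroids)), key=lambda i: math.dist(centroids[i], v))
    return order[:n]

def build_ivf(vectors, centroids):
    """Inverted lists: centroid index -> ids of the vectors assigned to it."""
    lists = {i: [] for i in range(len(centroids))}
    for vid, v in enumerate(vectors):
        lists[nearest(centroids, v)[0]].append(vid)
    return lists

def ivf_search(query, vectors, centroids, lists, nprobe=1, top_k=1):
    """Scan only the nprobe closest cells instead of every vector."""
    candidates = [vid for c in nearest(centroids, query, nprobe) for vid in lists[c]]
    candidates.sort(key=lambda vid: math.dist(vectors[vid], query))
    return candidates[:top_k]

# Toy data: two well-separated groups; centroids are assumed (hand-picked here).
vectors = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]
centroids = [(0.1, 0.05), (5.05, 4.95)]
lists = build_ivf(vectors, centroids)
result = ivf_search((4.8, 5.2), vectors, centroids, lists, nprobe=1, top_k=2)
```

Raising `nprobe` scans more cells, which trades speed for recall when a query's true neighbors sit in an adjacent cell.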
## 💻 Patterns

### sklearn k-means

```python
from sklearn.cluster import KMeans

km = KMeans(n_clusters=5, init='k-means++', n_init=10, random_state=0).fit(X)
labels = km.labels_
centroids = km.cluster_centers_
```

### Mini-batch (faster)

```python
from sklearn.cluster import MiniBatchKMeans

km = MiniBatchKMeans(n_clusters=100, batch_size=1024).fit(X)
```

### Elbow method

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

inertias = []
ks = range(1, 15)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances

plt.plot(ks, inertias, 'o-')
plt.xlabel('K'); plt.ylabel('Inertia')
# The elbow (where the curve stops dropping sharply) suggests a reasonable K
```

### Silhouette

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in [3, 4, 5, 6, 7]:
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
    print(k, silhouette_score(X, labels))
# Closer to 1 = better-separated clusters
```

### k-means++ init (manual)

```python
import numpy as np

def kmeans_pp_init(X, k):
    # First center: a uniformly random data point
    centers = [X[np.random.randint(len(X))]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen center
        d2 = np.array([min(np.linalg.norm(x - c) ** 2 for c in centers) for x in X])
        probs = d2 / d2.sum()
        # Sample the next center with probability proportional to d2
        idx = np.random.choice(len(X), p=probs)
        centers.append(X[idx])
    return np.array(centers)
```

### Custom Lloyd (educational)

```python
import numpy as np

def kmeans_lloyd(X, k, max_iter=100):
    # Forgy init: k distinct data points as starting centroids
    centers = X[np.random.choice(len(X), k, replace=False)]
    for _ in range(max_iter):
        # Assign: index of the nearest centroid for each point
        dists = np.linalg.norm(X[:, None] - centers, axis=2)
        labels = dists.argmin(axis=1)
        # Update: mean of assigned points; keep the old centroid if a cluster is empty
        new_centers = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
            for i in range(k)
        ])
        if np.allclose(centers, new_centers):
            break
        centers = new_centers
    return labels, centers
```

### FAISS k-means (large-scale)

```python
import faiss
import numpy as np

Xf = np.ascontiguousarray(X, dtype='float32')  # FAISS expects contiguous float32
d = Xf.shape[1]
kmeans = faiss.Kmeans(d, k=100, niter=20, gpu=True)  # gpu=True needs a GPU build of FAISS
kmeans.train(Xf)
centroids = kmeans.centroids
_, labels = kmeans.index.search(Xf, 1)  # nearest centroid per point, shape (n, 1)
labels = labels.ravel()
```

### Image color quantization

```python
from sklearn.cluster import KMeans

def quantize_image(img, k=8):
    # Treat every pixel as a 3-D point (assumes an H x W x 3 uint8 image)
    pixels = img.reshape(-1, 3)
    km = KMeans(n_clusters=k, n_init=3).fit(pixels)
    # Replace each pixel with its cluster's centroid color
    quantized = km.cluster_centers_[km.labels_]
    return quantized.reshape(img.shape).astype('uint8')
```

### Anomaly via distance

```python
import numpy as np

def detect_anomaly(X, km, threshold=None):
    # Distance from each point to its assigned centroid
    dists = np.linalg.norm(X - km.cluster_centers_[km.predict(X)], axis=1)
    if threshold is None:
        # Default: flag the farthest 1% of points
        threshold = np.percentile(dists, 99)
    return dists > threshold
```

### Spherical k-means (text, cosine)

```python
import numpy as np
from sklearn.cluster import KMeans

def spherical_kmeans(X, k, max_iter=100):
    """Normalize rows to unit length, then run k-means; approximates cosine-based clustering."""
    X_norm = X / np.linalg.norm(X, axis=1, keepdims=True)
    return KMeans(n_clusters=k, max_iter=max_iter).fit(X_norm)
```

### Gaussian Mixture (alternative)

```python
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=5, covariance_type='full').fit(X)
labels = gmm.predict(X)
# vs. k-means: ellipsoidal clusters + soft assignment
```

### Scaling (always)

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)
km = KMeans(n_clusters=5).fit(X_scaled)
```

### Dimensionality reduction first (high-D)

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X_reduced = PCA(n_components=50).fit_transform(X)
km = KMeans(n_clusters=5).fit(X_reduced)
```

### Initialize from labels (semi-supervised)

```python
import numpy as np
from sklearn.cluster import KMeans

# One starting centroid per known class: the mean of its labeled points
init_centers = np.array([X[y == c].mean(axis=0) for c in np.unique(y)])
km = KMeans(n_clusters=len(init_centers), init=init_centers, n_init=1).fit(X)
```

## Decision criteria

| Situation | Approach |
|---|---|
| Small N | sklearn |
| Large N | MiniBatch |
| Massive N | FAISS |
| Image | Color quantize |
| Text | Spherical (normalized) |
| Non-spherical | GMM / DBSCAN |

**Default**: scale features + k-means++ + multiple `n_init` runs + elbow / silhouette to choose K. For large data, use MiniBatch / FAISS.

## 🔗 Graph

- Parent: [[Clustering]]
- Variant: [[k-means++]]
- Adjacent: [[K-Nearest-Neighbors-K-NN]]

## 🤖 LLM usage

**When**: segmentation, EDA, vector indexing.
**When not**: non-spherical or density-varying clusters (use DBSCAN).

## ❌ Anti-patterns

- **No scaling**: features with large ranges dominate the distance.
- **K=2 default**: wrong; choose K with the elbow method or silhouette score.
- **Random init**: use k-means++ instead.
- **K-means on non-spherical clusters**: wrong; use GMM / DBSCAN.

## 🧪 Verification / Duplicates

- Verified (Lloyd 1957, Arthur k-means++ 2007, FAISS docs).
- Trust level: A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: Lloyd / k-means++ / MiniBatch + elbow / silhouette / FAISS / quantize code |