---
id: wiki-2026-0508-k-means-clustering-foundations
title: K-Means Clustering
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [k-means, clustering, k-means++, mini-batch, elbow method, silhouette]
duplicate_of: none
source_trust_level: A
confidence_score: 0.97
verification_status: applied
tags: [machine-learning, clustering, k-means, unsupervised, lloyd]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: scikit-learn / FAISS
---

# K-Means Clustering

## One-line summary

> **"Choose K centroids that minimize within-cluster variance."** Lloyd, 1957. Simple, fast, and scalable. Limitations: it assumes roughly spherical clusters, K must be specified in advance, and it converges only to a local optimum. Modern variants: k-means++ initialization, mini-batch updates, and FAISS-based training for billion-scale data.

## Core

### Algorithm (Lloyd)

1. Initialize K centroids.
2. **Assign**: assign each point to its closest centroid.
3. **Update**: set each centroid to the mean of the points assigned to it.
4. Repeat steps 2-3 until the assignments stop changing (convergence).

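The assign and update steps above can be read as alternating minimization of a single objective. The symbols below ($x_i$ for points, $\mu_k$ for centroids, $C_k$ for clusters) are notation assumed here for illustration, not taken from this note:

```latex
\begin{aligned}
J &= \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2
  && \text{(within-cluster sum of squares)} \\
\text{Assign:}\quad C_k &\leftarrow \{\, x_i : k = \arg\min_{j} \lVert x_i - \mu_j \rVert^2 \,\} \\
\text{Update:}\quad \mu_k &\leftarrow \frac{1}{\lvert C_k \rvert} \sum_{x_i \in C_k} x_i
\end{aligned}
```

Neither step can increase $J$, and there are only finitely many partitions, so the loop terminates, though only at a local optimum. scikit-learn reports the final value of $J$ as `inertia_`.
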
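### Init

- **Random**: typically the weakest option; results vary widely from run to run.
- **k-means++** (Arthur & Vassilvitskii, 2007): spreads the initial centroids apart.
- **Forgy**: picks K random data points as the initial centroids.
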
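To see how much the initialization matters, a minimal sketch can compare single-run inertia under the two `init` options scikit-learn offers. The synthetic blobs, the run count, and the helper name `single_run_inertias` are assumptions made for illustration. Note that scikit-learn's `init='random'` picks K data rows at random, which corresponds to the Forgy scheme.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data: 8 blobs in 2-D (illustrative assumption, not from the note).
X, _ = make_blobs(n_samples=2000, centers=8, cluster_std=0.6, random_state=0)

def single_run_inertias(init, n_runs=20):
    """Inertia of n_runs independent fits, each with one initialization."""
    return np.array([
        KMeans(n_clusters=8, init=init, n_init=1, random_state=seed).fit(X).inertia_
        for seed in range(n_runs)
    ])

random_inertias = single_run_inertias("random")
kpp_inertias = single_run_inertias("k-means++")

# Lower is better; the worst case shows how badly one unlucky start can end.
print("random    mean / worst:", random_inertias.mean(), random_inertias.max())
print("k-means++ mean / worst:", kpp_inertias.mean(), kpp_inertias.max())
```

Setting `n_init=1` is deliberate here: it exposes the spread of individual starts, which the usual `n_init=10` would hide by keeping only the best run.
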
### K selection

- **Elbow** method.
- **Silhouette** score.
- **Gap statistic**.
- **BIC / AIC** (Gaussian Mixture).

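Of the criteria listed above, the gap statistic is the least commonly packaged, so a rough sketch may help. It compares log-inertia on the data with log-inertia on uniform reference data drawn over the same bounding box. The function name `gap_statistic`, the reference count, and the synthetic data are assumptions for illustration, not a reference implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def gap_statistic(X, k_values, n_refs=10, seed=0):
    """Return (gaps, s_k) per k, using uniform references over X's bounding box."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps, sks = [], []
    for k in k_values:
        # log of the within-cluster sum of squares on the real data
        log_w = np.log(KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_)
        # same quantity on n_refs structure-free reference datasets
        ref_log_w = np.array([
            np.log(
                KMeans(n_clusters=k, n_init=3, random_state=seed)
                .fit(rng.uniform(lo, hi, size=X.shape))
                .inertia_
            )
            for _ in range(n_refs)
        ])
        gaps.append(ref_log_w.mean() - log_w)
        sks.append(ref_log_w.std() * np.sqrt(1 + 1 / n_refs))
    return np.array(gaps), np.array(sks)

# Illustrative data: 4 blobs.
X, _ = make_blobs(n_samples=600, centers=4, cluster_std=0.5, random_state=0)
ks = list(range(1, 9))
gaps, sks = gap_statistic(X, ks)

# Selection rule: the smallest k with Gap(k) >= Gap(k+1) - s_{k+1}.
candidates = [k for i, k in enumerate(ks[:-1]) if gaps[i] >= gaps[i + 1] - sks[i + 1]]
best_k = candidates[0] if candidates else ks[-1]
print("gap-selected K:", best_k)
```

The selected K depends on the reference draws and on how well separated the clusters are, so it is best treated as one vote alongside the elbow and silhouette checks.
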
### Applications

1. **Customer segmentation**.
2. **Image quantization** (color palette).
3. **Anomaly detection** (distance from centroid).
4. **Document clustering**.
5. **Vector index** (FAISS IVF).

## 💻 Patterns

### sklearn k-means

```python
from sklearn.cluster import KMeans

# X: array of shape (n_samples, n_features)
km = KMeans(n_clusters=5, init='k-means++', n_init=10, random_state=0).fit(X)
labels = km.labels_               # cluster index per sample
centroids = km.cluster_centers_   # shape (5, n_features)
```

### Mini-batch (faster)

```python
from sklearn.cluster import MiniBatchKMeans

# Updates centroids from small random batches: faster, slightly less exact.
km = MiniBatchKMeans(n_clusters=100, batch_size=1024).fit(X)
```

### Elbow method

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

inertias = []
ks = range(1, 15)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10).fit(X)
    inertias.append(km.inertia_)
plt.plot(ks, inertias, 'o-')
plt.xlabel('K'); plt.ylabel('Inertia')
plt.show()
# The elbow (where the curve stops dropping sharply) is a candidate for K.
```

### Silhouette

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in [3, 4, 5, 6, 7]:
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
    print(k, silhouette_score(X, labels))
# Scores range from -1 to 1; closer to 1 means better-separated clusters.
```

### k-means++ init (manual)

```python
import numpy as np

def kmeans_pp_init(X, k):
    # First center: one data point chosen uniformly at random.
    centers = [X[np.random.randint(len(X))]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen center.
        d2 = np.array([min(np.linalg.norm(x - c) ** 2 for c in centers) for x in X])
        probs = d2 / d2.sum()
        # Sample the next center with probability proportional to d2.
        idx = np.random.choice(len(X), p=probs)
        centers.append(X[idx])
    return np.array(centers)
```

### Custom Lloyd (educational)

```python
import numpy as np

def kmeans_lloyd(X, k, max_iter=100):
    # Forgy init: k distinct data points as the initial centers.
    centers = X[np.random.choice(len(X), k, replace=False)]
    for _ in range(max_iter):
        # Assign: index of the nearest center for each point.
        dists = np.linalg.norm(X[:, None] - centers, axis=2)
        labels = dists.argmin(axis=1)
        # Update: mean of assigned points; an empty cluster keeps its old center.
        new_centers = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
            for i in range(k)
        ])
        if np.allclose(centers, new_centers):
            break
        centers = new_centers
    return labels, centers
```

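### FAISS k-means (large-scale)

```python
import faiss
import numpy as np

# FAISS expects C-contiguous float32 input.
xb = np.ascontiguousarray(X, dtype='float32')
d = xb.shape[1]
# gpu=True needs a GPU build of FAISS; drop it on CPU-only installs.
kmeans = faiss.Kmeans(d, k=100, niter=20, gpu=True)
kmeans.train(xb)
centroids = kmeans.centroids
_, labels = kmeans.index.search(xb, 1)  # nearest centroid per point, shape (n, 1)
labels = labels.ravel()
```
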
### Image color quantization

```python
from sklearn.cluster import KMeans

def quantize_image(img, k=8):
    # img: uint8 array of shape (H, W, 3); each pixel becomes one RGB sample.
    pixels = img.reshape(-1, 3)
    km = KMeans(n_clusters=k, n_init=3).fit(pixels)
    # Replace every pixel with the centroid color of its cluster.
    quantized = km.cluster_centers_[km.labels_]
    return quantized.reshape(img.shape).astype('uint8')
```

### Anomaly via distance

```python
import numpy as np

def detect_anomaly(X, km, threshold=None):
    # Distance from each point to the centroid of its assigned cluster.
    dists = np.linalg.norm(X - km.cluster_centers_[km.predict(X)], axis=1)
    if threshold is None:
        # Default: flag the farthest 1% of points.
        threshold = np.percentile(dists, 99)
    return dists > threshold
```

### Spherical k-means (text, cosine)

```python
import numpy as np
from sklearn.cluster import KMeans

def spherical_kmeans(X, k, max_iter=100):
    """L2-normalize rows, then run k-means: an approximation of cosine-based clustering."""
    X_norm = X / np.linalg.norm(X, axis=1, keepdims=True)
    return KMeans(n_clusters=k, max_iter=max_iter).fit(X_norm)
```

### Gaussian Mixture (alternative)

```python
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=5, covariance_type='full').fit(X)
labels = gmm.predict(X)
# Compared with k-means: ellipsoidal clusters and soft assignments (see gmm.predict_proba).
```

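The BIC criterion mentioned under K selection applies directly to a fitted mixture model. The sketch below picks the component count with the lowest BIC; the synthetic blobs and the candidate range are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Illustrative data: 3 blobs in 2-D.
X, _ = make_blobs(n_samples=500, centers=3, cluster_std=0.7, random_state=0)

candidates = list(range(1, 8))
bics = [
    GaussianMixture(n_components=k, covariance_type='full', random_state=0).fit(X).bic(X)
    for k in candidates
]

# Lower BIC is better: it trades goodness of fit against parameter count.
best_k = candidates[int(np.argmin(bics))]
print("BIC-selected components:", best_k)
```

Swapping `.bic(X)` for `.aic(X)` gives the AIC variant, which penalizes extra components less heavily.
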
### Scaling (always)

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Standardize to zero mean and unit variance so no feature dominates the distance.
X_scaled = StandardScaler().fit_transform(X)
km = KMeans(n_clusters=5).fit(X_scaled)
```

### Dimensionality reduction first (high-D)

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Project to 50 principal components before clustering.
X_reduced = PCA(n_components=50).fit_transform(X)
km = KMeans(n_clusters=5).fit(X_reduced)
```

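### Initialize from labels (semi-supervised)

```python
import numpy as np
from sklearn.cluster import KMeans

# Assumes y holds a class label for every row of X; mask out unlabeled rows first if it does not.
init_centers = np.array([X[y == c].mean(axis=0) for c in np.unique(y)])
# n_init=1: with explicit initial centers, repeated restarts add nothing.
km = KMeans(n_clusters=len(init_centers), init=init_centers, n_init=1).fit(X)
```
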
## Decision criteria

| Situation | Approach |
|---|---|
| Small N | sklearn |
| Large N | MiniBatch |
| Massive N | FAISS |
| Image | Color quantize |
| Text | Spherical (normalized) |
| Non-spherical | GMM / DBSCAN |

**Default**: scale the features, use k-means++ with multiple `n_init` restarts, and choose K with the elbow method or silhouette score. For large data, switch to MiniBatch or FAISS.

## 🔗 Graph

- Parent: [[Clustering]]
- Variant: [[k-means++]]
- Adjacent: [[K-Nearest-Neighbors-K-NN]]

## 🤖 LLM usage

**When**: segmentation, EDA, vector indexing.

**When not**: non-spherical clusters or clusters of varying density (use DBSCAN).

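## ❌ Anti-patterns

- **No scaling**: the feature with the largest range dominates the distance.
- **K=2 default**: picking K=2 without checking is wrong; choose K from the data.
- **Random init**: use k-means++ instead.
- **K-means on non-spherical clusters**: the wrong tool; centroids cannot follow curved or elongated shapes.
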
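The non-spherical anti-pattern is easy to check on synthetic data. This sketch clusters two interleaved half-moons with k-means and with DBSCAN and scores both against the true labels; the dataset and the DBSCAN parameters (`eps=0.2`, `min_samples=5`) are assumptions chosen for this toy case.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-moons: clearly separated, but not spherical.
X, y_true = make_moons(n_samples=500, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Adjusted Rand index: 1.0 means perfect agreement with the true grouping.
ari_kmeans = adjusted_rand_score(y_true, km_labels)
ari_dbscan = adjusted_rand_score(y_true, db_labels)
print("k-means ARI:", ari_kmeans)
print("DBSCAN  ARI:", ari_dbscan)
```

k-means splits the plane with a straight boundary and cuts across both moons, while the density-based method follows each curve.
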
## 🧪 Verification / duplicates

- Verified (Lloyd 1957, Arthur k-means++ 2007, FAISS docs).
- Trust level A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: Lloyd / k-means++ / MiniBatch, plus elbow / silhouette / FAISS / quantize code |