id: wiki-2026-0508-k-means-clustering-foundations
title: K-Means Clustering
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [k-means, clustering, k-means++, mini-batch, elbow method, silhouette]
duplicate_of: none
source_trust_level: A
confidence_score: 0.97
verification_status: applied
tags: [machine-learning, clustering, k-means, unsupervised, lloyd]
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: scikit-learn / FAISS

K-Means Clustering

One-line summary

"Choose K centroids that minimize within-cluster variance." Lloyd 1957. Simple, fast, scalable. Limitations: assumes roughly spherical clusters, K must be specified in advance, and the algorithm only converges to a local optimum. Modern variants: k-means++, mini-batch, FAISS-based k-means for billion-scale data.

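Core concepts

Algorithm (Lloyd)

  1. Initialize K centroids.
  2. Assign: attach each point to its closest centroid.
  3. Update: set each centroid to the mean of the points assigned to it.
  4. Repeat steps 2-3 until the assignments (or centroids) stop changing.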
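
The loop above never increases the within-cluster sum of squares: the assign step is optimal for fixed centroids, and the update step is optimal for fixed assignments. The following is a minimal NumPy sketch of that property, not a production implementation. The function name `lloyd_with_history`, the synthetic blobs, and the seeds are illustrative assumptions.

```python
import numpy as np

def lloyd_with_history(X, k, max_iter=50, seed=0):
    """Run Lloyd's iterations and record the objective after each assign step."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points (an assumption; any init works here)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    history = []
    for _ in range(max_iter):
        # Assign: squared distance from every point to every center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Objective: sum of squared distances to the assigned center
        history.append(d2[np.arange(len(X)), labels].sum())
        # Update: mean of each cluster (an empty cluster keeps its old center)
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers, history

# Hypothetical data: three Gaussian blobs in 2-D
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
               for c in ([0, 0], [5, 5], [0, 5])])
labels, centers, history = lloyd_with_history(X, k=3)
print(history)  # a non-increasing sequence
```

The recorded values only guarantee convergence to a local optimum, which is why multiple restarts (`n_init`) are still worthwhile.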

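Initialization

  • Random: usually the weakest choice; results vary heavily from run to run.
  • k-means++ (Arthur 2007): spreads the initial centroids out by sampling each new one with probability proportional to its squared distance from the nearest centroid already chosen.
  • Forgy: picks K random data points as the initial centroids.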
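
To make the k-means++ sampling rule concrete, here is a small worked example on a toy 1-D dataset. The data and the assumption that the first center landed on x = 0 are purely illustrative.

```python
import numpy as np

# Toy 1-D dataset (hypothetical): three nearby points and one far away
X = np.array([[0.0], [1.0], [2.0], [10.0]])
chosen = [X[0]]  # assume the first center was drawn at x = 0

# Squared distance from each point to its nearest already-chosen center
d2 = np.min([((X - c) ** 2).sum(axis=1) for c in chosen], axis=0)

# k-means++ samples the next center with probability proportional to d2
probs = d2 / d2.sum()
print(probs)
```

Here `d2` is `[0, 1, 4, 100]`, so the far point at x = 10 receives 100/105 (about 95%) of the probability mass, while the already-chosen point has probability zero. This is what "spread out" means in practice.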

Choosing K

  • Elbow method.
  • Silhouette score.
  • Gap statistic.
  • BIC / AIC (Gaussian Mixture).
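
Of the criteria listed above, BIC is the only one tied to a probabilistic model, so it is computed on a Gaussian Mixture rather than on k-means itself. The sketch below assumes scikit-learn's `GaussianMixture.bic` and a synthetic three-blob dataset; the candidate range and seeds are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical data: three well-separated Gaussian blobs
rng = np.random.default_rng(0)
true_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(200, 2)) for c in true_centers])

candidate_ks = list(range(1, 7))
bics = []
for k in candidate_ks:
    gmm = GaussianMixture(n_components=k, covariance_type="full",
                          n_init=3, random_state=0).fit(X)
    bics.append(gmm.bic(X))  # lower BIC = better trade-off of fit vs. complexity

best_k = candidate_ks[int(np.argmin(bics))]
print(best_k, bics)
```

The chosen number of components can then be used as a candidate K for k-means, keeping in mind that the two models make different shape assumptions.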

Applications

  1. Customer segmentation.
  2. Image quantization (color palette).
  3. Anomaly detection (distance from centroid).
  4. Document clustering.
  5. Vector index (FAISS IVF).
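
The vector-index use case is the least obvious item in the list above, so here is a rough NumPy-only sketch of the inverted-file (IVF) idea: k-means centroids partition the vectors into cells, and a query scans only the `nprobe` nearest cells. This is an illustration of the concept, not FAISS's actual implementation; `build_ivf`, `ivf_search`, and all parameters are hypothetical names.

```python
import numpy as np

def build_ivf(X, nlist, n_iter=10, seed=0):
    """Partition X into nlist cells with a few Lloyd iterations."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=nlist, replace=False)].copy()
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        for j in range(nlist):
            members = X[assign == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    # Final assignment defines the inverted lists (vector ids per cell)
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    assign = d2.argmin(axis=1)
    lists = [np.flatnonzero(assign == j) for j in range(nlist)]
    return centroids, lists

def ivf_search(q, X, centroids, lists, nprobe=1):
    """Return the id of the nearest vector among the nprobe closest cells."""
    cell_order = ((centroids - q) ** 2).sum(axis=1).argsort()[:nprobe]
    candidates = np.concatenate([lists[j] for j in cell_order])
    if len(candidates) == 0:
        return None  # every probed cell was empty
    d2 = ((X[candidates] - q) ** 2).sum(axis=1)
    return int(candidates[d2.argmin()])

# Hypothetical data and query
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 8))
q = rng.normal(size=8)
centroids, lists = build_ivf(X, nlist=16)
approx = ivf_search(q, X, centroids, lists, nprobe=4)   # scans a subset of cells
exact = ivf_search(q, X, centroids, lists, nprobe=16)   # scans every cell
brute = int(((X - q) ** 2).sum(axis=1).argmin())
```

Probing every cell reproduces brute-force search; smaller `nprobe` trades recall for speed.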

💻 Patterns

sklearn k-means

from sklearn.cluster import KMeans
km = KMeans(n_clusters=5, init='k-means++', n_init=10, random_state=0).fit(X)
labels = km.labels_
centroids = km.cluster_centers_

Mini-batch (faster)

from sklearn.cluster import MiniBatchKMeans
km = MiniBatchKMeans(n_clusters=100, batch_size=1024).fit(X)

Elbow method

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

inertias = []
ks = range(1, 15)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squares
plt.plot(ks, inertias, 'o-')
plt.xlabel('K'); plt.ylabel('Inertia')
# The elbow (where the curve stops dropping sharply) suggests a reasonable K

Silhouette

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in [3, 4, 5, 6, 7]:
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
    print(k, silhouette_score(X, labels))
# Scores range from -1 to 1; closer to 1 means better-separated clusters

k-means++ init (manual)

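import numpy as np

def kmeans_pp_init(X, k, seed=None):
    """k-means++ seeding: sample each new center with probability proportional to D^2."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]  # first center: uniform at random
    d2 = ((X - centers[0]) ** 2).sum(axis=1)  # squared distance to nearest chosen center
    for _ in range(k - 1):
        idx = rng.choice(len(X), p=d2 / d2.sum())
        centers.append(X[idx])
        # Keep, for every point, the distance to its closest center so far
        d2 = np.minimum(d2, ((X - X[idx]) ** 2).sum(axis=1))
    return np.array(centers)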

Custom Lloyd (educational)

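import numpy as np

def kmeans_lloyd(X, k, max_iter=100):
    # Forgy init: k distinct data points as starting centers
    centers = X[np.random.choice(len(X), k, replace=False)]
    for _ in range(max_iter):
        # Assign: each point goes to its nearest center
        dists = np.linalg.norm(X[:, None] - centers, axis=2)
        labels = dists.argmin(axis=1)
        # Update: mean of each cluster (an empty cluster keeps its old center)
        new_centers = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                                else centers[i] for i in range(k)])
        if np.allclose(centers, new_centers):
            break
        centers = new_centers
    return labels, centers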

FAISS k-means (large-scale)

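import faiss
import numpy as np

Xf = np.ascontiguousarray(X, dtype='float32')  # FAISS expects contiguous float32
d = Xf.shape[1]
# gpu=True requires a GPU build of FAISS; use gpu=False on CPU-only installs
kmeans = faiss.Kmeans(d, 100, niter=20, gpu=True)
kmeans.train(Xf)
centroids = kmeans.centroids
_, labels = kmeans.index.search(Xf, 1)  # nearest centroid per point, shape (n, 1)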

Image color quantization

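import numpy as np
from sklearn.cluster import KMeans

def quantize_image(img, k=8):
    # img: H x W x 3 uint8 array (RGB assumed)
    pixels = img.reshape(-1, 3).astype(np.float64)
    km = KMeans(n_clusters=k, n_init=3).fit(pixels)
    quantized = km.cluster_centers_[km.labels_]  # replace each pixel by its centroid color
    return np.rint(quantized).clip(0, 255).reshape(img.shape).astype('uint8')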

Anomaly via distance

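import numpy as np

def detect_anomaly(X, km, threshold=None):
    # Distance from each point to its assigned centroid
    dists = np.linalg.norm(X - km.cluster_centers_[km.predict(X)], axis=1)
    # Default threshold: flag the 1% of points farthest from their centroid
    if threshold is None:
        threshold = np.percentile(dists, 99)
    return dists > threshold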

Spherical k-means (text, cosine)

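import numpy as np
from sklearn.cluster import KMeans

def spherical_kmeans(X, k, max_iter=100):
    """L2-normalize rows, then run Euclidean k-means (approximates cosine-based clustering)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X_norm = X / np.maximum(norms, 1e-12)  # guard against zero-length rows
    return KMeans(n_clusters=k, max_iter=max_iter, n_init=10).fit(X_norm)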

Gaussian Mixture (alternative)

from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=5, covariance_type='full').fit(X)
labels = gmm.predict(X)
# Compared with k-means: allows ellipsoidal clusters and soft (probabilistic) assignment

Scaling (always)

from sklearn.preprocessing import StandardScaler
X_scaled = StandardScaler().fit_transform(X)
km = KMeans(n_clusters=5).fit(X_scaled)

Dimensionality reduction first (high-D)

from sklearn.decomposition import PCA
X_reduced = PCA(n_components=50).fit_transform(X)
km = KMeans(n_clusters=5).fit(X_reduced)

Initialize from labels (semi-supervised)

init_centers = np.array([X[y == c].mean(axis=0) for c in np.unique(y)])
km = KMeans(n_clusters=len(init_centers), init=init_centers, n_init=1).fit(X)

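Decision criteria

| Situation | Approach |
|---|---|
| Small N | sklearn KMeans |
| Large N | MiniBatchKMeans |
| Massive N | FAISS |
| Image | Color quantization |
| Text | Spherical k-means (normalized vectors) |
| Non-spherical clusters | GMM / DBSCAN |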

Default: scale the features, use k-means++ with multiple restarts (n_init), and pick K with the elbow method or silhouette score. For large datasets, switch to MiniBatch or FAISS.
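
The default recipe can be bundled into a single scikit-learn pipeline so that scaling is never skipped. This is a minimal sketch under assumed settings: the synthetic two-feature data, `n_clusters=3`, and the seeds are illustrative, and K would normally come from the elbow or silhouette analysis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two features on very different scales
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 1, 300), rng.normal(0, 1000, 300)])

# Scale first, then cluster with k-means++ and several restarts
pipe = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0),
)
labels = pipe.fit_predict(X)
print(np.bincount(labels))  # cluster sizes
```

Keeping the scaler inside the pipeline also means new data passed to `pipe.predict` is scaled with the same statistics as the training data.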

🔗 Graph

🤖 LLM usage

When to use: segmentation, EDA, vector indexing. When not to use: non-spherical or varying-density clusters (use DBSCAN instead).
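
A quick way to see the non-spherical failure mode is the classic two-moons dataset. The sketch below compares k-means and DBSCAN against the known labels using the adjusted Rand index; the dataset, `eps=0.2`, and `min_samples=5` are assumptions chosen for this toy example, not general recommendations.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-circles: clearly separated, but not spherical
X, y = make_moons(n_samples=400, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Adjusted Rand index: 1.0 means perfect agreement with the true grouping
ari_km = adjusted_rand_score(y, km_labels)
ari_db = adjusted_rand_score(y, db_labels)
print(f"k-means ARI: {ari_km:.2f}, DBSCAN ARI: {ari_db:.2f}")
```

k-means cuts the moons with a straight boundary, while the density-based method follows each arc.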

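Anti-patterns

  • No scaling: features with large ranges dominate the distance.
  • Defaulting to K=2: choose K from the data (elbow, silhouette) instead.
  • Plain random init: use k-means++.
  • K-means on non-spherical clusters: use GMM or DBSCAN instead.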

🧪 Verification / duplicates

  • Verified (Lloyd 1957, Arthur k-means++ 2007, FAISS docs).
  • Trust level: A.

🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: Lloyd / k-means++ / MiniBatch, plus elbow / silhouette / FAISS / quantize code |