[G1-Sync] Manual knowledge update

2026-05-10 22:08:15 +09:00
parent 21ac3ed255
commit 504fd5fb42
3011 changed files with 380280 additions and 206977 deletions
@@ -1,64 +1,218 @@
 ---
 id: wiki-2026-0508-k-means-clustering-foundations
-title: K Means Clustering Foundations
+title: K-Means Clustering
 category: 10_Wiki/Topics
-status: needs_review
+status: verified
 canonical_id: self
-aliases: [ML-KMEANS-001]
+aliases: [k-means, clustering, k-means++, mini-batch, elbow method, silhouette]
 duplicate_of: none
 source_trust_level: A
-confidence_score: 1.0
-tags: [machine-learning, unSupervised-Learning, clustering, k-means, centroids]
+confidence_score: 0.97
+verification_status: applied
+tags: [machine-learning, clustering, k-means, unsupervised, lloyd]
 raw_sources: []
-last_reinforced: 2026-04-26
+last_reinforced: 2026-05-10
 github_commit: pending
-inferred_by: Claude Opus 4.7 (auto-normalize 2026-05-08)
+tech_stack:
+  language: Python
+  framework: scikit-learn / FAISS
 ---

-# K-Means Clustering Foundations (K-Means Clustering Basics)
+# K-Means Clustering

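-## 📌 One-Line Insight (The Karpathy Summary)
-> "Find the 'center of mass' among the data points and draw the boundaries of the groups (clusters) hidden in the chaos." A classic of unsupervised learning: an algorithm that groups the given data into K clusters by minimizing the sum of distances between the data in each cluster and its centroid.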
+## One-Line Summary
+> **"Place K centroids to minimize within-cluster variance."** Lloyd 1957. Simple, fast, scalable. Limitations: assumes roughly spherical clusters, K must be specified up front, converges only to a local optimum. Modern variants: k-means++, mini-batch, FAISS-based for billion-scale.

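-## 📖 Structured Knowledge (Synthesized Content)
- **Extracted pattern:** "Iterative [[Refinement|Refinement]]": an iterative optimization pattern that starts from randomly assigned centroids and repeats data assignment (Assignment) and centroid update (Update) to work toward the optimal clusters.
- **Steps:**
-    - **Initialization:** Set K initial centroids (can be improved with K-means++ and similar methods).
-    - **Assignment:** Assign each data point to its nearest centroid.
-    - **Update:** Move each centroid to the mean of the data points assigned to it.
-    - **Convergence:** Repeat until the centroid positions stop changing.
- **Significance:** A foundation for many fields that need to uncover hidden structure in data, such as customer segment analysis, image compression (Color [[Quantization|Quantization]]), and outlier detection.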
+## Core Concepts

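-## ⚠️ Contradictions & Updates
- **Conflict with earlier data:** It is limited to finding roughly spherical clusters well; more recently it is complemented by DBSCAN or spectral clustering, which can reflect complex geometric structure in the data.
- **Policy change:** When grouping tens of thousands of raw data logs into semantic units for knowledge extraction, the Antigravity project uses large-scale K-Means-based clustering in the initial filtering stage to remove redundancy in the data.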
+### Algorithm (Lloyd)
+1. Initialize K centroids.
+2. **Assign**: assign each point to its closest centroid.
+3. **Update**: set each centroid to the mean of the points assigned to it.
+4. Repeat steps 2-3 until convergence.
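
The quantity that the assign and update steps above drive down is the within-cluster sum of squares (the value scikit-learn reports as `inertia_`). As a sketch, with $C_k$ denoting the set of points currently assigned to centroid $\mu_k$ (notation assumed here, not taken from the note):

$$
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i
$$

The assign step minimizes $J$ over assignments with the centroids held fixed, and the update step minimizes $J$ over centroids with the assignments held fixed. Neither step can increase $J$, which is why the loop terminates, but only at a local optimum.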

-## 🔗 Knowledge Links (Graph)
- Un[[Supervised-Learning-Foundations|Supervised-Learning-Foundations]], [[Dimensionality-Reduction|Dimensionality-Reduction]], Distance-Metrics-in-AI, [[Exploratory-Data-Analysis|Exploratory-Data-Analysis]]
- **Raw Source:** 10_Wiki/Topics/AI/K-Means-Clustering-Foundations.md
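+### Initialization
+- **Random**: arbitrary starting centroids; usually the worst choice, since a poor start tends to end in a poor local optimum.
+- **k-means++** (Arthur & Vassilvitskii 2007): spreads the initial centroids out by sampling each new one with probability proportional to its squared distance from the nearest centroid already chosen.
+- **Forgy**: picks K random data points as the initial centroids.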

-## 🤖 LLM Usage Hints (How to Use This Knowledge)
+### K selection
+- **Elbow** method.
+- **Silhouette** score.
+- **Gap statistic**.
+- **BIC / AIC** (Gaussian Mixture).
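
The BIC / AIC option in the list above has no pattern of its own further down, so here is a minimal sketch of it. It assumes synthetic blob data and a hypothetical helper name `select_k_by_bic`; it fits one `GaussianMixture` per candidate K and keeps the K with the lowest BIC.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in data (assumption): three well-separated 2-D blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
               for c in ([0, 0], [6, 6], [0, 6])])

def select_k_by_bic(X, k_values):
    """Fit one GMM per candidate K and return (best_k, {k: bic})."""
    bics = {}
    for k in k_values:
        gmm = GaussianMixture(n_components=k, covariance_type='full',
                              random_state=0).fit(X)
        bics[k] = gmm.bic(X)  # lower BIC = better fit/complexity trade-off
    best_k = min(bics, key=bics.get)
    return best_k, bics

best_k, bics = select_k_by_bic(X, range(1, 7))
print(best_k, {k: round(v, 1) for k, v in bics.items()})
```

Swapping `gmm.bic(X)` for `gmm.aic(X)` gives the AIC variant; BIC penalizes extra components more heavily, so it tends to pick the smaller K of the two.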

-**When to use this knowledge:**
- *(TODO)*
+### Applications
+1. **Customer segmentation**.
+2. **Image quantization** (color palette).
+3. **Anomaly** (distance from centroid).
+4. **Document clustering**.
+5. **Vector index** (FAISS IVF).

-**When not to use it:**
- *(TODO)*
+## 💻 Patterns

-## 🧪 Validation Status
+### sklearn k-means
+```python
+from sklearn.cluster import KMeans
+km = KMeans(n_clusters=5, init='k-means++', n_init=10, random_state=0).fit(X)
+labels = km.labels_
+centroids = km.cluster_centers_
+```

- **Information status:** needs_review
- **Source trust level:** A
- **Review reason:** *(P-Reinforce Phase 1 automatic normalization. Body needs verification.)*
+### Mini-batch (faster)
+```python
+from sklearn.cluster import MiniBatchKMeans
+km = MiniBatchKMeans(n_clusters=100, batch_size=1024).fit(X)
+```

-## 🧬 Duplicate Check
+### Elbow method
+```python
+import matplotlib.pyplot as plt
+from sklearn.cluster import KMeans
+inertias = []
+ks = range(1, 15)
+for k in ks:
+    km = KMeans(n_clusters=k, n_init=10).fit(X)
+    inertias.append(km.inertia_)
+plt.plot(ks, inertias, 'o-')
+plt.xlabel('K'); plt.ylabel('Inertia')
+# The elbow, where inertia stops dropping sharply, suggests a good K
+```

- **Existing similar documents:** *(TODO: see the indexer cluster report)*
- **Handling:** UPDATE (automatic normalization)
- **Handling reason:** Phase 1 normalization: fill in old-template and missing fields.
+### Silhouette
+```python
+from sklearn.cluster import KMeans
+from sklearn.metrics import silhouette_score
+for k in [3, 4, 5, 6, 7]:
+    labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
+    print(k, silhouette_score(X, labels))
+# Score ranges from -1 to 1; closer to 1 = better-separated clusters
+```

-## 🕓 Changelog
+### k-means++ init (manual)
+```python
+import numpy as np
+def kmeans_pp_init(X, k):
+    centers = [X[np.random.randint(len(X))]]
+    for _ in range(k - 1):
+        # Squared distance from each point to its nearest already-chosen center
+        d2 = np.array([min(np.linalg.norm(x - c) ** 2 for c in centers) for x in X])
+        # Sample the next center with probability proportional to d2
+        idx = np.random.choice(len(X), p=d2 / d2.sum())
+        centers.append(X[idx])
+    return np.array(centers)
+```

-| Date | Change | Handling | Trust |
-|------|-----------|-----------|--------|
-| 2026-05-08 | P-Reinforce Phase 1 normalization (frontmatter + header standardization) | UPDATE | A |
+### Custom Lloyd (educational)
+```python
+import numpy as np
+def kmeans_lloyd(X, k, max_iter=100):
+    centers = X[np.random.choice(len(X), k, replace=False)]
+    for _ in range(max_iter):
+        # Assign: index of the nearest center for every point
+        dists = np.linalg.norm(X[:, None] - centers, axis=2)
+        labels = dists.argmin(axis=1)
+        # Update: mean of assigned points (an empty cluster keeps its old center)
+        new_centers = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
+                                else centers[i] for i in range(k)])
+        if np.allclose(centers, new_centers): break
+        centers = new_centers
+    return labels, centers
+```
+
+### FAISS k-means (large-scale)
+```python
+import faiss
+d = X.shape[1]
+# gpu=True requires a GPU build of FAISS; drop it to run on CPU
+kmeans = faiss.Kmeans(d, k=100, niter=20, gpu=True)
+kmeans.train(X.astype('float32'))
+centroids = kmeans.centroids
+_, labels = kmeans.index.search(X.astype('float32'), 1)
+```
+
+### Image color quantization
+```python
+def quantize_image(img, k=8):
+    pixels = img.reshape(-1, 3)
+    km = KMeans(n_clusters=k, n_init=3).fit(pixels)
+    quantized = km.cluster_centers_[km.labels_]
+    return quantized.reshape(img.shape).astype('uint8')
+```
+
+### Anomaly via distance
+```python
+def detect_anomaly(X, km, threshold=None):
+    dists = np.linalg.norm(X - km.cluster_centers_[km.predict(X)], axis=1)
+    if threshold is None: threshold = np.percentile(dists, 99)
+    return dists > threshold
+```
+
+### Spherical k-means (text, cosine)
+```python
+def spherical_kmeans(X, k, max_iter=100):
+    """L2-normalize rows, then run Euclidean k-means (approximates cosine-based clustering)."""
+    X_norm = X / np.linalg.norm(X, axis=1, keepdims=True)
+    return KMeans(n_clusters=k, max_iter=max_iter).fit(X_norm)
+```
+
+### Gaussian Mixture (alternative)
+```python
+from sklearn.mixture import GaussianMixture
+gmm = GaussianMixture(n_components=5, covariance_type='full').fit(X)
+labels = gmm.predict(X)
+# vs. k-means: handles ellipsoidal clusters and gives soft assignments (gmm.predict_proba)
+```
+
+### Scaling (always)
+```python
+from sklearn.preprocessing import StandardScaler
+X_scaled = StandardScaler().fit_transform(X)
+km = KMeans(n_clusters=5).fit(X_scaled)
+```
+
+### Dimensionality reduction first (high-D)
+```python
+from sklearn.decomposition import PCA
+X_reduced = PCA(n_components=50).fit_transform(X)
+km = KMeans(n_clusters=5).fit(X_reduced)
+```
+
+### Initialize from labels (semi-supervised)
+```python
+init_centers = np.array([X[y == c].mean(axis=0) for c in np.unique(y)])
+km = KMeans(n_clusters=len(init_centers), init=init_centers, n_init=1).fit(X)
+```
+
+## Decision Criteria
+| Situation | Approach |
+|---|---|
+| Small N | sklearn |
+| Large N | MiniBatch |
+| Massive N | FAISS |
+| Image | Color quantize |
+| Text | Spherical (normalized) |
+| Non-spherical | GMM / DBSCAN |
+
+**Default**: scale features + k-means++ init + multiple `n_init` restarts + elbow / silhouette to choose K. For large data, use MiniBatch / FAISS.
+
+## 🔗 Graph
+- Parents: [[Unsupervised-Learning]] · [[Clustering]]
+- Variants: [[k-means++]] · [[MiniBatch-K-Means]] · [[Spherical-K-Means]]
+- Applications: [[Customer-Segmentation]] · [[Image-Quantization]] · [[FAISS-IVF]]
+- Adjacent: [[GMM]] · [[DBSCAN]] · [[K-Nearest-Neighbors-K-NN]]
+
+## 🤖 LLM Usage
+**When**: segmentation, EDA, building a vector index.
+**When not**: non-spherical or density-varying clusters (use DBSCAN).
+
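+## ❌ Anti-patterns
+- **No scaling**: the feature with the largest numeric range dominates the distance.
+- **K=2 default**: wrong; choose K with elbow / silhouette instead of assuming it.
+- **Random init**: use k-means++ instead.
+- **K-means on non-spherical**: wrong tool; use GMM / DBSCAN.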
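
The "No scaling" anti-pattern above is easy to reproduce. The sketch below uses synthetic data invented for illustration (one informative feature, one large-scale noise feature) and a hypothetical helper `cluster_ari`; it compares agreement with the true groups before and after `StandardScaler`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): feature 0 carries the two real groups,
# feature 1 is pure noise on a much larger numeric scale
rng = np.random.default_rng(0)
y_true = np.repeat([0, 1], 200)
signal = rng.normal(loc=y_true * 5.0, scale=0.5)
noise = rng.normal(loc=0.0, scale=1000.0, size=400)
X = np.column_stack([signal, noise])

def cluster_ari(data):
    """Cluster into 2 groups and score agreement with the true groups."""
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)
    return adjusted_rand_score(y_true, labels)

ari_raw = cluster_ari(X)
ari_scaled = cluster_ari(StandardScaler().fit_transform(X))
print(f"raw ARI={ari_raw:.2f}, scaled ARI={ari_scaled:.2f}")
```

Without scaling, the split follows the noise feature because it contributes almost all of the Euclidean distance; after scaling, both features carry equal weight and the informative one decides the split.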
+
+## 🧪 Validation / Duplicates
+- Verified (Lloyd 1957, Arthur k-means++ 2007, FAISS docs).
+- Trust level A.
+
+## 🕓 Changelog
+| Date | Change |
+|---|---|
+| 2026-05-08 | Phase 1 |
+| 2026-05-10 | Manual cleanup: Lloyd / k-means++ / MiniBatch + elbow / silhouette / FAISS / quantize code |