# K-Means Clustering: From Principles to Practice

K-Means is one of the most classic unsupervised learning algorithms. It partitions data into K clusters so as to minimize the within-cluster variance.
## Algorithm Principles

The goal of K-Means is to minimize the within-cluster sum of squared errors:

$$J = \sum_{k=1}^{K}\sum_{\mathbf{x}_i \in C_k} \lVert \mathbf{x}_i - \boldsymbol{\mu}_k \rVert^2$$

where $C_k$ is the set of samples assigned to cluster $k$ and $\boldsymbol{\mu}_k$ is that cluster's centroid.
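To make the objective concrete, here is a quick numeric check of $J$ on toy 1-D data (the points, labels, and centroids below are made up for illustration):

```python
import numpy as np

# Toy check of the objective J: two 1-D clusters, centroids at the cluster means
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.5], [10.5]])

J = sum(np.sum((X[labels == k] - centroids[k]) ** 2) for k in range(2))
print(J)  # each point is 0.5 from its centroid, so J = 4 * 0.25 = 1.0
```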
Algorithm steps:

1. Randomly initialize K cluster centroids.
2. Assign each sample to its nearest centroid.
3. Update each centroid to the mean of the samples in its cluster.
4. Repeat steps 2-3 until convergence.
## Implementing K-Means by Hand

```python
import numpy as np
import matplotlib.pyplot as plt

class KMeans:
    def __init__(self, n_clusters=3, max_iters=100, random_state=42):
        self.n_clusters = n_clusters
        self.max_iters = max_iters
        self.random_state = random_state
        self.centroids = None
        self.labels = None
        self.inertia_ = None

    def fit(self, X):
        np.random.seed(self.random_state)
        # Initialize: pick K distinct samples as the starting centroids
        random_idx = np.random.choice(len(X), self.n_clusters, replace=False)
        self.centroids = X[random_idx]

        for _ in range(self.max_iters):
            # Assignment step
            self.labels = self._assign_clusters(X)
            # Update step: each centroid becomes the mean of its cluster
            new_centroids = np.array([
                X[self.labels == k].mean(axis=0)
                for k in range(self.n_clusters)
            ])
            # Converged when the centroids stop moving
            if np.all(self.centroids == new_centroids):
                break
            self.centroids = new_centroids

        # Within-cluster sum of squared errors (the objective J)
        self.inertia_ = sum(
            np.sum((X[self.labels == k] - self.centroids[k]) ** 2)
            for k in range(self.n_clusters)
        )
        return self

    def _assign_clusters(self, X):
        # Euclidean distance from every sample to every centroid
        distances = np.array([
            np.sqrt(np.sum((X - centroid) ** 2, axis=1))
            for centroid in self.centroids
        ])
        return np.argmin(distances, axis=0)

    def predict(self, X):
        return self._assign_clusters(X)
```
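The assignment and update steps can also be traced by hand on a tiny dataset (toy values, independent of the class above):

```python
import numpy as np

# One Lloyd iteration on toy 2-D data: assign, then update
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])

# Assignment: each point goes to its nearest centroid
d = np.array([np.sqrt(np.sum((X - c) ** 2, axis=1)) for c in centroids])
labels = np.argmin(d, axis=0)  # [0, 0, 1, 1]

# Update: each centroid moves to the mean of its assigned points
new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])
print(new_centroids)  # [[0.  0.5], [5.  5.5]]
```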
## K-Means++ Initialization

K-Means++ avoids poor local optima through smarter initialization: each new centroid is sampled with probability proportional to its squared distance from the nearest already-chosen centroid.

```python
class KMeansPlusPlus(KMeans):
    def fit(self, X):
        np.random.seed(self.random_state)
        # First centroid: a uniformly random sample
        centroids = [X[np.random.randint(len(X))]]

        for _ in range(1, self.n_clusters):
            # D^2 weighting: squared distance to the nearest chosen centroid
            distances = np.array([
                min(np.sum((x - c) ** 2) for c in centroids)
                for x in X
            ])
            probs = distances / distances.sum()
            next_centroid_idx = np.random.choice(len(X), p=probs)
            centroids.append(X[next_centroid_idx])

        self.centroids = np.array(centroids)

        # Standard Lloyd iterations from here on
        for _ in range(self.max_iters):
            self.labels = self._assign_clusters(X)
            new_centroids = np.array([
                X[self.labels == k].mean(axis=0)
                for k in range(self.n_clusters)
            ])
            if np.all(self.centroids == new_centroids):
                break
            self.centroids = new_centroids
        return self
```
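The D² weighting at the heart of K-Means++ can be seen in isolation (toy values; the first centroid is fixed here rather than sampled, purely for illustration):

```python
import numpy as np

# D^2 sampling: the probability of becoming the next centroid is proportional
# to the squared distance from the nearest already-chosen centroid
X = np.array([[0.0], [1.0], [10.0]])
centroids = [X[0]]  # fix the first centroid at x=0 for illustration

d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
probs = d2 / d2.sum()
print(probs)  # [0, 1/101, 100/101] -- the far point dominates the draw
```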
## Choosing K with the Elbow Method

```python
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

inertias = []
K_range = range(1, 11)
for k in K_range:
    km = KMeans(n_clusters=k)
    km.fit(X)
    inertias.append(km.inertia_)

plt.figure(figsize=(8, 5))
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()
```
## Silhouette Coefficient

The silhouette coefficient measures both cluster cohesion and separation. For a sample, let $a$ be its mean distance to the other points in its own cluster and $b$ its mean distance to the points in the nearest other cluster; then $s = \frac{b - a}{\max(a, b)}$, which ranges from -1 to 1 (higher is better).

```python
from sklearn.metrics import silhouette_score, silhouette_samples

silhouette_scores = []
for k in range(2, 11):
    km = KMeans(n_clusters=k)
    km.fit(X)
    score = silhouette_score(X, km.labels)
    silhouette_scores.append(score)

plt.figure(figsize=(8, 5))
plt.plot(range(2, 11), silhouette_scores, 'bo-')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Analysis')
plt.show()
```
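To connect back to the formula, the silhouette of a single point can be computed by hand (toy 1-D data, not the `make_blobs` data above):

```python
import numpy as np

# Silhouette of the point x=0 in clusters {0, 1} and {10, 11}
X = np.array([0.0, 1.0, 10.0, 11.0])
labels = np.array([0, 0, 1, 1])

x = X[0]
a = np.abs(X[labels == 0][1:] - x).mean()  # mean intra-cluster distance: 1.0
b = np.abs(X[labels == 1] - x).mean()      # mean distance to other cluster: 10.5
s = (b - a) / max(a, b)
print(s)  # (10.5 - 1.0) / 10.5 ≈ 0.905 -- well-clustered
```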
## Using scikit-learn

```python
from sklearn.cluster import KMeans as SklearnKMeans
from sklearn.preprocessing import StandardScaler

# Standardize first -- K-Means is distance-based and scale-sensitive
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

kmeans = SklearnKMeans(n_clusters=4, init='k-means++', n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)

plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=labels, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=200, c='red', marker='X')
plt.title('K-Means Clustering')
plt.colorbar(scatter)
plt.show()
```
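One practical detail: after scaling, the fitted cluster centers live in standardized units; `StandardScaler.inverse_transform` maps them back to the original scale for interpretation. A minimal sketch on synthetic data (assumed blob means of 0 and 8):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Two synthetic blobs around (0, 0) and (8, 8)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(8, 1, (50, 2))])

scaler = StandardScaler()
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(scaler.fit_transform(X))

# Map the centers back from standardized units to the original feature scale
centers = scaler.inverse_transform(km.cluster_centers_)
print(np.sort(centers[:, 0]))  # roughly [0, 8]
```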
## Limitations of K-Means

- **Assumes spherical clusters**: K-Means assumes convex, roughly isotropic clusters and performs poorly on non-convex shapes.
- **K must be specified in advance**: an ill-chosen K yields poor results.
- **Sensitive to initialization**: different initializations can converge to different solutions.
- **Sensitive to outliers**: because centroids are means, a few outliers can shift them substantially.
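The outlier sensitivity is easy to demonstrate: a centroid is a mean, and a mean chases extreme values (toy numbers below; median-based methods such as K-Medoids are one common remedy):

```python
import numpy as np

# One extreme value drags the mean (and hence a centroid) far from the cluster
cluster = np.array([0.0, 1.0, 2.0, 100.0])
print(cluster.mean())      # 25.75 -- dominated by the outlier
print(np.median(cluster))  # 1.5  -- robust; the idea behind K-Medoids
```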
## Alternative Algorithms

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering

# DBSCAN: density-based, handles non-convex clusters and flags outliers as noise
dbscan = DBSCAN(eps=0.5, min_samples=5)
labels_db = dbscan.fit_predict(X_scaled)

# Agglomerative clustering: hierarchical, no centroid assumption
agg = AgglomerativeClustering(n_clusters=4)
labels_agg = agg.fit_predict(X_scaled)
```
## Application: Image Segmentation

```python
from sklearn.datasets import load_sample_image

image = load_sample_image('china.jpg')
h, w, c = image.shape
# Treat each pixel as a 3-D point in RGB space
image_2d = image.reshape(-1, c).astype(float)

# Quantize to 16 colors: each cluster center becomes a palette color
kmeans_img = SklearnKMeans(n_clusters=16, random_state=42)
labels_img = kmeans_img.fit_predict(image_2d)
quantized = kmeans_img.cluster_centers_[labels_img].reshape(h, w, c).astype(np.uint8)
```
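As a back-of-the-envelope check on why 16-color quantization also compresses the image: each pixel index needs only 4 bits instead of 24 bits of RGB (palette overhead ignored; figures are illustrative):

```python
import math

# 24-bit RGB vs a 4-bit index into a 16-color palette
bits_per_pixel_original = 3 * 8                # 3 channels x 8 bits
bits_per_pixel_quantized = int(math.log2(16))  # 4-bit palette index
ratio = bits_per_pixel_original / bits_per_pixel_quantized
print(ratio)  # 6.0x smaller, ignoring the tiny 16x3-byte palette
```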
## Summary

K-Means is the most widely used clustering algorithm: simple and efficient. K-Means++ improves initialization, while the elbow method and silhouette coefficient help choose K. For non-convex clusters, density-based algorithms such as DBSCAN are a better fit. In practice, data preprocessing (standardization in particular) has a large effect on K-Means results.