K-Means Clustering: From Principles to Practice

K-Means is one of the most classic unsupervised learning algorithms. It partitions data into K clusters so as to minimize the within-cluster variance.

Algorithm Principles

K-Means minimizes the within-cluster sum of squared errors:

$$J = \sum_{k=1}^{K}\sum_{\mathbf{x}_i \in C_k} \|\mathbf{x}_i - \boldsymbol{\mu}_k\|^2$$

Algorithm steps:

  1. Randomly initialize K cluster centers
  2. Assign each sample to the nearest cluster center
  3. Update each cluster center to the mean of the samples in its cluster
  4. Repeat steps 2-3 until convergence
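
Step 3 is not arbitrary: holding the assignments fixed and setting the gradient of $J$ with respect to each center to zero shows that the within-cluster mean is the minimizer:

$$\frac{\partial J}{\partial \boldsymbol{\mu}_k} = -2\sum_{\mathbf{x}_i \in C_k} (\mathbf{x}_i - \boldsymbol{\mu}_k) = \mathbf{0} \quad\Longrightarrow\quad \boldsymbol{\mu}_k = \frac{1}{|C_k|}\sum_{\mathbf{x}_i \in C_k} \mathbf{x}_i$$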

Implementing K-Means from Scratch

```python
import numpy as np
import matplotlib.pyplot as plt

class KMeans:
    def __init__(self, n_clusters=3, max_iters=100, random_state=42):
        self.n_clusters = n_clusters
        self.max_iters = max_iters
        self.random_state = random_state
        self.centroids = None
        self.labels = None
        self.inertia_ = None

    def fit(self, X):
        np.random.seed(self.random_state)
        # Randomly initialize the cluster centers
        random_idx = np.random.choice(len(X), self.n_clusters, replace=False)
        self.centroids = X[random_idx]

        for _ in range(self.max_iters):
            # Assign each sample to the nearest cluster center
            self.labels = self._assign_clusters(X)

            # Update centers to the mean of each cluster
            new_centroids = np.array([
                X[self.labels == k].mean(axis=0)
                for k in range(self.n_clusters)
            ])

            # Check for convergence
            if np.all(self.centroids == new_centroids):
                break
            self.centroids = new_centroids

        # Compute inertia (within-cluster sum of squares)
        self.inertia_ = sum(
            np.sum((X[self.labels == k] - self.centroids[k]) ** 2)
            for k in range(self.n_clusters)
        )
        return self

    def _assign_clusters(self, X):
        distances = np.array([
            np.sqrt(np.sum((X - centroid) ** 2, axis=1))
            for centroid in self.centroids
        ])
        return np.argmin(distances, axis=0)

    def predict(self, X):
        return self._assign_clusters(X)
```
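
The per-centroid loop in `_assign_clusters` works, but NumPy broadcasting can compute the full distance matrix in one expression; a minimal standalone sketch (the toy points and centers are made up for illustration):

```python
import numpy as np

X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])

# (n, 1, d) - (1, k, d) broadcasts to (n, k, d); taking the norm over the
# last axis gives an (n, k) matrix of sample-to-centroid distances
distances = np.linalg.norm(X_toy[:, None, :] - centroids[None, :, :], axis=2)
labels_toy = np.argmin(distances, axis=1)
print(labels_toy)  # [0 0 1 1]
```

The broadcasted version avoids the Python-level loop over centroids, which matters once the number of clusters grows.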

K-Means++ Initialization

K-Means++ avoids poor solutions through smarter initialization:
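
Concretely, writing $D(\mathbf{x})$ for the distance from sample $\mathbf{x}$ to its nearest already-chosen center, the next center is sampled with probability proportional to $D(\mathbf{x})^2$:

$$P(\mathbf{x}) = \frac{D(\mathbf{x})^2}{\sum_{\mathbf{x}'} D(\mathbf{x}')^2}$$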

```python
class KMeansPlusPlus(KMeans):
    def fit(self, X):
        np.random.seed(self.random_state)
        # K-Means++ initialization
        centroids = [X[np.random.randint(len(X))]]

        for _ in range(1, self.n_clusters):
            # Squared distance from each sample to its nearest chosen center
            distances = np.array([
                min(np.sum((x - c) ** 2) for c in centroids)
                for x in X
            ])
            # Sample the next center with probability proportional to distance
            probs = distances / distances.sum()
            next_centroid_idx = np.random.choice(len(X), p=probs)
            centroids.append(X[next_centroid_idx])

        self.centroids = np.array(centroids)

        # Standard K-Means iterations
        for _ in range(self.max_iters):
            self.labels = self._assign_clusters(X)
            new_centroids = np.array([
                X[self.labels == k].mean(axis=0)
                for k in range(self.n_clusters)
            ])
            if np.all(self.centroids == new_centroids):
                break
            self.centroids = new_centroids

        return self
```

Choosing K with the Elbow Method

```python
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

inertias = []
K_range = range(1, 11)

for k in K_range:
    km = KMeans(n_clusters=k)
    km.fit(X)
    inertias.append(km.inertia_)

plt.figure(figsize=(8, 5))
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()
```

Silhouette Coefficient

The silhouette coefficient measures both the cohesion and the separation of a clustering:
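
The score for a single sample is $s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$, where $a(i)$ is the mean distance to the sample's own cluster and $b(i)$ is the smallest mean distance to any other cluster. A minimal hand-rolled version (the four-point toy dataset is made up) can be checked against scikit-learn's `silhouette_score`:

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Two well-separated toy clusters
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
labels_demo = np.array([0, 0, 1, 1])

def silhouette_manual(X, labels):
    n = len(X)
    scores = np.zeros(n)
    for i in range(n):
        same = labels == labels[i]
        same[i] = False  # exclude the sample itself
        # a(i): mean distance to the other points in the same cluster
        a = np.mean(np.linalg.norm(X[same] - X[i], axis=1))
        # b(i): smallest mean distance to any other cluster
        b = min(
            np.mean(np.linalg.norm(X[labels == k] - X[i], axis=1))
            for k in set(labels) - {labels[i]}
        )
        scores[i] = (b - a) / max(a, b)
    return scores.mean()

print(silhouette_manual(X_demo, labels_demo))  # close to 1: tight, well separated
```

A value near 1 indicates tight, well-separated clusters; values near 0 indicate overlapping clusters, and negative values suggest misassigned samples.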

```python
from sklearn.metrics import silhouette_score, silhouette_samples

silhouette_scores = []
for k in range(2, 11):
    km = KMeans(n_clusters=k)
    km.fit(X)
    score = silhouette_score(X, km.labels)
    silhouette_scores.append(score)

plt.figure(figsize=(8, 5))
plt.plot(range(2, 11), silhouette_scores, 'bo-')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Analysis')
plt.show()
```

Using scikit-learn

```python
from sklearn.cluster import KMeans as SklearnKMeans
from sklearn.preprocessing import StandardScaler

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Cluster
kmeans = SklearnKMeans(n_clusters=4, init='k-means++', n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)

# Visualize
plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=labels, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=200, c='red', marker='X')
plt.title('K-Means Clustering')
plt.colorbar(scatter)
plt.show()
```

Limitations of K-Means

  1. Assumes spherical clusters: K-Means assumes convex clusters and performs poorly on non-convex ones
  2. K must be specified in advance: an unsuitable K leads to poor results
  3. Sensitive to initialization: different initializations can yield different results
  4. Sensitive to outliers: a few outliers can significantly shift the cluster centers
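
Limitation 1 is easy to see on the classic two-moons dataset (a sketch; the sample size, noise level, and seeds are arbitrary choices): with two non-convex clusters, K-Means draws a roughly linear boundary and mixes the moons, which shows up as a low adjusted Rand index against the true labels.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-circles: non-convex clusters
X_moons, y_true = make_moons(n_samples=200, noise=0.05, random_state=42)

km_moons = KMeans(n_clusters=2, n_init=10, random_state=42)
y_pred = km_moons.fit_predict(X_moons)

# Agreement with the true moons is far from perfect (1.0 would be a
# perfect recovery of the two moons)
print(round(adjusted_rand_score(y_true, y_pred), 2))
```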

Alternative Algorithms

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering

# DBSCAN: density-based clustering that can find arbitrarily shaped clusters
dbscan = DBSCAN(eps=0.5, min_samples=5)
labels_db = dbscan.fit_predict(X_scaled)

# Hierarchical clustering
agg = AgglomerativeClustering(n_clusters=4)
labels_agg = agg.fit_predict(X_scaled)
```

Application: Image Segmentation

```python
from sklearn.datasets import load_sample_image

# Load the image
image = load_sample_image('china.jpg')
h, w, c = image.shape
image_2d = image.reshape(-1, c).astype(float)

# K-Means color quantization
kmeans_img = SklearnKMeans(n_clusters=16, random_state=42)
labels_img = kmeans_img.fit_predict(image_2d)
quantized = kmeans_img.cluster_centers_[labels_img].reshape(h, w, c).astype(np.uint8)
```
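
As a quick sanity check of the quantization idea, a tiny synthetic image (made up for illustration) can be quantized the same way; the result contains at most `n_clusters` distinct colors:

```python
import numpy as np
from sklearn.cluster import KMeans

# Tiny random 8x8 RGB "image" for a quick check of color quantization
rng = np.random.default_rng(0)
img_demo = rng.integers(0, 256, size=(8, 8, 3)).astype(float)
pixels = img_demo.reshape(-1, 3)

km_q = KMeans(n_clusters=4, n_init=10, random_state=0).fit(pixels)
quantized_demo = km_q.cluster_centers_[km_q.labels_].reshape(img_demo.shape)

# Every pixel is replaced by its cluster center, so the quantized image
# uses at most n_clusters distinct colors
n_colors = len(np.unique(quantized_demo.reshape(-1, 3), axis=0))
print(n_colors)  # at most 4
```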

Summary

K-Means is the most widely used clustering algorithm: simple and efficient. K-Means++ improves the initialization, and the elbow method and silhouette coefficient help choose K. For non-convex clusters, density-based algorithms such as DBSCAN are a better fit. In practice, data preprocessing (especially standardization) has a large impact on K-Means results.
