Why does KMeans' silhouette_score change after restarting the runtime in Google Colab?

Posted 2023-03-12

Asked: 2021-07-06 04:10:21

Question:

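I am trying to get reproducible results from sklearn's KMeans running in a Google Colab notebook. The KMeans algorithm is fitted on an array produced by principal component analysis (PCA). Every time I restart the notebook's runtime and then fit, predict, and compute the silhouette_score for the K-means result, the silhouette_score changes!

Here is the code I use to fit and predict with KMeans and compute the silhouette score: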

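from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# pca_mat_products: PCA output used for clustering; mp_matrix: one-hot matrix used for scoring
for n_clusters in range(3, 9):
    kmeans = KMeans(init='k-means++', n_clusters=n_clusters, n_init=25, random_state=0)
    kmeans.fit(pca_mat_products)
    clusters = kmeans.predict(pca_mat_products)
    # random_state only affects silhouette_score when sample_size is set
    silhouette_avg = silhouette_score(mp_matrix, clusters, random_state=0)
    print("For n_clusters =", n_clusters, "The average silhouette_score is :", silhouette_avg)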

Here is an example of the silhouette scores produced:

For n_clusters = 3 The average silhouette_score is : 0.08689747798228342
For n_clusters = 4 The average silhouette_score is : 0.11513524544540599
For n_clusters = 5 The average silhouette_score is : 0.13225896257848024
For n_clusters = 6 The average silhouette_score is : 0.13390795741576195
For n_clusters = 7 The average silhouette_score is : 0.11262045164741093
For n_clusters = 8 The average silhouette_score is : 0.12179451798486395

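When I restart the notebook's runtime, keeping everything in the notebook the same (including random_state=0), and run the cells from the top, I get new silhouette scores after every restart.

Here are the silhouette scores produced by the same code on a different run: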

For n_clusters = 3 The average silhouette_score is : 0.09181951382862036
For n_clusters = 4 The average silhouette_score is : 0.11539863985647045
For n_clusters = 5 The average silhouette_score is : 0.13363229313208771
For n_clusters = 6 The average silhouette_score is : 0.13428788881085452
For n_clusters = 7 The average silhouette_score is : 0.13187306014661757
For n_clusters = 8 The average silhouette_score is : 0.13252806332855294

On further runs, the silhouette scores keep changing.

mp_matrix is a one-hot encoded array that looks like this:

array([[0, 0, 0, ..., 0, 1, 0],
       [0, 0, 0, ..., 0, 1, 0],
       [0, 0, 0, ..., 0, 1, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

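Is it normal for the silhouette scores to change after restarting the runtime in Google Colab? Is there a way to get reproducible silhouette scores?

I have searched online and elsewhere and have not found this issue discussed.

Thanks for your help!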

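Comments:

What is the value of init?

init is currently 'k-means++'; I have added it to the code.

What is mp_matrix here?

Are you sure none of your other data is changing? On this site (and elsewhere), when asking for help it is important to try to provide an MWE so that others can attempt to reproduce your problem. Sometimes, while isolating the problem to build the MWE, you may even find the error yourself and no longer need help.

I could not reproduce your problem in Colab, either with a dataset created with make_blobs() or with another fake dataset built from the mp_matrix data. After restarting the runtime I get the same average silhouette scores.

Consider @dpkandy's answer. There is no randomness in computing the silhouette score unless you use a sample/subset of the data.

Answer 1: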

Based on your code, it looks like you are clustering on the output of PCA:

  kmeans.fit(pca_mat_products)
  clusters = kmeans.predict(pca_mat_products)

If you need reproducible results from PCA, set random_state there as well.

Here is the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
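As a rough illustration of the suggestion above, the sketch below seeds both PCA and KMeans and checks that two runs agree. The data is a hypothetical random binary matrix standing in for the asker's one-hot array, and `svd_solver="randomized"` is an assumption made here so that `random_state` actually matters; the original post does not say which solver was in use. The helper name `run_pipeline` is also made up for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# Hypothetical stand-in for the asker's one-hot matrix (not the real data).
rng = np.random.default_rng(42)
fake_matrix = (rng.random((600, 40)) < 0.1).astype(int)


def run_pipeline(data, seed):
    """Reduce with a seeded randomized PCA, cluster, and score."""
    # Assumption: a randomized solver, where PCA draws random numbers.
    pca = PCA(n_components=10, svd_solver="randomized", random_state=seed)
    reduced = pca.fit_transform(data)
    kmeans = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=seed)
    labels = kmeans.fit_predict(reduced)
    # Score against the original matrix, as in the question's code.
    return reduced, silhouette_score(data, labels)


reduced_a, score_a = run_pipeline(fake_matrix, seed=0)
reduced_b, score_b = run_pipeline(fake_matrix, seed=0)
print(np.allclose(reduced_a, reduced_b), abs(score_a - score_b) < 1e-12)
```

With the same seed passed to both estimators, the PCA output and the resulting score should match between the two calls. Leaving `random_state` unset on PCA lets a randomized solver draw fresh random numbers on each fit, which is the kind of variation described in the question.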

Comments:

Exactly. The cause was randomness in the PCA output that feeds the clustering. I set random_state in PCA as well, and that solved the problem. Thank you very much!
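One comment on the question pointed out that silhouette_score involves no randomness unless it is computed on a subsample. The sketch below illustrates that point on a make_blobs() dataset (the kind of toy data the commenter mentioned); the dataset sizes and the use of the generated labels as cluster assignments are assumptions made for the example.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data; the generated labels stand in for cluster assignments.
X, labels = make_blobs(n_samples=300, centers=3, random_state=0)

# Full-data score: no sampling is involved, so repeated calls agree.
full_a = silhouette_score(X, labels)
full_b = silhouette_score(X, labels)

# Subsampled score: random_state now decides which rows are drawn.
sub_a = silhouette_score(X, labels, sample_size=100, random_state=0)
sub_b = silhouette_score(X, labels, sample_size=100, random_state=0)

print(abs(full_a - full_b) < 1e-12, abs(sub_a - sub_b) < 1e-12)
```

In the question's code sample_size is not set, so the random_state argument passed to silhouette_score has no effect there, which is consistent with the variation coming from an earlier step.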
