How do I interpret the Silhouette Coefficient from K-Means clustering?

【Title】: How do I Interpret Silhouette Coefficient from K-Means Clustering? 【Posted】: 2017-11-20 12:52:19 【Question】:

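I am practicing K-Means clustering with the sklearn package. I am using a sample shopping dataset that contains the amount each customer spent in each item category (such as food, fashion, and digital).

There are 42 features, that is, the 42 item categories I feed into K-Means. When I checked the Silhouette Coefficient for k between 2 and 50, the results looked like this:

Results: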

For n_clusters=2, The Silhouette Coefficient is 0.296883351294 
For n_clusters=3, The Silhouette Coefficient is 0.429716008727
For n_clusters=4, The Silhouette Coefficient is 0.5379833453
For n_clusters=5, The Silhouette Coefficient is 0.640200087198
For n_clusters=6, The Silhouette Coefficient is 0.720988889121
For n_clusters=7, The Silhouette Coefficient is 0.754509135746
For n_clusters=8, The Silhouette Coefficient is 0.824498184042
For n_clusters=9, The Silhouette Coefficient is 0.859505132529
For n_clusters=10, The Silhouette Coefficient is 0.886719390512
For n_clusters=11, The Silhouette Coefficient is 0.909094073152
For n_clusters=12, The Silhouette Coefficient is 0.924484657787
For n_clusters=13, The Silhouette Coefficient is 0.935920328988
For n_clusters=14, The Silhouette Coefficient is 0.941202266924
For n_clusters=15, The Silhouette Coefficient is 0.944696312832
For n_clusters=16, The Silhouette Coefficient is 0.94973283735
For n_clusters=17, The Silhouette Coefficient is 0.953130541493
For n_clusters=18, The Silhouette Coefficient is 0.956455183621
For n_clusters=19, The Silhouette Coefficient is 0.959253033224
For n_clusters=20, The Silhouette Coefficient is 0.962360042108
For n_clusters=21, The Silhouette Coefficient is 0.964250208432
For n_clusters=22, The Silhouette Coefficient is 0.967326417612
For n_clusters=23, The Silhouette Coefficient is 0.969331109452
For n_clusters=24, The Silhouette Coefficient is 0.971127562002
For n_clusters=25, The Silhouette Coefficient is 0.972261973972
For n_clusters=26, The Silhouette Coefficient is 0.9734445716
For n_clusters=27, The Silhouette Coefficient is 0.974238560202
For n_clusters=28, The Silhouette Coefficient is 0.97488260729
For n_clusters=29, The Silhouette Coefficient is 0.97531193231
For n_clusters=30, The Silhouette Coefficient is 0.974524792419
For n_clusters=31, The Silhouette Coefficient is 0.975612314038
For n_clusters=32, The Silhouette Coefficient is 0.975737449165
For n_clusters=33, The Silhouette Coefficient is 0.976396323376
For n_clusters=34, The Silhouette Coefficient is 0.977655049988
For n_clusters=35, The Silhouette Coefficient is 0.977653124893
For n_clusters=36, The Silhouette Coefficient is 0.977692656935
For n_clusters=37, The Silhouette Coefficient is 0.977631627533
For n_clusters=38, The Silhouette Coefficient is 0.978547753839
For n_clusters=39, The Silhouette Coefficient is 0.978886776953
For n_clusters=40, The Silhouette Coefficient is 0.979381767137
For n_clusters=41, The Silhouette Coefficient is 0.9796349521
For n_clusters=42, The Silhouette Coefficient is 0.979461929477
For n_clusters=43, The Silhouette Coefficient is 0.980920963377
For n_clusters=44, The Silhouette Coefficient is 0.980129624336
For n_clusters=45, The Silhouette Coefficient is 0.981374785468
For n_clusters=46, The Silhouette Coefficient is 0.980656482976
For n_clusters=47, The Silhouette Coefficient is 0.982323770297
For n_clusters=48, The Silhouette Coefficient is 0.982538183341
For n_clusters=49, The Silhouette Coefficient is 0.982842003856

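I don't know how to use this result. It looks to me as if s just keeps getting bigger as k increases. Am I doing this right? Or should I try a different cluster evaluation method?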

【Comments on the question】:

【Answer 1】:

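The silhouette of a point measures how similar that point is to its own cluster compared with the next-nearest cluster. It is a ratio built from distances to the cluster centers, normalized so that "1" is a perfect match to its cluster and "-1" is a perfect mismatch.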

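(Note: the use of cluster centers may be specific to k-means clustering. The standard definition, which sklearn's silhouette_score follows, uses mean distances to the points of each cluster rather than distances to the centers.)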
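To make the per-point definition concrete, here is a minimal sketch in plain Python of the standard silhouette, s = (b - a) / max(a, b), where a is the mean distance from a point to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster. The function name `silhouette_values` and the four toy points are made up for illustration, and Euclidean distance is assumed.

```python
import math


def silhouette_values(points, labels):
    """Per-point silhouette s = (b - a) / max(a, b), Euclidean distance.

    Hypothetical helper for illustration; expects at least two clusters.
    """
    # Group point indices by cluster label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    values = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            # Convention: a point that is alone in its cluster gets 0.
            values.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


# Two tight pairs of points that are far apart from each other.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]

good = silhouette_values(points, [0, 0, 1, 1])  # pairs kept together
bad = silhouette_values(points, [0, 1, 0, 1])   # pairs split across clusters

print("good labeling:", [round(v, 3) for v in good])
print("bad labeling: ", [round(v, 3) for v in bad])
```

With the sensible labeling every point scores close to 1, and with the pairs split across clusters every point scores below 0, which matches the "perfect match" versus "mismatch" reading above.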

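The silhouette of a cluster is the average silhouette of all its members. In practice, a larger number means the cluster is better "separated" from the other clusters.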

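I think of the silhouette as a measure of the density of points along the cluster boundaries. When the silhouette is high, there are few boundary points. That is what you want: well-separated clusters.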
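As a rough illustration of what "well-separated" looks like in the score, the sketch below sweeps k on synthetic data using sklearn's `KMeans` and `silhouette_score`. The data is an assumption for demonstration only (four tight blobs at hand-picked centers, not the shopping dataset from the question), and the range of k is arbitrary.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data (an assumption, not the asker's dataset):
# four tight, well-separated blobs in 2-D.
centers = [[0, 0], [10, 0], [0, 10], [10, 10]]
X, _ = make_blobs(n_samples=400, centers=centers, cluster_std=0.5, random_state=0)

scores = {}
for k in range(2, 9):
    # Fit K-Means for this k and score the resulting labeling.
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(f"For n_clusters={k}, The Silhouette Coefficient is {scores[k]:.3f}")

# The k with the highest mean silhouette over the sweep.
best_k = max(scores, key=scores.get)
print("best k:", best_k)
```

On this toy data the score should peak at the number of blobs and drop once k starts splitting them. A score that keeps rising with k, as in the question, behaves differently, which is a reason to look at the clusters themselves rather than just pick the largest value.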

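With k-means, small "outlier" clusters will often have large silhouettes, while larger clusters often have dense boundaries. It would be interesting to look at the cluster sizes alongside the silhouettes.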
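One way to look at sizes and silhouettes together is to average sklearn's `silhouette_samples` within each cluster. The sketch below does this on made-up data: two large overlapping groups plus a tiny, tight group far away. The data, the variable names, and the choice of k=3 are all assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

# Synthetic stand-in data (an assumption): two large, overlapping groups
# and one tiny, tight group far away from both.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=1.5, size=(300, 2)),
    rng.normal(loc=[4.0, 0.0], scale=1.5, size=(300, 2)),
    rng.normal(loc=[30.0, 30.0], scale=0.3, size=(10, 2)),
])

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
sample_sil = silhouette_samples(X, labels)  # one silhouette value per point

# Map each cluster label to (size, mean silhouette of its members).
summary = {}
for cluster in np.unique(labels):
    mask = labels == cluster
    summary[int(cluster)] = (int(mask.sum()), float(sample_sil[mask].mean()))

for cluster, (size, mean_sil) in sorted(summary.items()):
    print(f"cluster {cluster}: size={size:4d}, mean silhouette={mean_sil:.3f}")
```

Here the tiny far-away group should get a mean silhouette near 1, while the two large overlapping groups score much lower, so a high overall average can be driven by small clusters that are not necessarily the useful ones.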

【Comments】:

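Thanks. So for the results I got, 49 clusters is better than 2 clusters. That means that with 49 clusters, each cluster is more separated from the others. Am I right?

@2D_ . . . Well, you have to evaluate the clusters in other ways too. If every point had its own cluster, then I think the silhouettes would look really good (I'm not 100% sure what happens in the degenerate case). More importantly: are the clusters useful?

You're right. I think you may be right. I certainly don't want too many clusters. I'll study the clusters and work out the most meaningful number. Thanks!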
