How to perform PAM clustering in Python given a dissimilarity matrix?

Posted 2023-03-12


【Posted】: 2021-03-17 18:05:52 【Question】:

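I have a data frame df with the columns id, text, lang, stemmed and tf_idfresult. df has 24 rows. From the tf-idf results I computed a dissimilarity matrix (distance matrix), which tells how different any two rows of the data frame are.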

An example of what the data frame looks like:

   id     text                lang                    stemmed                  tf_idfresult
0 234  Hi this                  en [hi, this]                   [0.0, 0.2]
1 232  elephants ruined again   en [elephants, ruined, again]   [0.1, 0.0, 0.0]
2 441  there are palm trees     en [there, are, palm, trees]    [0.2, 0.54, 0.0, 0.823]
3 235  so much to do            en [so, much, to, do]           [0.1, 0.1, 0.0, 0.0]

I obtained the dissimilarity matrix dis with the help of the cosine_similarity function. It looks like

[[0.0, 0.3, 0.1, 1, 1...]
[0.1, ...]
.
.

with 24 rows and 24 columns.
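For reference, here is a minimal sketch of how such a matrix can be built. It assumes every document is represented by a TF-IDF vector over the same vocabulary (the toy matrix `tfidf` and the helper name `cosine_dissimilarity` are made up for illustration), and it converts cosine similarity into a dissimilarity as 1 - similarity. With scikit-learn, `1 - cosine_similarity(tfidf)` should give the same values.

```python
import numpy as np

# Made-up TF-IDF matrix: one row per document, one column per vocabulary term.
tfidf = np.array([
    [0.0, 0.2, 0.0, 0.0],
    [0.1, 0.0, 0.0, 0.0],
    [0.2, 0.54, 0.0, 0.823],
    [0.1, 0.1, 0.0, 0.0],
])

def cosine_dissimilarity(x):
    """Return the square matrix of 1 - cosine similarity between the rows of x."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    norms[norms == 0] = 1.0        # avoid division by zero for all-zero rows
    unit = x / norms               # rows scaled to unit length
    dis = 1.0 - unit @ unit.T
    np.fill_diagonal(dis, 0.0)     # remove rounding noise on the diagonal
    return np.clip(dis, 0.0, 2.0)

dis = cosine_dissimilarity(tfidf)
print(dis.shape)
```

The result is symmetric with zeros on the diagonal, which is the form that k-medoids implementations working on precomputed distances generally expect.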

I used the silhouette method and found the optimal value of k to be 3. I tried the following:

pam = kmedoids(dis, initialmedoids)

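but I don't know how to find the initial medoids. The expected output is the data frame split into three clusters. I don't need any particular output format.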
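On the initial medoids: they are simply k row indices into `dis`. Picking k distinct indices at random is one option. Another is a greedy initialisation in the spirit of the BUILD phase of PAM, sketched below with NumPy only. The helper name `build_initial_medoids` and the toy data are assumptions made for illustration, not part of any library.

```python
import numpy as np

def build_initial_medoids(dis, k):
    """Greedy BUILD-style initialisation on a square dissimilarity matrix."""
    dis = np.asarray(dis, dtype=float)
    # First medoid: the point with the smallest total dissimilarity to all others.
    medoids = [int(np.argmin(dis.sum(axis=1)))]
    nearest = dis[:, medoids[0]].copy()   # distance of each point to its closest medoid
    while len(medoids) < k:
        # Gain of each candidate: total reduction of the nearest-medoid distances.
        gains = np.maximum(nearest[:, None] - dis, 0.0).sum(axis=0)
        gains[medoids] = -np.inf          # never pick an existing medoid again
        best = int(np.argmax(gains))
        medoids.append(best)
        nearest = np.minimum(nearest, dis[:, best])
    return medoids

# Toy data: six points on a line forming two obvious groups.
points = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
dis = np.abs(points[:, None] - points[None, :])

initialmedoids = build_initial_medoids(dis, 2)
print(initialmedoids)

# Random alternative: k distinct row indices, seeded for reproducibility.
random_medoids = np.random.default_rng(0).choice(len(dis), size=2, replace=False).tolist()
```

Either list can then be used as the starting indices for a k-medoids run. A good start mainly reduces the number of swap iterations needed.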

【Comments】:

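Please provide a complete, copy-and-pasteable sample pandas dataset together with your expected output. See here for how to ask pandas questions: ***.com/questions/20109391/…

@DavidErickson OK, I will edit the question.

【Answer 1】: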

I have also been experimenting with k-medoids and got lost along the way! I have read about a few tools for this. Two of them are:

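sklearn_extra.cluster.KMedoids. Set the keyword arguments method='pam' and metric='precomputed'. After fitting, kmedoids.labels_ tells you which cluster each sample was assigned to. You can use this tutorial as a starting point and write a program that separates the samples by cluster.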

pyclustering.cluster.kmedoids. This is the one you are using, I guess? Based on your code, you should:

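from pyclustering.cluster.kmedoids import kmedoids

# initialmedoids: list of k row indices into dis, e.g. [0, 1, 2] for k = 3 (an arbitrary starting guess)
# data_type='distance_matrix' tells pyclustering that dis holds distances, not raw points
pam = kmedoids(dis, initialmedoids, data_type='distance_matrix')

pam.process()  # run the clustering

clusters = pam.get_clusters()  # one list of row indices per cluster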
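Since `get_clusters()` returns one list of row indices per cluster, getting back to the data frame only requires turning those lists into one label per row. A small sketch in plain Python follows. The helper name `clusters_to_labels` and the example `clusters` value are made up for illustration.

```python
def clusters_to_labels(clusters, n_rows):
    """Turn a list of index lists into one cluster label per row."""
    labels = [-1] * n_rows                 # -1 marks rows that appear in no cluster
    for label, members in enumerate(clusters):
        for row in members:
            labels[row] = label
    return labels

# Made-up result for four rows and three clusters.
clusters = [[0, 3], [1], [2]]
labels = clusters_to_labels(clusters, 4)
print(labels)  # [0, 1, 2, 0]
```

With a pandas data frame whose rows are in the same order as `dis`, `df["cluster"] = labels` attaches the assignment, and `df.groupby("cluster")` then gives the three groups.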
