Clustering elements in a list based on a metric
【Title】: clustering element in list based on a metric 【Posted】: 2021-03-08 05:56:02 【Question】: I have a list of dictionaries, each holding a keyword and its vector distance, and I am trying to apply a clustering technique to group them.
from statistics import mean

# data = [{"key": "str1", "weight": <float value>}, ...]
# distances = [item['weight'] for item in data]
distances = [0.004906579754566209, 0.008361678408906337, 0.010228429212122636, 0.013671005756098031, 0.013671005756098031, 0.013713535105272179]
mean_distances_differences = mean([j-i for i, j in zip(distances[:-1], distances[1:])])
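I computed the mean of the differences between consecutive elements of the list. Whenever the gap between two consecutive elements is smaller than that mean, I want them in the same cluster, so the expected result is: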
[[0.004906579754566209], [0.008361678408906337], [0.010228429212122636], [0.013671005756098031, 0.013671005756098031, 0.013713535105272179]]
I don't think I can use knn here, because I don't know in advance how many clusters there will be. So I tried this:
distances = [item['weight'] for item in data]
mean_distances_differences = mean([j-i for i, j in zip(distances[:-1], distances[1:])])
distances_new = distances
required_list = []
while distances_new:
    temp = []
    if len(distances_new) == 1:
        temp = distances_new
        required_list.append(temp)
        break
    else:
        for i,j in zip(distances_new[:-1], distances_new[1:]):
            if j-1 < mean_distances_differences:
                temp.append(i)
            else:
                break
        distances_new = [_i for _i in distances_new if _i not in temp]
        required_list.append(temp)
But the result I get is:
[[0.004906579754566209, 0.008361678408906337, 0.010228429212122636, 0.013671005756098031], [0.013713535105272179]]
Is there a way to do this?
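For reference, the rule described in the question (start a new group whenever the gap to the previous element exceeds the mean gap) can be sketched in plain Python. This is only an illustration, not the asker's code: the helper name `group_by_gap` is hypothetical, the input is assumed to be sorted in ascending order, and a gap exactly equal to the mean is assumed to stay in the current group.

```python
from statistics import mean


def group_by_gap(values):
    """Group ascending values; open a new group when a gap exceeds the mean gap."""
    if not values:
        return []
    gaps = [b - a for a, b in zip(values, values[1:])]
    # A single value has no gaps, so any threshold works; 0.0 is a placeholder.
    threshold = mean(gaps) if gaps else 0.0
    groups = [[values[0]]]
    for prev, cur in zip(values, values[1:]):
        if cur - prev > threshold:
            groups.append([cur])      # gap too large: open a new group
        else:
            groups[-1].append(cur)    # gap small enough: extend the current group
    return groups


distances = [0.004906579754566209, 0.008361678408906337, 0.010228429212122636,
             0.013671005756098031, 0.013671005756098031, 0.013713535105272179]
print(group_by_gap(distances))
```

On the sample data this yields the four groups listed as the expected result. Unlike the attempt in the question, it compares each element only with its immediate predecessor and never removes items from the list while scanning it.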
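【Answer 1】: You can use diff to compute the gaps between consecutive distances. I took the absolute value because I wasn't sure whether the distances would be sorted: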
import numpy as np
distance_diff = abs(np.diff(distances))
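If you take the cumsum of the boolean test "gap is greater than the threshold", consecutive elements whose gaps are at or below the threshold end up with the same group number:
np.cumsum(distance_diff > abs(np.mean(distance_diff)))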
array([1, 2, 3, 3, 3])
So all that remains is to prepend a starting group 0:
np.hstack([0,np.cumsum(distance_diff > abs(np.mean(distance_diff)))])
array([0, 1, 2, 3, 3, 3])
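The labels above say which group each element belongs to; the question asked for nested lists instead. A minimal sketch of one way to get there, assuming the same mean-gap threshold, is to split the array at the positions where a gap exceeds the threshold. The variable names `gaps`, `boundaries`, and `groups` are illustrative and not part of the original answer.

```python
import numpy as np

distances = np.array([0.004906579754566209, 0.008361678408906337,
                      0.010228429212122636, 0.013671005756098031,
                      0.013671005756098031, 0.013713535105272179])

# Absolute gaps between neighbouring elements.
gaps = np.abs(np.diff(distances))

# Gap i sits between elements i and i+1, so a large gap i means
# a new group starts at index i + 1.
boundaries = np.flatnonzero(gaps > gaps.mean()) + 1

# np.split cuts the array at each boundary index.
groups = [chunk.tolist() for chunk in np.split(distances, boundaries)]
print(groups)
```

For the sample data `boundaries` is `[1, 2, 3]`, so the split produces three single-element groups followed by one group holding the last three values.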