Get point IDs after clustering, using python [duplicate]


【Title】: Get point IDs after clustering, using python [duplicate] 【Posted】: 2012-01-26 07:38:45 【Question】:

Possible duplicate: Python k-means algorithm

I want to cluster 10000 indexed points based on their feature vectors and get their IDs after clustering, i.e. cluster1: [p1, p3, p100, ...], cluster2: [...], ...

Is there a way to do this in Python? Thanks~

P.S. The indexed points are stored in a 10000*10 matrix, where each row represents a feature vector.

【Comments】:

Could you add an example for those of us who don't know clustering but know a little Python :-)

Er -- have you seen this?

@Abhijit: see K-means algorithm. Basically, you have a bunch of points in n-dimensional space and you want to automatically find "clusters" of points (i.e. group them by some similarity). K-means picks K random starting points (seeds), then partitions all the points according to which seed each point is closest to. It then computes the centroid of each cluster and iterates until the centroids stop moving.

@Cameron My question isn't quite the same as the "Python k-means algorithm" post. I need to get the point IDs after clustering, so I'd like some advice on how to associate point IDs with their features. Thanks~

Thanks everyone. My solution is: 1. Run k-means and get the cluster centers. 2. Recompute each point's distance to the cluster centers, so that I can record the IDs associated with each cluster.
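A minimal sketch of the two-step solution from the last comment, assuming SciPy and NumPy are available; the features array here is a stand-in for the real 10000*10 matrix, and 5 clusters is an arbitrary choice:

import numpy as np
from scipy.cluster.vq import kmeans, vq

features = np.random.rand(10000, 10)  # stand-in for the real feature matrix

# Step 1: run k-means and get the cluster centers.
centroids, _ = kmeans(features, 5)

# Step 2: assign each point to its nearest center; vq returns one label per row.
labels, _ = vq(features, centroids)

# Group the row indices (the point IDs) by cluster label.
clusters = {}
for point_id, label in enumerate(labels):
    clusters.setdefault(int(label), []).append(point_id)
# clusters now looks like {0: [0, 3, 99, ...], 1: [...], ...}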

【Answer 1】:

Use some clustering algorithm - I've included an implementation of the K-means algorithm that @Cameron linked to in his second comment, but you may want to refer to the link in his first comment. I'm not sure what you mean by getting their IDs, could you elaborate?

from math import sqrt
from random import shuffle  # needed below to pick random initial centers

def k_means(data_pts, k=None):
    """ Return k (x,y) pairs where:
            k = number of clusters
        and each
            (x,y) pair = centroid of cluster

        data_pts should be a list of (x,y) tuples, e.g.,
            data_pts=[ (0,0), (0,5), (1,3) ]
    """

    """ Helper functions """
    def lists_are_same(la, lb): # see if two collections hold the same elements
        return set(la) == set(lb)
    def distance(a, b): # distance between (x,y) points a and b
        return sqrt(abs(a[0]-b[0])**2 + abs(a[1]-b[1])**2)
    def average(a): # return the average of a one-dimensional list (e.g., [1, 2, 3])
        return sum(a)/float(len(a))

    """ Set up some initial values """
    if k is None: # if the user didn't supply a number of means to look for, try to estimate how many there are
        n = len(data_pts)# number of points in the dataset
        k = int(sqrt(n/2))  # number of clusters - see
                        #   http://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set#Rule_of_thumb
    if k < 1: # make sure there's at least one cluster
        k = 1



    """ Randomly generate k clusters and determine the cluster centers,
        or directly generate k random points as cluster centers. """

    init_clusters = data_pts[:]         # put all of the data points into clusters
    shuffle(init_clusters)          # put the data points in random order
    init_clusters = init_clusters[0:k]  # only keep the first k random clusters

    old_clusters, new_clusters = {}, {}  # map each cluster center to its list of points
    for item in init_clusters:
        old_clusters[item] = [] # every cluster has a list of points associated with it; initially, it's empty

    while 1: # just keep going forever, until our break condition is met
        tmp = {}
        for center in old_clusters: # create an editable version of the old_clusters dictionary
            tmp[center] = []

        """ Associate each point with the closest cluster center. """
        for point in data_pts: # for each (x,y) data point
            min_clust = None
            min_dist = float('inf') # start at infinity so any real distance is smaller
            for pc in tmp: # for every possible closest cluster
                pc_dist = distance(point, pc)
                if pc_dist < min_dist: # if this cluster is the closest, have it be the closest (duh)
                    min_dist = pc_dist
                    min_clust = pc
            tmp[min_clust].append(point) # add each point to its closest cluster's list of associated points

        """ Recompute the new cluster centers. """
        for center in tmp:
            associated = tmp[center]
            xs = [pt[0] for pt in associated] # build up a list of x's
            ys = [pt[1] for pt in associated] # build up a list of y's
            x = average(xs) # x coordinate of new cluster
            y = average(ys) # y coordinate of new cluster
            new_clusters[(x,y)] = associated # these are the points the center was built off of, they're *probably* still associated

        if lists_are_same(old_clusters.keys(), new_clusters.keys()): # if we've reached equilibrium, return the centers
            return list(old_clusters.keys())
        else: # otherwise, go another round: let old_clusters = new_clusters, and clear new_clusters
            old_clusters = new_clusters
            new_clusters = {}
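
A hypothetical usage sketch (not part of the original answer): run k_means on a few 2-D points, then map each point's index back to its nearest returned center to recover the per-cluster point IDs the question asks for:

points = [(0, 0), (0, 5), (1, 3), (9, 9), (10, 8)]
centers = k_means(points, k=2)

clusters = {c: [] for c in centers}  # center -> list of point IDs
for point_id, p in enumerate(points):
    # re-assign each point to its nearest center, recording its row index
    nearest = min(centers, key=lambda c: (p[0]-c[0])**2 + (p[1]-c[1])**2)
    clusters[nearest].append(point_id)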

【Discussion】:

I store the points in a 10000*10 matrix, where 10000 is the number of points and 10 is the number of features. First, we can use your code or the implementation in the SciPy package to do the k-means clustering; then I'd like to know, for each cluster, which points are associated with it. Also, I want to know not only the points' features, but also which rows the points came from.

OK, I'll have to recompute the distances from all points to the cluster centers after running k-means, and record the IDs that way.

I think you could also store the "features" as extra data in the point tuples. I.e., an example data point tuple would be (x_cord, y_cord, feature1, feature2, feature3, feature4, feature5, feature6, feature7, feature8, feature9, feature10). The only requirement for a point tuple is that the x and y coordinates are at indices 0 and 1, respectively.
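
A tiny sketch of the tuple layout suggested in the last comment (the names are illustrative): keep the coordinates at indices 0 and 1 so distance() still works, and carry the row ID (or any extra features) after them, where it rides along untouched:

data = [(0.1, 0.2), (3.4, 5.6)]  # hypothetical 2-D rows
tagged = [(row[0], row[1], row_id) for row_id, row in enumerate(data)]
# tagged == [(0.1, 0.2, 0), (3.4, 5.6, 1)]; the trailing row_id survives
# clustering because distance() only ever reads indices 0 and 1.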
