Clustering positions on a map where each cluster has an equal number of points


【Title】Clustering positions on a map where each cluster has an equal number of points 【Posted】2020-08-14 13:25:12 【Question】:

I have certain points on a map and need to group them into clusters of equal size, where the last cluster may hold count % n points. I read these answers 1, 2 and 3, but they didn't help. I tried different approaches, but none of them worked. In the code below I specified n_clusters=4, because that is the best number of clusters with which I can sort them and take the n best points from the sorted points; after that I iterate over all the points. For example, I need to cluster the 32 points shown in the figure into 4 clusters, each holding 8 points.

from pandas import DataFrame
from sklearn.cluster import KMeans

n = 8  # desired number of points per cluster

# position: list of (x, y) coordinates of the points on the map
dfcluster = DataFrame(position, columns=['x', 'y'])
kmeans = KMeans(n_clusters=4).fit(dfcluster)
centroids = kmeans.cluster_centers_

# plt.scatter(dfcluster['x'], dfcluster['y'], c=kmeans.labels_.astype(float), s=50, alpha=0.5)
# plt.scatter(centroids[:, 0], centroids[:, 1], c='red', s=50)
# plt.show()
dfcluster['cluster'] = kmeans.labels_
dfcluster = dfcluster.drop_duplicates(['x', 'y'], keep='last')
dfcluster = dfcluster.sort_values(['cluster', 'x', 'y'], ascending=True)

# Earlier attempt: take at most n/2 - 1 points from each cluster
# d = pd.DataFrame()
# for x in range(4):
#     m = dfcluster[dfcluster.cluster == x]
#     if len(m) > int(n / 2) - 1:
#         m = m.head(int(n / 2) - 1)
#     d = d.append(m, ignore_index=True)
# if len(d) >= n:
#     dfcluster = d
# dfcluster.groupby('cluster').nth(n)

dfcluster = dfcluster.head(n)
if len(dfcluster) < n:
    change_df()  # change_df() is defined elsewhere in my code

【Comments】:

How will len(dfcluster) < n change? Is this inside a loop?

The parameter n_clusters=4 also controls the very aspect you are talking about. I'm not sure you can dictate such a detail (this many groups of this size) to the clustering; part of the idea is that the machine decides whether that configuration makes sense, and it won't do it if it doesn't, as long as your data is sufficient and suited to what you are trying to do. Seek a second opinion.

Could you provide a sample input and the expected output, or an example that gives an idea of what you want to achieve? What is your clustering criterion? Here you use KMeans, but we don't know what features your points have. You also specified 4 clusters... so should it be 4 clusters?

@dzang Thanks for the reply. The points on the map change, and I need to make sure the points can be clustered. I specified n_clusters=4 because that is the best number of clusters with which I can sort them and take the n best points from the sorted points.

What do you mean by choosing the best points? I'd like to point out that you are not getting answers because your question is not clearly formulated. Giving an example with some test data, and what you expect from it, would help in understanding what you want to achieve. If you want to split the points spatially based on position, what does "clusters of the same size" mean - the same number of points or the same spatial extent? I suggest you spend some time rephrasing the question; that will help you more than a bounty.

【Answer 1】:

I found this module, which uses Same Size Constrained K-Means Heuristics (heuristic methods to reach same-size clustering) and produces groups of equal size.

I started with pip install size-constrained-clustering or pip install git+https://github.com/jingw2/size_constrained_clustering.git, and you can use either the min cost flow approach or the heuristics:

import numpy as np
from size_constrained_clustering import equal

n_samples = 2000
n_clusters = 3
X = np.random.rand(n_samples, 2)

# min cost flow approach
model = equal.SameSizeKMeansMinCostFlow(n_clusters)
# heuristics approach:
# model = equal.SameSizeKMeansHeuristics(n_clusters)
model.fit(X)
centers = model.cluster_centers_
labels = model.labels_
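
To check that the groups really come out equal, you can count the labels afterwards (a small sketch, assuming the model fitted above):

import numpy as np

# For 2000 points and 3 clusters the sizes should be as even as
# possible, e.g. 667/667/666.
sizes = np.bincount(labels, minlength=n_clusters)
print(sizes)
assert sizes.max() - sizes.min() <= 1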

If the size-constrained-clustering module gives you trouble, you can use the classes below instead, but you need to install k-means-constrained:

pip install k-means-constrained
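
For the case in the question (32 points into 4 clusters of 8 each) you can also call k-means-constrained directly, without the wrapper class below. A minimal sketch, assuming position is the list of 32 (x, y) coordinates from the question:

import numpy as np
from k_means_constrained import KMeansConstrained

X = np.asarray(position)  # position: the (x, y) points from the question

# Force every cluster to contain exactly 8 of the 32 points
clf = KMeansConstrained(n_clusters=4, size_min=8, size_max=8, random_state=0)
labels = clf.fit_predict(X)
centers = clf.cluster_centers_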

SameSizeKMeansMinCostFlow

from k_means_constrained import KMeansConstrained
import warnings
import base
from scipy.spatial.distance import cdist
class SameSizeKMeansMinCostFlow(base.Base):

    def __init__(self, n_clusters, max_iters=1000, distance_func=cdist, random_state=42):
        '''
        Args:
            n_clusters (int): number of clusters
            max_iters (int): maximum iterations
            distance_func (object): callable function with input (X, centers) / None, by default is l2-distance
            random_state (int): random state to initiate, by default it is 42
        '''
        super(SameSizeKMeansMinCostFlow, self).__init__(n_clusters, max_iters, distance_func)
        self.clf = None

    def fit(self, X):
        n_samples, n_features = X.shape
        minsize = n_samples // self.n_clusters
        maxsize = (n_samples + self.n_clusters - 1) // self.n_clusters

        clf = KMeansConstrained(self.n_clusters, size_min=minsize,
                                size_max=maxsize)

        if minsize != maxsize:
            warnings.warn("Cluster minimum and maximum size are {} and {}, respectively".format(minsize, maxsize))

        clf.fit(X)

        self.clf = clf
        self.cluster_centers_ = self.clf.cluster_centers_
        self.labels_ = self.clf.labels_

    def predict(self, X):
        return self.clf.predict(X)
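
The wrapper is then used the same way as the module above; a short usage sketch, assuming X is an (n_samples, 2) array:

import numpy as np

X = np.random.rand(32, 2)
model = SameSizeKMeansMinCostFlow(4)
model.fit(X)  # delegates to KMeansConstrained with size_min == size_max == 8
print(model.labels_)
print(model.cluster_centers_)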

base

#!usr/bin/python 3.7
#-*-coding:utf-8-*-

'''
@file: base.py, base for clustering algorithm
@Author: Jing Wang (jingw2@foxmail.com)
@Date: 06/07/2020
'''
from scipy.spatial.distance import cdist
import numpy as np 
import warnings
import scipy.sparse as sp

import os 
import sys 
path = os.path.dirname(os.path.abspath(__file__))
sys.path.append(path)
from k_means_constrained.sklearn_import.utils.extmath import stable_cumsum

class Base(object):

    def __init__(self, n_clusters, max_iters, distance_func=cdist):
        '''
        Base Cluster object

        Args:
            n_clusters (int): number of clusters 
            max_iters (int): maximum iterations
            distance_func (callable function): distance function callback
        '''
        assert isinstance(n_clusters, int)
        assert n_clusters >= 1
        assert isinstance(max_iters, int)
        assert max_iters >= 1
        self.n_clusters = n_clusters 
        self.max_iters = max_iters
        if distance_func is not None and not callable(distance_func):
            raise Exception("Distance function is not callable")
        self.distance_func = distance_func

    def fit(self, X):
        pass 

    def predict(self, X):
        pass 

def k_init(X, n_clusters, x_squared_norms, random_state=42, distance_func=cdist, n_local_trials=None):
    """Init n_clusters seeds according to k-means++

    Parameters
    ----------
    X : array or sparse matrix, shape (n_samples, n_features)
        The data to pick seeds for. To avoid memory copy, the input data
        should be double precision (dtype=np.float64).

    n_clusters : integer
        The number of seeds to choose

    x_squared_norms : array, shape (n_samples,)
        Squared Euclidean norm of each data point.

    random_state : int, RandomState instance
        The generator used to initialize the centers. Use an int to make the
        randomness deterministic.
        See :term:`Glossary <random_state>`.

    n_local_trials : integer, optional
        The number of seeding trials for each center (except the first),
        of which the one reducing inertia the most is greedily chosen.
        Set to None to make the number of trials depend logarithmically
        on the number of seeds (2+log(k)); this is the default.

    Notes
    -----
    Selects initial cluster centers for k-mean clustering in a smart way
    to speed up convergence. see: Arthur, D. and Vassilvitskii, S.
    "k-means++: the advantages of careful seeding". ACM-SIAM symposium
    on Discrete algorithms. 2007

    Version ported from http://www.stanford.edu/~darthur/kMeansppTest.zip,
    which is the implementation used in the aforementioned paper.
    """
    n_samples, n_features = X.shape

    centers = np.empty((n_clusters, n_features), dtype=X.dtype)

    assert x_squared_norms is not None, 'x_squared_norms None in _k_init'

    # Set the number of local seeding trials if none is given
    if n_local_trials is None:
        # This is what Arthur/Vassilvitskii tried, but did not report
        # specific results for other than mentioning in the conclusion
        # that it helped.
        n_local_trials = 2 + int(np.log(n_clusters))

    # Pick first center randomly (note: random_state must be a
    # np.random.RandomState instance here, despite the int default above)
    center_id = random_state.randint(n_samples)
    if sp.issparse(X):
        centers[0] = X[center_id].toarray()
    else:
        centers[0] = X[center_id]

    # Initialize list of closest distances and calculate current potential
    closest_dist_sq = distance_func(
        centers[0, np.newaxis], X)
    current_pot = closest_dist_sq.sum()

    # Pick the remaining n_clusters-1 points
    for c in range(1, n_clusters):
        # Choose center candidates by sampling with probability proportional
        # to the squared distance to the closest existing center
        rand_vals = random_state.random_sample(n_local_trials) * current_pot
        candidate_ids = np.searchsorted(stable_cumsum(closest_dist_sq),
                                        rand_vals)
        # XXX: numerical imprecision can result in a candidate_id out of range
        np.clip(candidate_ids, None, closest_dist_sq.size - 1,
                out=candidate_ids)

        # Compute distances to center candidates
        # distance_to_candidates = euclidean_distances(
        #     X[candidate_ids], X, Y_norm_squared=x_squared_norms, squared=True)
        distance_to_candidates = distance_func(X[candidate_ids], X)

        # update closest distances squared and potential for each candidate
        np.minimum(closest_dist_sq, distance_to_candidates,
                   out=distance_to_candidates)
        candidates_pot = distance_to_candidates.sum(axis=1)

        # Decide which candidate is the best
        best_candidate = np.argmin(candidates_pot)
        current_pot = candidates_pot[best_candidate]
        closest_dist_sq = distance_to_candidates[best_candidate]
        best_candidate = candidate_ids[best_candidate]

        # Permanently add best center candidate found in local tries
        if sp.issparse(X):
            centers[c] = X[best_candidate].toarray()
        else:
            centers[c] = X[best_candidate]

    return centers
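
For completeness, a hedged sketch of how k_init could be called (the x_squared_norms value and the RandomState instance are my assumptions based on the signature and docstring above):

import numpy as np

X = np.random.rand(100, 2)
rs = np.random.RandomState(42)  # must be a RandomState, not a plain int
centers = k_init(X, n_clusters=4,
                 x_squared_norms=(X ** 2).sum(axis=1),
                 random_state=rs)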
  

【Comments】:

Installing the package gives errors on Google Colab and on Windows 10.

It depends on old versions, so I will edit the answer.

【Answer 2】:

Clustering itself will decide how many data points each cluster should have.

If you want to split the data into 4 equally sized groups based on proximity, you should determine the 4 points that lie farthest apart, and then iteratively add the nearest neighbors to these points, provided they are not already in a cluster; a rough sketch of this idea follows below. I don't expect it to look pretty, though.
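
A minimal sketch of that heuristic (my own illustration, not part of the original answer): pick k mutually far-apart seeds greedily, then let the clusters take turns claiming their nearest unassigned point, so the group sizes differ by at most one:

import numpy as np
from scipy.spatial.distance import cdist

def equal_size_groups(X, k):
    """Greedy heuristic: k far-apart seed points, then round-robin growth."""
    X = np.asarray(X, dtype=float)
    n = len(X)

    # Greedy farthest-point seeding: start anywhere, then repeatedly take
    # the point farthest from all seeds chosen so far.
    seeds = [0]
    for _ in range(k - 1):
        dist_to_seeds = cdist(X, X[seeds]).min(axis=1)
        seeds.append(int(dist_to_seeds.argmax()))

    labels = np.full(n, -1)
    labels[seeds] = np.arange(k)

    # Round-robin growth: each cluster in turn claims the unassigned point
    # closest to any of its current members.
    while (labels == -1).any():
        for c in range(k):
            unassigned = np.where(labels == -1)[0]
            if unassigned.size == 0:
                break
            d = cdist(X[unassigned], X[labels == c]).min(axis=1)
            labels[unassigned[d.argmin()]] = c
    return labels

# Example: 32 random points into 4 groups of 8 points each
labels = equal_size_groups(np.random.rand(32, 2), 4)
print(np.bincount(labels))  # -> [8 8 8 8]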

【Comments】:

It's not as simple as you think. I use n_clusters=4 because that is the best I got, and I use k-means to help me pick the n points from the sorted clusters.

I still don't understand the problem. Could you give a visual example of what you have and what you want?

I need to group the points and I need the groups to be the same size, but I don't have a specific size, and I don't know whether these four clusters are the best. Besides, choosing a cluster head and picking its nearest neighbors is not easy. ibb.co/F6ZRdr9

What is "best"? Making every cluster a single point is certainly a trivial answer, but that is probably not what you want. Cluster from 1 to X, pick the most evenly distributed clustering, and shuffle the farthest points around to your liking? If your basic criterion is an equal number of points, I still don't see the point of clustering, because that is not what clustering does.

Please read these answers: ***.com/questions/5452576/…
