How to create balanced k-means geospatial clusters?
Posted: 2020-10-13 22:02:12

I have 9,000 US-based points (i.e., accounts) with a variety of string and numeric columns/attributes. I am trying to divide these points/accounts evenly into fair groupings that are both spatially grouped and weighted (in a gravity sense) by employee count, which is one of the columns/attributes. I used sklearn K-means clustering for the grouping, and it seems to work fine, but I noticed the groupings are not equal. Some groups have ~600 points and some have ~70. That is logical to a degree, since some areas contain more data points. The problem is that I need these groups to be more equal. Here is the code I used:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=30, max_iter=1000, init='k-means++')
# columns 1:3 hold the lat/long pair; column 3 holds the employee counts
lat_long = dftobeclustered[dftobeclustered.columns[1:3]]
_employees = dftobeclustered[dftobeclustered.columns[3]]
# weight each point by its employee count ("gravity") when fitting
weighted_kmeans_clusters = kmeans.fit(lat_long, sample_weight=_employees)
dftobeclustered['cluster_label'] = kmeans.predict(lat_long, sample_weight=_employees)
centers = kmeans.cluster_centers_
labels = dftobeclustered['cluster_label']
Is there a way to divide k-means clusters more equally? I think the core problem is that it splits low-population areas like Montana or Hawaii into their own groups, when what I really need is to merge those areas into larger groups. But I don't know.
Answer 1:

K-means isn't written to work this way. Observations are assigned to clusters based on the actual MEASURED distance from the centroid.

If you try to force the number of members in a cluster, it completely undoes that distance-measurement component, especially when you are talking geographically with lat/lon.

You may want to look at another method of subsetting your observations, or reconsider the requirement that the clusters be equal-sized.

Honestly, most of the time geographic-distance clustering correlates directly with other similarities between observations (think housing style, demographics, or income of a neighborhood, and how that maps to zip codes or localized areas). Those things don't respect our need for them to form equal-sized groups.

Clusters based on qualities other than geography are far more likely to even out in size, even when the observation counts differ noticeably; geographic clusters will always be sorted by distance... there is no way around that.

So regions dense with observations will have more members than regions with few observations, and the distance between MT and HI will always be greater than the distance between MT and NYC, so they will never be geographically clustered together by distance.

I know you want equal groupings... but is geographic grouping actually necessary? Given that MT and HI would end up together, the geographic labels would mean little. You might be better off clustering on all of the NON-geographic numeric values to create contextually similar observations.

Otherwise, you can use business rules to dissect the observations (something like: if var_x > 7 and var_y is past some cutoff, then group = 1), then use groupby() and crosstab() in pandas to build cross-tabulations and see which values would make good split points. A rough sketch of this approach is shown below.
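For illustration, a minimal sketch of that business-rule approach. The frame, the column names var_x/var_y, and the cutoffs 7 and 12 are hypothetical placeholders, not values from the question (the exact second condition was garbled in the original answer):

import pandas as pd
import numpy as np

# toy stand-in for the real accounts frame; replace with dftobeclustered
rng = np.random.default_rng(0)
df = pd.DataFrame({'var_x': rng.uniform(0, 10, 500),
                   'var_y': rng.uniform(0, 20, 500)})

# business rule: assign a group using hand-picked cutoffs
df['group'] = np.where((df['var_x'] > 7) & (df['var_y'] > 12), 1, 0)

# cross-tabulate the candidate cutoffs to see how evenly they split the data
print(pd.crosstab(df['var_x'] > 7, df['var_y'] > 12))

# inspect the resulting group sizes
print(df.groupby('group').size())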
Answer 2:

Try DBSCAN. See the sample code below.
# import necessary modules
import pandas as pd, numpy as np, matplotlib.pyplot as plt, time
from sklearn.cluster import DBSCAN
from sklearn import metrics
from geopy.distance import great_circle
from shapely.geometry import MultiPoint
# define the number of kilometers in one radian
kms_per_radian = 6371.0088
# load the data set
df = pd.read_csv('C:\\your_path\\summer-travel-gps-full.csv', encoding = "ISO-8859-1")
df.head()
# how many rows are in this data set?
len(df)
# scatterplot it to get a sense of what it looks like
df = df.sort_values(by=['lat', 'lon'])
ax = df.plot(kind='scatter', x='lon', y='lat', alpha=0.5, linewidth=0)
# represent points consistently as (lat, lon)
df_coords = df[['lat', 'lon']]
# define epsilon as 10 kilometers, converted to radians for use by haversine
epsilon = 10 / kms_per_radian
start_time = time.time()
db = DBSCAN(eps=epsilon, min_samples=10, algorithm='ball_tree', metric='haversine').fit(np.radians(df_coords))
cluster_labels = db.labels_
unique_labels = set(cluster_labels)
# get the number of clusters
num_clusters = len(set(cluster_labels))
# get colors and plot all the points, color-coded by cluster (or gray if not in any cluster, aka noise)
fig, ax = plt.subplots()
colors = plt.cm.rainbow(np.linspace(0, 1, len(unique_labels)))
# for each cluster label and color, plot the cluster's points
for cluster_label, color in zip(unique_labels, colors):
    size = 150
    if cluster_label == -1:  # make the noise (labeled -1) appear as smaller gray points
        color = 'gray'
        size = 30
    # plot only the points that match the current cluster label (lon on x, lat on y)
    in_cluster = cluster_labels == cluster_label
    x_coords = df_coords.loc[in_cluster, 'lon']
    y_coords = df_coords.loc[in_cluster, 'lat']
    ax.scatter(x=x_coords, y=y_coords, c=color, edgecolor='k', s=size, alpha=0.5)
ax.set_title('Number of clusters: {}'.format(num_clusters))
plt.show()
coefficient = metrics.silhouette_score(df_coords, cluster_labels)
print('Silhouette coefficient: {:0.3f}'.format(coefficient))
# set eps low (1.5km) so clusters are only formed by very close points
epsilon = 1.5 / kms_per_radian
# set min_samples to 1 so we get no noise - every point will be in a cluster even if it's a cluster of 1
start_time = time.time()
db = DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree', metric='haversine').fit(np.radians(df_coords))
cluster_labels = db.labels_
unique_labels = set(cluster_labels)
# get the number of clusters
num_clusters = len(set(cluster_labels))
# all done, print the outcome
message = 'Clustered {:,} points down to {:,} clusters, for {:.1f}% compression in {:,.2f} seconds'
print(message.format(len(df), num_clusters, 100*(1 - float(num_clusters) / len(df)), time.time()-start_time))
# Result:
Silhouette coefficient: 0.854
Clustered 1,759 points down to 138 clusters, for 92.2% compression in 0.17 seconds
coefficient = metrics.silhouette_score(df_coords, cluster_labels)
print('Silhouette coefficient: {:0.3f}'.format(coefficient))
# number of clusters, ignoring noise if present
num_clusters = len(set(cluster_labels))  # - (1 if -1 in cluster_labels else 0) to ignore noise
print('Number of clusters: {}'.format(num_clusters))
Result:
Number of clusters: 138
# create a series to contain the clusters - each element in the series is the points that compose each cluster
clusters = pd.Series([df_coords[cluster_labels == n] for n in range(num_clusters)])
clusters.tail()
Result:
0      lat        lon
1587   37.921659  22...
1      lat        lon
1658   37.933609  23...
2      lat        lon
1607   37.966766  23...
3      lat        lon
1586   38.149019  22...
4      lat        lon
1584   38.374766  21...
...
133    lat        lon
662    50.37369   18.889205
134    lat        lon
561    50.448704  19.0...
135    lat        lon
661    50.462271  19.0...
136    lat        lon
559    50.489304  19.0...
137    lat        lon
1      51.474005  -0.450999
Data source:
https://github.com/gboeing/2014-summer-travels/tree/master/data
Related resources:
https://github.com/gboeing/urban-data-science/blob/2017/15-Spatial-Cluster-Analysis/cluster-analysis.ipynb
https://geoffboeing.com/2014/08/clustering-to-reduce-spatial-data-set-size/
Answer 3:

You can also add a "frequency penalty" to the distance during cluster assignment. This is described in "Frequency Sensitive Competitive Learning for Scalable Balanced Clustering on High-Dimensional Hyperspheres" by Arindam Banerjee and Joydeep Ghosh, IEEE Transactions on Neural Networks:
http://www.ideal.ece.utexas.edu/papers/arindam04tnn.pdf
They also have an online/streaming version of the algorithm. A rough sketch of the frequency-penalty idea follows.
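For a sense of the mechanism, here is a hedged sketch of frequency-penalized assignment. This is not the paper's exact algorithm: the multiplicative distance-times-count penalty, the sequential assignment order, and all names here are simplifying assumptions.

import numpy as np

def frequency_penalized_kmeans(X, k, n_iter=50, seed=0):
    # k-means-style loop where each point-to-centroid distance is inflated
    # by the cluster's current size, so crowded clusters become less
    # attractive and cluster sizes stay roughly balanced
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        counts = np.ones(k)                         # running cluster sizes
        for i in rng.permutation(len(X)):           # assign points one at a time
            d = np.linalg.norm(X[i] - centroids, axis=1)
            labels[i] = int(np.argmin(d * counts))  # frequency penalty
            counts[labels[i]] += 1
        for j in range(k):                          # recompute centroids
            members = X[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# usage with the question's lat_long frame (treats lat/lon degrees as planar,
# which is only a rough approximation at continental scale):
# labels, centers = frequency_penalized_kmeans(lat_long.to_numpy(), k=30)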