K-means: how to determine most locations near specific latitudes and longitudes [closed]
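Posted: 2020-10-03 10:53:39
Question: I know the center latitude and longitude of every neighborhood in a city, and I have a set of restaurants with their latitudes and longitudes. I need to use something like K-means to determine which neighborhood is the densest. Say I have a first series of about ten latitude/longitude pairs and a second series of about 200: how do I determine which of the ten is the densest, that is, which one has the most locations nearby?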
Answer 1: How about this?
# import necessary modules
import pandas as pd, numpy as np, matplotlib.pyplot as plt, time
from sklearn.cluster import DBSCAN
from sklearn import metrics
from geopy.distance import great_circle
from shapely.geometry import MultiPoint
# define the number of kilometers in one radian
kms_per_radian = 6371.0088
# load the data set
df = pd.read_csv('C:\\your_path_here\\summer-travel-gps-full.csv', encoding = "ISO-8859-1")
df.head()
# how many rows are in this data set?
len(df)
# scatterplot it to get a sense of what it looks like
df = df.sort_values(by=['lat', 'lon'])
ax = df.plot(kind='scatter', x='lon', y='lat', alpha=0.5, linewidth=0)
# represent points consistently as (lat, lon)
df_coords = df[['lat', 'lon']]
# define epsilon as 10 kilometers, converted to radians for use by haversine
epsilon = 10 / kms_per_radian
start_time = time.time()
db = DBSCAN(eps=epsilon, min_samples=10, algorithm='ball_tree', metric='haversine').fit(np.radians(df_coords))
cluster_labels = db.labels_
unique_labels = set(cluster_labels)
# get the number of clusters
num_clusters = len(set(cluster_labels))
# get colors and plot all the points, color-coded by cluster (or gray if not in any cluster, aka noise)
fig, ax = plt.subplots()
colors = plt.cm.rainbow(np.linspace(0, 1, len(unique_labels)))
# for each cluster label and color, plot the cluster's points
for cluster_label, color in zip(unique_labels, colors):
    size = 150
    if cluster_label == -1:  # make the noise (which is labeled -1) appear as smaller gray points
        color = 'gray'
        size = 30
    # plot only the points that match the current cluster label (x = lon, y = lat)
    cluster_points = df_coords[cluster_labels == cluster_label]
    x_coords = cluster_points['lon']
    y_coords = cluster_points['lat']
    ax.scatter(x=x_coords, y=y_coords, c=color, edgecolor='k', s=size, alpha=0.5)
ax.set_title('Number of clusters: {}'.format(num_clusters))
plt.show()
coefficient = metrics.silhouette_score(df_coords, cluster_labels)
print('Silhouette coefficient: {:0.03f}'.format(coefficient))
# set eps low (1.5km) so clusters are only formed by very close points
epsilon = 1.5 / kms_per_radian
# set min_samples to 1 so we get no noise - every point will be in a cluster even if it's a cluster of 1
start_time = time.time()
db = DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree', metric='haversine').fit(np.radians(df_coords))
cluster_labels = db.labels_
unique_labels = set(cluster_labels)
# get the number of clusters
num_clusters = len(set(cluster_labels))
# all done, print the outcome
message = 'Clustered {:,} points down to {:,} clusters, for {:.1f}% compression in {:,.2f} seconds'
print(message.format(len(df), num_clusters, 100*(1 - float(num_clusters) / len(df)), time.time()-start_time))
# Result:
# Silhouette coefficient: 0.854
# Clustered 1,759 points down to 138 clusters, for 92.2% compression in 0.17 seconds
coefficient = metrics.silhouette_score(df_coords, cluster_labels)
print('Silhouette coefficient: {:0.03f}'.format(coefficient))
# number of clusters, ignoring noise if present
num_clusters = len(set(cluster_labels)) #- (1 if -1 in labels else 0)
print('Number of clusters: {}'.format(num_clusters))
# Result:
# Number of clusters: 138
# create a series to contain the clusters - each element in the series is the points that compose each cluster
clusters = pd.Series([df_coords[cluster_labels == n] for n in range(num_clusters)])
clusters.tail()
Final result:
0 lat lon
1587 37.921659 22...
1 lat lon
1658 37.933609 23...
2 lat lon
1607 37.966766 23...
3 lat lon
1586 38.149019 22...
4 lat lon
1584 38.374766 21...
133 lat lon
662 50.37369 18.889205
134 lat lon
561 50.448704 19.0...
135 lat lon
661 50.462271 19.0...
136 lat lon
559 50.489304 19.0...
137 lat lon
1 51.474005 -0.450999
https://github.com/gboeing/urban-data-science/blob/2017/15-Spatial-Cluster-Analysis/cluster-analysis.ipynb
https://geoffboeing.com/2014/08/clustering-to-reduce-spatial-data-set-size/
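The code above groups the points into clusters but stops short of saying which cluster holds the most points. One way to get there, as a minimal sketch, is to count how often each label occurs and skip the noise label -1. The helper name `largest_cluster` and the sample labels are made up for illustration; with the code above you would pass `db.labels_.tolist()` instead. Note that a large point count only suggests density if the clusters cover areas of similar size.

```python
from collections import Counter


def largest_cluster(labels):
    """Return (label, size) of the biggest cluster, skipping DBSCAN noise (-1).

    Returns None when every point is noise or the input is empty.
    """
    sizes = Counter(label for label in labels if label != -1)
    if not sizes:
        return None
    return sizes.most_common(1)[0]


# Hypothetical labels for illustration only
example_labels = [0, 0, 1, -1, 1, 1, 2]
print(largest_cluster(example_labels))
```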
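Answer 2: If you know the boundary (or an approximate radius) of each neighborhood, for example from map data for the city, you can check which neighborhood each restaurant falls in.
Otherwise, you can compute the distance between each restaurant and each neighborhood's center point, and assign each of the 200 restaurants to the nearest neighborhood.
You can then approximate each neighborhood's density as the number of restaurants in it divided by the total number of restaurants.
I don't think you need any machine learning algorithm for this.
Of course, you can choose the distance metric to suit your problem.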
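The nearest-center assignment described in this answer could be sketched as follows, using only the standard library. This is an illustration under assumptions, not a definitive implementation: the function names, the neighborhood names, and the coordinates are all hypothetical, and the haversine great-circle distance is just one possible choice of metric.

```python
import math
from collections import Counter

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius, the same constant Answer 1 uses


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_center(point, centers):
    """Return the name of the neighborhood center closest to point (lat, lon)."""
    return min(centers, key=lambda name: haversine_km(*point, *centers[name]))


def restaurant_share(restaurants, centers):
    """Map each neighborhood to its fraction of all restaurants (list must be non-empty)."""
    counts = Counter(nearest_center(p, centers) for p in restaurants)
    total = len(restaurants)
    return {name: counts.get(name, 0) / total for name in centers}


# Hypothetical sample data: two neighborhood centers and three restaurants
centers = {"north": (40.80, -73.95), "south": (40.70, -74.00)}
restaurants = [(40.81, -73.96), (40.79, -73.94), (40.71, -74.01)]

shares = restaurant_share(restaurants, centers)
densest = max(shares, key=shares.get)
print(densest, shares)
```

With ten centers and about 200 restaurants this brute-force loop does roughly 2,000 distance computations, so no spatial index is needed at that scale.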