Create a distance matrix from individual distances
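[Title]: Create a distance matrix from individual distances [Posted]: 2020-09-11 01:05:22 [Question]: I have a list of the distances between each pair of adjacent stations on a railway line, listed in the order the stations occur. I need to build a matrix of the distances between every pair of stations. Here is the list.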
+-------------------------+-------------------------+---------------+
| Departure Station | Arrival Station | distance in m |
+-------------------------+-------------------------+---------------+
| | San Francisco | 0.0 |
| San Francisco | 22nd Street | 2521.949349 |
| 22nd Street | Bayshore | 5875.8986 |
| Bayshore | South San Francisco | 6690.161279 |
| South San Francisco | San Bruno | 2964.853585 |
| San Bruno | Millbrae Transit Center | 4154.792069 |
| Millbrae Transit Center | Broadway | 2549.171972 |
| Broadway | Burlingame | 1762.653178 |
| Burlingame | San Mateo | 2307.847611 |
| San Mateo | Hayward Park | 2148.992125 |
| Hayward Park | Hillsdale | 2597.932334 |
| Hillsdale | Belmont | 2092.15 |
| Belmont | San Carlos | 1990.239598 |
| San Carlos | Redwood City | 3492.618122 |
| Redwood City | Atherton | 3847.644532 |
| Atherton | Menlo Park | 1752.92218 |
| Menlo Park | Palo Alto | 2011.382315 |
| Palo Alto | Stanford | 1582.663905 |
| Stanford | California Ave. | 965.606 |
| California Ave. | San Antonio | 3939.685111 |
| San Antonio | Mountain View | 3108.414275 |
| Mountain View | Sunnyvale | 4312.51742 |
| Sunnyvale | Lawrence | 3189.943773 |
| Lawrence | Santa Clara | 5889.680131 |
| Santa Clara | College Park | 2252.43061 |
| College Park | San Jose Diridon | 1872.857195 |
| San Jose Diridon | Tamien | 2887.967478 |
| Tamien | Capitol | 4999.21158 |
| Capitol | Blossom Hill | 5304.202424 |
| Blossom Hill | Morgan Hill | 19050.76536 |
| Morgan Hill | San Martin | 5917.5495 |
| San Martin | Gilroy | 10061.59472 |
| Gilroy | Gilroy | 0.0 |
+-------------------------+-------------------------+---------------+
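My idea is to build a list of distances and a dictionary that maps each station to its index, then fill the matrix by looking up two stations in the dictionary and summing the list over the index range between them. I have put a lot of work into building the matrix this way but could not get a result.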
import pandas as pd

# read_csv accepts a path directly, so no separate open() call is needed
dist = pd.read_csv('/Users/miss_evgenia/Downloads/Caltrain Metrics - Sheet4.csv')
distances = list(dist['distance in m'])
#%%
names = list(dist['Departure Station'])
names.pop(0)  # the first row has no departure station
names = dict(zip(names, range(len(names))))
#%%
def sumRange(L, a, b):
    # sum of L[a] through L[b], inclusive
    total = 0
    for i in range(a, b + 1, 1):
        total += L[i]
    return total
Here are my dictionary and list.
{'San Francisco': 0, '22nd Street': 1, 'Bayshore': 2, 'South San Francisco': 3, 'San Bruno': 4, 'Millbrae Transit Center': 5, 'Broadway': 6, 'Burlingame': 7, 'San Mateo': 8, 'Hayward Park': 9, 'Hillsdale': 10, 'Belmont': 11, 'San Carlos': 12, 'Redwood City': 13, 'Atherton': 14, 'Menlo Park': 15, 'Palo Alto': 16, 'Stanford': 17, 'California Ave.': 18, 'San Antonio': 19, 'Mountain View': 20, 'Sunnyvale': 21, 'Lawrence': 22, 'Santa Clara': 23, 'College Park': 24, 'San Jose Diridon': 25, 'Tamien': 26, 'Capitol': 27, 'Blossom Hill': 28, 'Morgan Hill': 29, 'San Martin': 30, 'Gilroy': 31}
[0.0, 2521.949349, 5875.8986, 6690.161279, 2964.8535850000003, 4154.792069, 2549.171972, 1762.653178, 2307.847611, 2148.992125, 2597.932334, 2092.15, 1990.2395980000001, 3492.618122, 3847.6445320000003, 1752.92218, 2011.3823149999998, 1582.663905, 965.6060000000001, 3939.685111, 3108.414275, 4312.51742, 3189.943773, 5889.680131, 2252.4306100000003, 1872.8571949999998, 2887.967478, 4999.21158, 5304.202424, 19050.765359999998, 5917.5495, 10061.594720000001, 0.0]
Please help! Thank you.
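The index-range idea described in the question can be made to work. What follows is a minimal sketch, not the asker's final code, assuming the list and dictionary are laid out as shown: entry i of the list is the leg arriving at station i, so the distance from station a to station b (with a < b) is the sum of entries a+1 through b. Only the first four stations are used, and `sum_range` and `build_matrix` are hypothetical helper names.

```python
# First four stations from the table; distances[i] is the leg arriving at station i.
distances = [0.0, 2521.949349, 5875.8986, 6690.161279]
names = {'San Francisco': 0, '22nd Street': 1, 'Bayshore': 2, 'South San Francisco': 3}

def sum_range(L, a, b):
    """Sum L[a] through L[b], inclusive."""
    return sum(L[a:b + 1])

def build_matrix(distances, n):
    """Return an n x n nested list of distances between stations 0..n-1."""
    matrix = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            # Legs a+1..b are exactly the ones travelled between stations a and b.
            d = sum_range(distances, a + 1, b)
            matrix[a][b] = d
            matrix[b][a] = d  # the distance is the same in both directions
    return matrix

matrix = build_matrix(distances, len(names))
print(matrix[names['San Francisco']][names['Bayshore']])
```

With the full 33-entry list, the trailing 0.0 (Gilroy to Gilroy) is never read, because the loop only visits the 32 indices in the dictionary.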
[Comments]:
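I would transform the list to get each station's cumulative distance from San Francisco; the distance between any two stations is then the difference between their distances from San Francisco. I'll try to write that up for you now.
[Answer 1]: If the station names in the departure and arrival columns match, you could try something like this: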
import pandas as pd

# skip the first row, which has no departure station
legs = distance_table.dropna(subset=["Departure Station"])
cities = legs["Departure Station"].unique()
matrix = pd.DataFrame(columns=cities, index=cities)
for dep, arr, d in legs.itertuples(index=False):
    matrix.at[dep, arr] = d
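Here distance_table is the table shown in your question. Note that this fills in only the entries for the adjacent station pairs listed in the table. You might even be able to do this with .apply().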
[Answer 2]: You can compute the "positions" of the stations as the cumsum of the distances, then use scipy.spatial.distance.pdist to compute the pairwise distances:
from scipy.spatial.distance import pdist, squareform
positions = data['distance in m'].cumsum()
matrix = squareform(pdist(positions.to_numpy()[:, None], 'euclidean'))
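To make the shape of the result concrete, here is a small self-contained sketch of the same approach. It assumes a three-row excerpt of the question's table held in a DataFrame named `data` (the name used above); the labelled wrapper `labelled` is an addition for illustration and is not part of the answer.

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Three-row excerpt of the question's table.
data = pd.DataFrame({
    'Arrival Station': ['San Francisco', '22nd Street', 'Bayshore'],
    'distance in m': [0.0, 2521.949349, 5875.8986],
})

# Position of each station along the line, measured from San Francisco.
positions = data['distance in m'].cumsum()

# pdist expects a 2-D array of points, so the 1-D positions become a single column.
matrix = squareform(pdist(positions.to_numpy()[:, None], 'euclidean'))

# Attach station names so that entries can be looked up by label.
labelled = pd.DataFrame(matrix, index=data['Arrival Station'], columns=data['Arrival Station'])
print(labelled.loc['San Francisco', 'Bayshore'])
```

In one dimension the Euclidean distance reduces to the absolute difference of the two positions, which is why pdist gives the along-track distance here.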
[Answer 3]: In addition to a_guest's answer, you can also try the following, which returns the result as a labelled pandas DataFrame:
import pandas as pd

def transform_dataframe():
    with open("test_data.csv", "r") as input_data:
        station_distances = pd.read_csv(input_data)
    # drop the last row to stop Gilroy appearing twice
    station_distances.drop(station_distances.tail(1).index, inplace=True)
    # position of each station, measured from San Francisco
    cumulative_distances = station_distances['distance in m'].cumsum()
    # signed pairwise differences: entry [i, j] is position j minus position i
    distance_matrix = cumulative_distances.values - cumulative_distances.values[:, None]
    distance_matrix = pd.DataFrame(distance_matrix, index=station_distances["Arrival Station"], columns=station_distances["Arrival Station"])
    return distance_matrix
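A brief usage sketch under stated assumptions: the same cumulative-difference step applied to an in-memory three-row excerpt instead of test_data.csv, with `np.abs` added so that the matrix is symmetric and non-negative (the function above returns signed differences). The variable names `cumulative`, `unsigned`, and `stations` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Three-row excerpt of the question's table, built in memory.
station_distances = pd.DataFrame({
    'Arrival Station': ['San Francisco', '22nd Street', 'Bayshore'],
    'distance in m': [0.0, 2521.949349, 5875.8986],
})

# Position of each station, measured from San Francisco.
cumulative = station_distances['distance in m'].cumsum().to_numpy()

# Broadcasting a row against a column yields every pairwise difference;
# np.abs drops the sign so that [i, j] equals [j, i].
unsigned = np.abs(cumulative - cumulative[:, None])

stations = station_distances['Arrival Station']
distance_matrix = pd.DataFrame(unsigned, index=stations, columns=stations)
print(distance_matrix.loc['Bayshore', '22nd Street'])
```

Looking up a pair with `.loc` works in either order once the absolute value is taken.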