Optimizing distance calculation in R
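I would like to know whether there is any way to optimize the distance calculation below. I have left a small example, but I am working with a spreadsheet of more than 6,000 rows, and computing the variable d takes a long time. Can the code be adjusted so that it gives the same result in a more efficient way?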
library(tictoc)
library(geosphere)
# Start the timer
tic()
df <- structure(list(Industries = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19),
  Latitude = c(-23.8, -23.8, -23.9, -23.9, -23.9, -23.9, -23.9, -23.9, -23.9, -23.9, -23.9, -23.9, -23.9, -23.9,
    -23.9, -23.9, -23.9, -23.9, -23.9),
  Longitude = c(-49.6, -49.6, -49.6, -49.6, -49.6, -49.6, -49.6, -49.6, -49.6, -49.6, -49.7,
    -49.7, -49.7, -49.7, -49.7, -49.6, -49.6, -49.6, -49.6)),
  class = "data.frame", row.names = c(NA, -19L))
# Number of clusters
k <- 3
# distm() expects longitude first, then latitude, hence the 2:1 column order
coordinates <- df[c("Latitude", "Longitude")]
d <- as.dist(distm(coordinates[, 2:1]))
# Average-linkage hierarchical clustering, cut into k groups
fit.average <- hclust(d, method = "average")
clusters <- cutree(fit.average, k)
nclusters <- matrix(table(clusters))
df$cluster <- clusters
# Stop the timer and print the elapsed time
toc()
1.54 sec elapsed
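A likely cost in the code above is that distm() evaluates a geodesic distance for every pair of rows. One option, if a spherical approximation is acceptable, is the haversine formula (geosphere's distm() also accepts a fun argument, for example fun = distHaversine). The following is a minimal base R sketch of a vectorised haversine matrix; the function name haversine_matrix and the radius of 6378137 m are assumptions made for illustration, and the results differ slightly from the ellipsoidal distances printed in this question.

```r
# Vectorised haversine distance matrix in base R.
# This is a sketch of a swapped-in technique, not what distm() does by default.
# lon, lat: numeric vectors in decimal degrees
# r: sphere radius in metres (assumed value)
haversine_matrix <- function(lon, lat, r = 6378137) {
  to_rad <- pi / 180
  lon <- lon * to_rad
  lat <- lat * to_rad
  # Pairwise differences in latitude and longitude
  dlat <- outer(lat, lat, "-")
  dlon <- outer(lon, lon, "-")
  a <- sin(dlat / 2)^2 + outer(cos(lat), cos(lat)) * sin(dlon / 2)^2
  # pmin() guards against rounding pushing sqrt(a) slightly above 1
  2 * r * asin(pmin(sqrt(a), 1))
}

# Three of the distinct locations from the sample data
lon <- c(-49.6, -49.6, -49.7)
lat <- c(-23.8, -23.9, -23.9)
m <- haversine_matrix(lon, lat)

# Same kind of object that hclust() receives in the question
d_fast <- as.dist(m)
```

Note that outer() still allocates full n by n matrices, so for 6,000 rows each intermediate holds 36 million doubles (roughly 288 MB); the gain is in avoiding per-pair function calls, not in memory.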
d
1 2 3 4 5 6 7 8
2 0.00
3 11075.61 11075.61
4 11075.61 11075.61 0.00
5 11075.61 11075.61 0.00 0.00
6 11075.61 11075.61 0.00 0.00 0.00
7 11075.61 11075.61 0.00 0.00 0.00 0.00
8 11075.61 11075.61 0.00 0.00 0.00 0.00 0.00
9 11075.61 11075.61 0.00 0.00 0.00 0.00 0.00 0.00
10 11075.61 11075.61 0.00 0.00 0.00 0.00 0.00 0.00
11 15048.01 15048.01 10183.02 10183.02 10183.02 10183.02 10183.02 10183.02
12 15048.01 15048.01 10183.02 10183.02 10183.02 10183.02 10183.02 10183.02
13 15048.01 15048.01 10183.02 10183.02 10183.02 10183.02 10183.02 10183.02
14 15048.01 15048.01 10183.02 10183.02 10183.02 10183.02 10183.02 10183.02
15 15048.01 15048.01 10183.02 10183.02 10183.02 10183.02 10183.02 10183.02
16 11075.61 11075.61 0.00 0.00 0.00 0.00 0.00 0.00
17 11075.61 11075.61 0.00 0.00 0.00 0.00 0.00 0.00
18 11075.61 11075.61 0.00 0.00 0.00 0.00 0.00 0.00
19 11075.61 11075.61 0.00 0.00 0.00 0.00 0.00 0.00
9 10 11 12 13 14 15 16
2
3
4
5
6
7
8
9
10 0.00
11 10183.02 10183.02
12 10183.02 10183.02 0.00
13 10183.02 10183.02 0.00 0.00
14 10183.02 10183.02 0.00 0.00 0.00
15 10183.02 10183.02 0.00 0.00 0.00 0.00
16 0.00 0.00 10183.02 10183.02 10183.02 10183.02 10183.02
17 0.00 0.00 10183.02 10183.02 10183.02 10183.02 10183.02 0.00
18 0.00 0.00 10183.02 10183.02 10183.02 10183.02 10183.02 0.00
19 0.00 0.00 10183.02 10183.02 10183.02 10183.02 10183.02 0.00
17 18
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18 0.00
19 0.00 0.00
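The many zeros in the matrix above show that several rows share the same coordinates. When that is the case, one possible saving is to compute distances only between the distinct coordinate pairs and then expand the small matrix back to all rows by indexing. The sketch below uses base R only; dist_fun is a hypothetical placeholder (a toy Euclidean distance here), and in the real case it could be swapped for something like function(xy) geosphere::distm(xy).

```r
# Sketch: distances between distinct coordinates only, then expansion by index.
coords <- data.frame(
  Longitude = c(-49.6, -49.6, -49.6, -49.7, -49.7),
  Latitude  = c(-23.8, -23.8, -23.9, -23.9, -23.9)
)

# Distinct coordinate pairs and, for each original row, its position among them
uniq <- unique(coords)
key_all  <- paste(coords$Longitude, coords$Latitude)
key_uniq <- paste(uniq$Longitude, uniq$Latitude)
idx <- match(key_all, key_uniq)

# Placeholder distance function (toy Euclidean distance, for illustration only)
dist_fun <- function(xy) as.matrix(dist(xy))

# Small matrix on the distinct points, expanded to one row/column per original row
small <- dist_fun(uniq)
full <- small[idx, idx]
```

With 3 distinct locations among 19 rows, as in the sample data, this would reduce the expensive step from a 19 by 19 problem to a 3 by 3 one; how much it helps on the full data depends on how many coordinates repeat.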
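Using the OSRM package
library(osrm)
# Calculate the travel time matrix between points.
travelTime <- osrmTable(loc = df[1:19, c("Industries", "Longitude", "Latitude")], gepaf = FALSE)
# Calculate the route distance.
# Assumption: like osrmTable(), osrmRoute() takes identifier, longitude, latitude in that order.
coordinates <- df[c("Industries", "Longitude", "Latitude")]
distance <- osrmRoute(loc = coordinates, returnclass = "sf")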
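Answer
@Jose This may be less sound mathematically (as far as clustering goes), but it is (generally) a better measure of great-circle distance (Vincenty's formula). It is about 8 times faster, which I think is the result you want (measured on the sample data only).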
# Order the dataframe by Lon and Lat: ordered_df => data.frame
ordered_df <-
df %>%
arrange(., Longitude, Latitude)
# Scalar valued at how many clusters we are expecting => integer vector
k = 3
# Matrix of co-ordinates: coordinates => matrix
coordinates <-
ordered_df %>%
select(Longitude, Latitude) %>%
as.matrix()
# Generate great circle distances between points and Long-Lat Matrix: d => data.frame
d <- data.frame(Dist = c(0, distVincentyEllipsoid(coordinates)))
# Segment the distances into groups: cluster => factor
d$Cluster <- factor(cumsum(d$Dist > (quantile(d$Dist, 1/k))) + 1)
# Merge with base data: clustered_df => data.frame
clustered_df <- cbind(ordered_df, d)
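The grouping step in the answer can be easier to follow on a toy vector. The sketch below assumes the points are already sorted and that seq_dist holds the distance from each point to the previous one (the values are illustrative, loosely based on the matrix in the question). A new cluster starts wherever the sequential distance exceeds the 1/k quantile.

```r
# Toy sequential distances between consecutive, already-sorted points (metres)
seq_dist <- c(0, 0, 0, 10183, 0, 0, 11075, 0)
k <- 3

# Threshold: the 1/k quantile of the sequential distances
threshold <- quantile(seq_dist, 1 / k)

# Each jump above the threshold starts a new cluster
cluster <- cumsum(seq_dist > threshold) + 1

# Number of clusters actually produced
n_found <- length(unique(cluster))
```

The number of clusters produced equals the number of jumps above the threshold plus one, so it is not guaranteed to be k on other data; it is worth checking n_found before relying on the labels.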
Libraries and sample data:
library(geosphere)
library(dplyr)
df <- structure(list(Industries=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19),
Latitude = c(-23.8, -23.8, -23.9, -23.9, -23.9, -23.9, -23.9, -23.9, -23.9, -23.9, -23.9, -23.9, -23.9, -23.9, -23.9, -23.9, -23.9, -23.9, -23.9),
Longitude = c(-49.6, -49.6, -49.6, -49.6, -49.6, -49.6, -49.6, -49.6, -49.6, -49.6, -49.7,-49.7, -49.7, -49.7, -49.7, -49.6, -49.6, -49.6, -49.6)),
class = "data.frame", row.names = c(NA, -19L))
start_time <- Sys.time()
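The start_time line suggests that elapsed time was measured with Sys.time(). A minimal sketch of that timing pattern in base R follows; the workload in the middle is a hypothetical placeholder and stands in for the distance and clustering code being compared.

```r
# Record the start time
start_time <- Sys.time()

# Placeholder workload (hypothetical): a 1000 x 1000 Euclidean distance matrix
x <- as.matrix(dist(matrix(runif(2000), ncol = 2)))

# Elapsed wall-clock time in seconds
elapsed <- difftime(Sys.time(), start_time, units = "secs")
print(elapsed)
```

For very short runs such as the 19-row sample, a single measurement is noisy, so repeating the run several times gives a fairer comparison.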