r - Check how many times each value on a vector is on a set of areas
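【Title】: r - Check how many times each value on a vector is on a set of areas 【Posted】: 2019-04-04 23:48:30 【Question】: I have two data frames. The first holds the coordinates of some points; the second holds a set of areas, each with limits on both lat and lon. For each point, I want to know which area (or areas) it falls in and the total capacity available.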
For example, df1 holds the points and df2 holds the areas and their capacities:
df1 <- data.frame(cluster = c("id1", "id2", "id3"),
                  lat_m = c(-3713015, -4086295, -3710672),
                  lon_m = c(-6556760, -6516930, -6569831))
df2 <- data.frame(id = c("a1", "a2", "a3"),
                  max_lat = c(-3713013, -3713000, -3710600),
                  min_lat = c(-3713017, -3713100, -3710700),
                  max_lon = c(-6556755, -6556740, -6569820),
                  min_lon = c(-6556765, -6556800, -6569840),
                  capacity = c(5, 2, 3))
I would like something like this:
result <- data.frame(cluster = c("id1", "id2", "id3"),
                     areas = c(2, 0, 1),
                     areas_id = c("a1, a2", "", "a3"),
                     capacity = c(7, 0, 3))
My data has more than 1 million points and more than 10,000 areas (and it will grow), so ideally I should avoid for loops.
【Comments on the question】:
Right! Edited for clarity. 【Answer 1】: Here is a solution using sqldf and dplyr:
library(sqldf)
library(dplyr)

# Left join each point to every area whose bounds contain it
sql <- paste0(
  "SELECT df1.cluster, df2.id, df2.capacity ",
  "FROM df1 LEFT JOIN df2 ON (df1.lat_m BETWEEN df2.min_lat AND df2.max_lat) AND ",
  "(df1.lon_m BETWEEN df2.min_lon AND df2.max_lon)"
)

result <- sqldf(sql) %>%
  group_by(cluster) %>%
  summarise(
    areas = n_distinct(id) - anyNA(id),  # an unmatched point has id = NA; do not count it
    areas_id = toString(id),
    capacity = sum(capacity, na.rm = TRUE)
  )
# A tibble: 3 x 4
  cluster areas areas_id capacity
  <fct>   <int> <chr>       <dbl>
1 id1         2 a1, a2       7.00
2 id2         0 NA           0
3 id3         1 a3           3.00
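The desired result in the question shows an empty string in areas_id for a point that falls in no area, while toString() applied to a lone NA produces the text "NA". One possible adjustment, which is not part of the original answer, is to drop missing ids before collapsing them. The helper name collapse_ids below is hypothetical and only serves the illustration:

```r
# Hypothetical helper (not in the original answer):
# collapse ids into one comma-separated string, ignoring NA
collapse_ids <- function(id) toString(id[!is.na(id)])

ids_none <- NA_character_   # what the LEFT JOIN yields for a point in no area
ids_some <- c("a1", "a2")   # a point that falls in two areas

plain_none <- toString(ids_none)      # the literal string "NA"
clean_none <- collapse_ids(ids_none)  # an empty string
clean_some <- collapse_ids(ids_some)  # ids joined with ", "
```

Under this assumption, areas_id = toString(id) in the summarise() call would become areas_id = collapse_ids(id).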
【Answer 2】: You can join the two tables on >= and <= conditions, then summarise by cluster group.
library(data.table)
library(magrittr) # not necessary, just loaded for %>%

setDT(df1)
setDT(df2)

df2[df1, on = .(min_lat <= lat_m, max_lat >= lat_m, min_lon <= lon_m, max_lon >= lon_m)
    , .(cluster, id, capacity)] %>% # these first two lines do the join
  .[, .(areas = sum(!is.na(capacity))
        , areas_id = paste(id, collapse = ', ')
        , capacity = sum(capacity, na.rm = TRUE))
    , by = cluster] # this summarises each cluster group of rows
#    cluster areas areas_id capacity
# 1:     id1     2   a1, a2        7
# 2:     id2     0       NA        0
# 3:     id3     1       a3        3
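To sanity-check the join on a small sample, the same >= and <= conditions can be evaluated in base R with outer(). This is only a verification sketch, not a replacement for the join: it materialises a points-by-areas logical matrix, so it is not suited to the 1 million by 10,000 case described in the question. The names inside and check are introduced here for illustration.

```r
# Sample data from the question
df1 <- data.frame(cluster = c("id1", "id2", "id3"),
                  lat_m = c(-3713015, -4086295, -3710672),
                  lon_m = c(-6556760, -6516930, -6569831))
df2 <- data.frame(id = c("a1", "a2", "a3"),
                  max_lat = c(-3713013, -3713000, -3710600),
                  min_lat = c(-3713017, -3713100, -3710700),
                  max_lon = c(-6556755, -6556740, -6569820),
                  min_lon = c(-6556765, -6556800, -6569840),
                  capacity = c(5, 2, 3))

# inside[i, j] is TRUE when point i lies within the bounds of area j
inside <- outer(df1$lat_m, df2$min_lat, ">=") &
  outer(df1$lat_m, df2$max_lat, "<=") &
  outer(df1$lon_m, df2$min_lon, ">=") &
  outer(df1$lon_m, df2$max_lon, "<=")

# One row per point: number of areas, their ids, and their summed capacity
check <- data.frame(
  cluster  = df1$cluster,
  areas    = rowSums(inside),
  areas_id = apply(inside, 1, function(hit) toString(df2$id[hit])),
  capacity = apply(inside, 1, function(hit) sum(df2$capacity[hit]))
)
```

Comparing check against the joined result on a few hundred rows is a cheap way to confirm that the inequality directions in the on clause are the right way round.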
SQL version (partly based on @shree's answer):
library(sqldf)

sqldf("
  select df1.cluster
       , case when sum(df2.capacity) is NULL
              then 0
              else count(*)
         end as areas
       , group_concat(df2.id) as areas_id
       , coalesce(sum(df2.capacity), 0) as capacity
  from df1
  left join df2
    on df1.lat_m between df2.min_lat and df2.max_lat
   and df1.lon_m between df2.min_lon and df2.max_lon
  group by df1.cluster
")
#   cluster areas areas_id capacity
# 1     id1     2    a1,a2        7
# 2     id2     0     <NA>        0
# 3     id3     1       a3        3
【Comments】:
Judging from benchmarks I found online, data.table will most likely beat sqldf. I am not sure about this specific case, though, since I am not familiar with either package.