Inner_join with two conditions and interval within interval condition
【Title】: Inner_join with two conditions and interval within interval condition 【Posted】: 2018-10-17 04:40:13 【Question】: I am trying to join 2 data frames based on multiple conditions plus a time-interval condition, as in the following example:
# two sample dataframes with time intervals
library(dplyr)      # %>%, mutate(), select(), inner_join()
library(lubridate)  # interval(), %within%

# hms::as_hms() replaces the deprecated hms::as.hms()
df1 <- data.frame(key1 = c("a", "b", "c", "d", "e"),
                  key2 = c(1:5),
                  time1 = as.POSIXct(hms::as_hms(c("00:00:15", "00:15:15", "00:30:15", "00:40:15", "01:10:15"))),
                  time2 = as.POSIXct(hms::as_hms(c("00:05:15", "00:20:15", "00:35:15", "00:45:15", "01:15:15")))) %>%
  mutate(t1 = interval(time1, time2)) %>%
  select(key1, key2, t1)

df2 <- data.frame(key1 = c("b", "c", "a", "e", "d"),
                  key2 = c(2, 6, 1, 8, 5),
                  sam1 = as.POSIXct(hms::as_hms(c("00:21:15", "00:31:15", "00:03:15", "01:20:15", "00:43:15"))),
                  sam2 = as.POSIXct(hms::as_hms(c("00:23:15", "00:34:15", "00:04:15", "01:25:15", "00:44:15")))) %>%
  mutate(t2 = interval(sam1, sam2)) %>%
  select(key1, key2, t2)
The first thing that has to match is the columns key1 and key2, which can be done as follows (this produces an error):
df <- inner_join(df1, df2, by = c("key1", "key2"))
But there is one more condition to check when joining: whether the interval t2 lies within t1. I can do this manually like this:
df$t2 %within% df$t1
I guess the error comes from joining data frames that contain interval columns, which is probably not the right way to do it.
# desired dataframe
df <- data.frame(key1 = c("a", "b"), key2 = c(1,2), time_condition = c(TRUE, FALSE))
If t1 runs from "00:00:15" to "00:05:15", then the corresponding t2, "00:03:15" to "00:04:15", lies within t1. This yields the time_condition column, which is TRUE if t2 is within t1 and FALSE otherwise.
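The check described above can be tried on single intervals before any join is involved. The following is a minimal sketch, assuming lubridate is installed; the helper `at()` and the variable names are made up for illustration, and the times are the "a" and "b" rows of the sample data.

```r
library(lubridate)

# Hypothetical helper: build a UTC timestamp on 1970-01-01 from "HH:MM:SS"
at <- function(x) as.POSIXct(paste("1970-01-01", x), tz = "UTC")

# Row "a": t2 (00:03:15 to 00:04:15) against t1 (00:00:15 to 00:05:15)
t1_a <- interval(at("00:00:15"), at("00:05:15"))
t2_a <- interval(at("00:03:15"), at("00:04:15"))

# Row "b": t2 (00:21:15 to 00:23:15) against t1 (00:15:15 to 00:20:15)
t1_b <- interval(at("00:15:15"), at("00:20:15"))
t2_b <- interval(at("00:21:15"), at("00:23:15"))

a_inside <- t2_a %within% t1_a  # t2 lies inside t1
b_inside <- t2_b %within% t1_b  # t2 starts after t1 has ended

print(c(a = a_inside, b = b_inside))
```

These two results correspond to the TRUE and FALSE values in the desired time_condition column.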
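【Comments】:
What does it mean, in the context of a join, for one interval to be within another? Can you show us with sample data?
The post has been edited to clarify the requested information.
【Answer 1】: How about this?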
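library(dplyr)
library(lubridate)  # interval()

# uses the sample data below, where df1/df2 still carry the raw
# time1/time2 and sam1/sam2 columns
df1 %>%
  inner_join(df2, by = c("key1", "key2")) %>%
  # keep only rows where [sam1, sam2] lies inside [time1, time2]
  filter(sam1 >= time1 & sam1 <= time2 & sam2 >= time1 & sam2 <= time2) %>%
  mutate(t1 = interval(time1, time2),
         t2 = interval(sam1, sam2)) %>%
  select(key1, key2, t1, t2)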
Output:
key1 key2 t1 t2
1 a 1 1970-01-01 00:00:15 UTC--1970-01-01 00:05:15 UTC 1970-01-01 00:03:15 UTC--1970-01-01 00:04:15 UTC
Sample data:
# hms::as_hms() replaces the deprecated hms::as.hms()
df1 <- data.frame(key1 = c("a", "b", "c", "d", "e"),
                  key2 = c(1:5),
                  time1 = as.POSIXct(hms::as_hms(c("00:00:15", "00:15:15", "00:30:15", "00:40:15", "01:10:15"))),
                  time2 = as.POSIXct(hms::as_hms(c("00:05:15", "00:20:15", "00:35:15", "00:45:15", "01:15:15"))))
df2 <- data.frame(key1 = c("b", "c", "a", "e", "d"),
                  key2 = c(2, 6, 1, 8, 5),
                  sam1 = as.POSIXct(hms::as_hms(c("00:21:15", "00:31:15", "00:03:15", "01:20:15", "00:43:15"))),
                  sam2 = as.POSIXct(hms::as_hms(c("00:23:15", "00:34:15", "00:04:15", "01:25:15", "00:44:15"))))
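Note that filter() in this answer drops the rows where t2 is not inside t1, so the "b" row from the desired output disappears. If the TRUE/FALSE column from the question is wanted instead, one possible variation is to compute the condition with mutate(). This is only a sketch: it assumes each interval's start is not after its end, and the helper `at()` and the name `result` are made up for illustration (the timestamps are built with base R rather than hms).

```r
library(dplyr)

# Hypothetical helper: build a UTC timestamp on 1970-01-01 from "HH:MM:SS"
at <- function(x) as.POSIXct(paste("1970-01-01", x), tz = "UTC")

df1 <- data.frame(key1 = c("a", "b", "c", "d", "e"),
                  key2 = 1:5,
                  time1 = at(c("00:00:15", "00:15:15", "00:30:15", "00:40:15", "01:10:15")),
                  time2 = at(c("00:05:15", "00:20:15", "00:35:15", "00:45:15", "01:15:15")))
df2 <- data.frame(key1 = c("b", "c", "a", "e", "d"),
                  key2 = c(2, 6, 1, 8, 5),
                  sam1 = at(c("00:21:15", "00:31:15", "00:03:15", "01:20:15", "00:43:15")),
                  sam2 = at(c("00:23:15", "00:34:15", "00:04:15", "01:25:15", "00:44:15")))

result <- df1 %>%
  inner_join(df2, by = c("key1", "key2")) %>%
  # with start <= end on both sides, [sam1, sam2] is inside [time1, time2]
  # exactly when it starts no earlier and ends no later
  mutate(time_condition = sam1 >= time1 & sam2 <= time2) %>%
  select(key1, key2, time_condition)

print(result)
```

Keeping the start and end times as plain POSIXct columns until after the join also sidesteps joining data frames that contain interval columns, which is where the question suspects the error comes from.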
【Comments】:
【Answer 2】: With data.table you can evaluate the condition while joining. Here is an example:
library(data.table)
df2[df1, # left join
.(time_condition = sam1 > time1 & sam2 < time2), # condition while joining
on = .(key1, key2), # keys
by = .EACHI, # check condition per join
nomatch = 0L] # make it an inner join
# key1 key2 time_condition
# 1: a 1 TRUE
# 2: b 2 FALSE
# your data generated using data.table
df1 <- data.table(key1 = c("a", "b", "c", "d", "e"),
key2 = c(1:5),
time1 = as.ITime(c("00:00:15", "00:15:15", "00:30:15", "00:40:15", "01:10:15")),
time2 = as.ITime(c("00:05:15", "00:20:15", "00:35:15", "00:45:15", "01:15:15")))
df2 <- data.table(key1 = c("b", "c", "a", "e", "d"),
key2 = c(2, 6, 1, 8, 5),
sam1 = as.ITime(c("00:21:15", "00:31:15", "00:03:15", "01:20:15", "00:43:15")),
sam2 = as.ITime(c("00:23:15", "00:34:15", "00:04:15", "01:25:15", "00:44:15")))
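One detail worth checking: this answer compares with strict `>` and `<`, whereas Answer 1 uses `>=` and `<=`. On the sample data both give the same result, but they can disagree when endpoints coincide. A small base R sketch with made-up values (times expressed as seconds since midnight, not taken from the sample data):

```r
# Hypothetical values: t2 starts exactly when t1 starts
time1 <- 15    # 00:00:15
time2 <- 315   # 00:05:15
sam1  <- 15    # 00:00:15, same instant as time1
sam2  <- 255   # 00:04:15

strict    <- sam1 >  time1 & sam2 <  time2  # shared endpoint counts as outside
inclusive <- sam1 >= time1 & sam2 <= time2  # shared endpoint counts as inside

print(c(strict = strict, inclusive = inclusive))
```

Which variant is appropriate depends on whether an interval that touches the boundary should count as "within".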
【Comments】:
【Answer 3】: You can use the built-in function merge() for the join (%within% comes from lubridate).
df = merge(df1, df2, by = c("key1", "key2"))
df = data.frame(df[,c("key1", "key2")], time_condition = df$t2 %within% df$t1)
df
# key1 key2 time_condition
#1 a 1 TRUE
#2 b 2 FALSE
Thanks
【Comments】: