Range join data.frames - specific date column with date ranges/intervals in R

Posted: 2014-06-15 19:25:18

Question:

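While the details here are of course application specific, in the spirit of SO I am trying to keep this as general as possible. The basic problem is how to merge data.frames by date when one data.frame has specific dates and the other has date ranges. Secondly, the question asks how to handle multiple observations of a given variable, and how to include them in the final output data.frame. I am sure some of this is fairly standard, but a fairly thorough search has turned up very little.

The MRE objects I want to merge are below.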

# 'Speeches' data.frame
Speeches <- structure(list(Name = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("BBB", 
"AAA"), class = "factor"), Date = structure(c(12543, 12404, 12404, 
12404, 12373, 12362, 12345, 12320, 12207, 15450, 15449, 15449, 
15449, 15449, 15449, 15449, 15449, 15448, 15448, 15448), class = "Date")), .Names =     c("Name", 
"Date"), row.names = c("1", "1.1", "1.2", "1.3", "1.4", "1.5", 
"1.6", "1.7", "1.8", "2", "2.1", "2.2", "2.3", "2.4", "2.5", 
"2.6", "2.7", "2.8", "2.9", "2.10"), class = "data.frame")

# 'History' data.frame
History <- structure(list(Name = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L), .Label = c("BBB", "AAA"), class = "factor"), 
    Role = structure(c(1L, 2L, 3L, 3L, 3L, 4L, 1L, 2L, 3L, 3L, 
3L, 3L, 4L), .Label = c("Political groups", "National parties", 
"Member", "Substitute", "Vice-Chair", "Chair", "Vice-President", 
"Quaestor", "President", "Co-President"), class = "factor"), 
Value = structure(c(10L, 12L, 6L, 3L, 8L, 4L, 9L, 11L, 1L, 
7L, 1L, 2L, 5L), .Label = c("a", "b", "c", "d", "e", "f", 
"g", "h", "i", "j", "k", "l", "m", "n", "o"), class = "factor"), 
Role.Start = structure(c(12149, 12149, 12150, 12150, 12152, 
12150, 14439, 14439, 14441, 14503, 15358, 15411, 14441), class = "Date"), 
Role.End = structure(c(12618, 12618, 12618, 12618, 12538, 
12618, 15507, 15507, 15357, 15507, 15410, 15507, 15357), class = "Date")), .Names = c("Name", 
"Role", "Value", "Role.Start", "Role.End"), row.names = c(NA, 
13L), class = "data.frame")

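I am facing a number of difficulties.

1) While both the Speeches and History data contain date information, the first has a specific date for each entry and the second has a date range. Ideally, I want to merge so that each Speeches entry is matched to the History entries for that speaker ("Name") whose date range contains the speech date.

2) The desired output is a data.frame or data.table with one row per observation in the Speeches data.frame, and columns for Name, Date, and each of the Roles (filled with the corresponding Value). However, for a given speaker some Roles occur more than once on a given date, so I need to be able to create multiple columns for those instances.

The object below gives this output, but it was built with a very fragile and very slow for loop: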

Output <- structure(list(Name = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("BBB", 
"AAA"), class = "factor"), Date = structure(c(12543, 12404, 12404, 
12404, 12373, 12362, 12345, 12320, 12207, 15450, 15449, 15449, 
15449, 15449, 15449, 15449, 15449, 15448, 15448, 15448), class = "Date"), 
`Political groups` = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("i", 
"j"), class = "factor"), `National parties` = structure(c(2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L), .Label = c("k", "l"), class = "factor"), 
Member.1 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("f", 
"g"), class = "factor"), Member.2 = structure(c(2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L), .Label = c("b", "c"), class = "factor"), Member.3 = structure(c(NA, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA), .Label = "h", class = "factor"), Substitute = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA), .Label = "d", class = "factor")), .Names = c("Name", 
"Date", "Political groups", "National parties", "Member.1", "Member.2", 
"Member.3", "Substitute"), row.names = c("1", "1.1", "1.2", "1.3", 
"1.4", "1.5", "1.6", "1.7", "1.8", "2", "2.1", "2.2", "2.3", 
"2.4", "2.5", "2.6", "2.7", "2.8", "2.9", "2.10"), class = "data.frame")

Any help, and/or comments on how to improve this question, are welcome!

Answer 1:

Update: In v1.9.3+, overlap joins are now implemented. This is a special case in which the start and end Date in Speeches are identical. We can do this with foverlaps() as follows:

require(data.table) ## 1.9.3+
setDT(Speeches)
setDT(History)

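# foverlaps() needs an interval on both sides, so use Date as both start and end
Speeches[, `:=`(Date2 = Date, id = .I)]
# key History so that its last two key columns define the interval
setkey(History, Name, Role.Start, Role.End)

ans = foverlaps(Speeches, History, by.x=c("Name", "Date", "Date2"))[, Date2 := NULL]
# number repeated roles within each speech, then cast to wide format
ans = ans[order(id, Value)][, N := 1:.N, by=list(Name, Date, Role, id)]
ans = dcast.data.table(ans, id+Name+Date ~ Role+N, value.var="Value")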
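To make the point-in-interval idea easier to see, here is a minimal sketch on toy data. The tables `events` and `roles` and their column names are hypothetical and are not the question's data; the sketch only assumes a data.table version that provides foverlaps().

```r
library(data.table)

# Hypothetical toy data (assumption, not from the question)
events <- data.table(Name = c("A", "A", "B"),
                     Date = as.Date(c("2020-01-05", "2020-02-10", "2020-01-05")))
roles  <- data.table(Name  = c("A", "A", "B"),
                     Role  = c("r1", "r2", "r3"),
                     Start = as.Date(c("2020-01-01", "2020-02-01", "2020-03-01")),
                     End   = as.Date(c("2020-01-31", "2020-02-28", "2020-03-31")))

# Turn each point date into a zero-width interval [Date, Date2]
events[, Date2 := Date]
# The second table must be keyed; its last two key columns are the interval
setkey(roles, Name, Start, End)

# Default nomatch = NA keeps events that fall into no interval
res <- foverlaps(events, roles, by.x = c("Name", "Date", "Date2"))
print(res)
```

With the default nomatch, the "B" event should be kept with NA in the columns coming from `roles`, because its date lies outside the only "B" interval.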

This is a case for a range/interval join.

Here is the data.table way. It uses two rolling joins.

require(data.table) ## 1.9.2+
dt1 = as.data.table(Speeches)
dt2 = as.data.table(History)

# first rolling join - to get end indices
setkey(dt2, Name, Role.Start)
tmp1 = dt2[dt1, roll=Inf, which=TRUE]

# second rolling join - to get start indices
setkey(dt2, Name, Role.End)
tmp2 = dt2[dt1, roll=-Inf, which=TRUE]

# generate dt1's and dt2's corresponding row indices
idx = tmp1-tmp2+1L
idx1 = rep(seq_len(nrow(dt1)), idx)
idx2 = data.table:::vecseq(tmp2, idx, sum(idx))

dt1[, id := 1:.N] ## needed for casting later

# subset using idx1 and idx2 and bind them colwise
ans = cbind(dt1[idx1], dt2[idx2, -1L, with=FALSE])

# a little reordering to get the output correctly (factors are a pain!)
ans = ans[order(id,Value)][, N := 1:.N, by=list(Name, Date, Role, id)]

# finally cast them.
f_ans = dcast.data.table(ans, id+Name+Date ~ Role+N, value.var="Value")
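As a rough illustration of what the two roll directions return, the sketch below uses a tiny hypothetical table (`starts` and `probe` are made-up names, and the `on=` syntax assumes a reasonably recent data.table). It shows only the rolling behaviour, not the index arithmetic used above.

```r
library(data.table)

# Hypothetical toy tables (assumption): three sorted start values and three probes
starts <- data.table(Name = "A", Start = c(1, 10, 20))
probe  <- data.table(Name = "A", Date  = c(5, 15, 25))

# roll = Inf: each Date matches the row with the largest Start <= Date
locf <- starts[probe, on = c(Name = "Name", Start = "Date"),
               roll = Inf, which = TRUE]    # 1 2 3

# roll = -Inf: each Date matches the row with the smallest Start >= Date,
# or NA when no such row exists
nocb <- starts[probe, on = c(Name = "Name", Start = "Date"),
               roll = -Inf, which = TRUE]   # 2 3 NA
```

With `which = TRUE` the join returns row numbers of the left-hand table rather than the joined rows, which is what the answer's code relies on when it builds `idx1` and `idx2`.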

Here is the output:

    id Name       Date Political groups_1 National parties_1 Member_1 Member_2 Member_3 Substitute_1
 1:  1  AAA 2004-05-05                  j                  l        c        f       NA            d
 2:  2  AAA 2003-12-18                  j                  l        c        f        h            d
 3:  3  AAA 2003-12-18                  j                  l        c        f        h            d
 4:  4  AAA 2003-12-18                  j                  l        c        f        h            d
 5:  5  AAA 2003-11-17                  j                  l        c        f        h            d
 6:  6  AAA 2003-11-06                  j                  l        c        f        h            d
 7:  7  AAA 2003-10-20                  j                  l        c        f        h            d
 8:  8  AAA 2003-09-25                  j                  l        c        f        h            d
 9:  9  AAA 2003-06-04                  j                  l        c        f        h            d
10: 10  BBB 2012-04-20                  i                  k        b        g       NA           NA
11: 11  BBB 2012-04-19                  i                  k        b        g       NA           NA
12: 12  BBB 2012-04-19                  i                  k        b        g       NA           NA
13: 13  BBB 2012-04-19                  i                  k        b        g       NA           NA
14: 14  BBB 2012-04-19                  i                  k        b        g       NA           NA
15: 15  BBB 2012-04-19                  i                  k        b        g       NA           NA
16: 16  BBB 2012-04-19                  i                  k        b        g       NA           NA
17: 17  BBB 2012-04-19                  i                  k        b        g       NA           NA
18: 18  BBB 2012-04-18                  i                  k        b        g       NA           NA
19: 19  BBB 2012-04-18                  i                  k        b        g       NA           NA
20: 20  BBB 2012-04-18                  i                  k        b        g       NA           NA

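Alternatively, you can do this with the GenomicRanges package from Bioconductor, which handles ranges very well, especially when you need an additional column (Name) to join by besides the ranges. It can be installed from Bioconductor.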

require(GenomicRanges)
require(data.table)
dt1 <- as.data.table(Speeches)
dt2 <- as.data.table(History)
gr1 = GRanges(Rle(dt1$Name), IRanges(as.numeric(dt1$Date), as.numeric(dt1$Date)))
gr2 = GRanges(Rle(dt2$Name), IRanges(as.numeric(dt2$Role.Start), as.numeric(dt2$Role.End)))

olaps = findOverlaps(gr1, gr2, type="within")
idx1 = queryHits(olaps)
idx2 = subjectHits(olaps)

# from here, you can do exactly as above
dt1[, id := 1:.N]
...
...
dcast.data.table(ans, id+Name+Date ~ Role+N, value.var="Value")

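This gives the same result as above.

Comments:

This data.table approach could (after some testing) be wrapped in a nice little function (range join and/or interval join) for direct use. I think that would be helpful.

These are all great. GenomicRanges works best for my particular purposes, but I agree that some data.table functions would be a great general contribution. @jlhoward has provided another nice option below, which also works well.

Answer 2: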

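Here is an approach using sqldf(...) from the sqldf package. It produces your result, with the following exceptions:

1) The Member.n columns contain the values in alphabetical order, not in the order in which they appear in the History data frame. So Member.1 will contain c and Member.2 will contain f, rather than the other way around.
2) Your result set has all the role-related columns as factors, whereas this result set has them as character. This is easy to change if it matters.

Note that Speeches and History are the input data frames, and I use your Output data frame only to get the order of the columns.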

library(sqldf)    # for sqldf(...)
library(reshape2) # for dcast(...)

colnames(History)[4:5] <- c("Start","End")   # sqldf doesn't like "." in colnames
Speeches$id <- rownames(Speeches)            # need unique id column
result <- sqldf("select a.id, a.Name, a.Date, b.Role, b.Value 
                from Speeches a, History b 
                where a.Name=b.Name and a.Date between b.Start and b.End")
Roles <- aggregate(Role~Name+Date+id,result,function(x)
  ifelse(x=="Member",paste(x,1:length(x),sep="."),as.character(x)))$Role
result$Roles <- unlist(Roles)
result <- dcast(result,Name+Date+id~Roles,value.var="Value")
result <- result[order(result$id),]   # re-order the rows
result <- result[,colnames(Output)]   # re-order the columns

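Explanation

First, we need an id column in Speeches to tell apart the duplicated rows in the result, so we use the row names for that.
Second, we use sqldf(...) to merge the Speeches and History tables according to your criteria. Because you want to match dates against ranges, this is probably the most direct way to express it.
Third, we have to turn the multiple instances of "Member" into "Member.1", "Member.2", and so on. We do this with aggregate(...) and paste(...).
Fourth, we have to convert the SQL result from "long" format (all values in one column, distinguished by a second column, Roles) to "wide" format, with the values for each role in a separate column. We do this with dcast(...).
Finally, we reorder the rows and columns to agree with your result.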
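The same sequence of steps can also be sketched in base R, without sqldf or reshape2. This is a different technique from the answer's code, shown on hypothetical toy data (`speeches`, `history`, and their values are assumptions), and unlike the answer it numbers every role, not only "Member".

```r
# Hypothetical toy data (assumption, not the question's objects)
speeches <- data.frame(id = 1:2, Name = c("A", "A"),
                       Date = as.Date(c("2020-01-05", "2020-03-10")))
history  <- data.frame(Name = "A",
                       Role  = c("Member", "Member", "Chair"),
                       Value = c("x", "y", "z"),
                       Start = as.Date(c("2020-01-01", "2020-01-01", "2020-03-01")),
                       End   = as.Date(c("2020-01-31", "2020-12-31", "2020-03-31")))

# Join on Name, then keep rows whose Date falls inside [Start, End]
long <- merge(speeches, history, by = "Name")
long <- long[long$Date >= long$Start & long$Date <= long$End, ]

# Number repeated roles within each speech: Member.1, Member.2, ...
long <- long[order(long$id, long$Role, long$Value), ]
long$n   <- ave(seq_along(long$id), long$id, long$Role, FUN = seq_along)
long$col <- paste(long$Role, long$n, sep = ".")

# Long to wide: one column per numbered role (prefixed with "Value.")
wide <- reshape(long[, c("id", "Name", "Date", "col", "Value")],
                idvar = c("id", "Name", "Date"),
                timevar = "col", direction = "wide")
print(wide)
```

The merge-then-filter step builds every Name-matched pair before filtering, so on large inputs a database-style or interval join is likely to scale better.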

Comments:

This is also a great answer. I slightly prefer @Arun's solution because it does not require sqldf. Thanks very much.
