R: merge by ID and by a date falling between two dates


I have dataset1 with two columns, ID and Application_SubmittedDate. The Application_SubmittedDate column is a date/time column.

     ID         Application_SubmittedDate
     6972        2001-05-30 16:57:00
     6972        2003-03-08 12:30:00
     6972        2006-03-22 17:43:00
     6972        2003-08-07 20:20:00
     6972        2006-07-28 18:28:00
     6972        2001-05-25 17:14:00
     6972        2003-09-30 00:48:00
     6972        2002-06-04 18:11:00
     6972        2006-05-06 17:30:00
     6972        2003-02-24 16:02:00
     6972        2006-09-16 16:29:00
     6972        2003-02-12 22:47:00
     6972        2002-08-15 23:30:00
     6972        2002-08-31 22:32:00
     40841       2002-09-27 05:39:00
     40841       2002-01-08 09:05:00
     40841       2002-10-07 21:04:00
     40841       2002-08-17 18:50:00
     59547       2003-08-12 10:45:00
     59547       2001-02-20 17:02:00
     59547       2002-11-05 23:01:00
     60861       2003-10-27 14:40:00
     63457       2001-12-05 04:16:00
     65048       2002-12-16 10:18:00
     65048       2003-12-29 17:52:00
     65048       2005-02-20 16:58:00
     67037       2004-01-01 18:18:00
     67037       2006-06-22 01:04:00
     67037       2004-07-31 18:30:00
     67037       2004-08-04 14:09:00
     67037       2005-04-20 18:06:00
     67037       2006-06-15 16:55:00

df1 <- structure(list(ID = c(6972L, 6972L, 6972L, 6972L, 6972L, 6972L, 
6972L, 6972L, 6972L, 6972L, 6972L, 6972L, 6972L, 6972L, 40841L, 
40841L, 40841L, 40841L, 59547L, 59547L, 59547L, 60861L, 63457L, 
65048L, 65048L, 65048L, 67037L, 67037L, 67037L, 67037L, 67037L, 
67037L), Application_SubmittedDate = structure(c(991241820, 1047126600, 
1143049380, 1060287600, 1154111280, 990810840, 1064882880, 1023214260, 
1146936600, 1046102520, 1158424140, 1045090020, 1029454200, 1030833120, 
1033105140, 1010480700, 1034024640, 1029610200, 1060685100, 982688520, 
1036537260, 1067265600, 1007525760, 1040033880, 1072720320, 1108918680, 
1072981080, 1150938240, 1091298600, 1091628540, 1114020360, 1150390500
), class = c("POSIXct", "POSIXt"), tzone = "UTC")), .Names = c("ID", 
"Application_SubmittedDate"), class = "data.frame", row.names = c(1L, 
18L, 35L, 52L, 69L, 86L, 103L, 137L, 154L, 188L, 205L, 239L, 
256L, 273L, 290L, 300L, 305L, 310L, 315L, 327L, 339L, 351L, 352L, 
353L, 359L, 371L, 389L, 400L, 411L, 422L, 466L, 477L))

The second dataset has three columns: ID, Application_ProcessStartDate and Application_ProcessEndDate. The Application_ProcessStartDate and Application_ProcessEndDate columns are date/time columns.

    ID     Application_ProcessStartDate Application_ProcessEndDate
    65048  2005-02-20 12:44:22          2005-02-23 06:07:45       
    65048  2006-06-21 17:31:45          2006-06-24 01:42:41       
    111993 2006-06-21 17:31:45          2006-06-24 01:42:41      




    df2 <- structure(list(ID = c(65048L, 65048L, 111993L), Application_ProcessStartDate = structure(c(1108903462, 
1150911105, 1150911105), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
    Application_ProcessEndDate = structure(c(1109138865, 1151113361, 
    1151113361), class = c("POSIXct", "POSIXt"), tzone = "UTC")), .Names = c("ID", 
"Application_ProcessStartDate", "Application_ProcessEndDate"), row.names = c(NA, 
-3L), class = c("tbl_df", "tbl", "data.frame"))

My goal is to merge the two datasets 1) by ID and 2) within each ID, matching the df1 rows whose Application_SubmittedDate value falls between the Application_ProcessStartDate and Application_ProcessEndDate values.

The final result should look like this:

         ID         Application_SubmittedDate   Application_ProcessStartDate    Application_ProcessEndDate
         6972        2001-05-30 16:57:00
         6972        2003-03-08 12:30:00
         6972        2006-03-22 17:43:00
         6972        2003-08-07 20:20:00
         6972        2006-07-28 18:28:00
         6972        2001-05-25 17:14:00
         6972        2003-09-30 00:48:00
         6972        2002-06-04 18:11:00
         6972        2006-05-06 17:30:00
         6972        2003-02-24 16:02:00
         6972        2006-09-16 16:29:00
         6972        2003-02-12 22:47:00
         6972        2002-08-15 23:30:00
         6972        2002-08-31 22:32:00
         40841       2002-09-27 05:39:00
         40841       2002-01-08 09:05:00
         40841       2002-10-07 21:04:00
         40841       2002-08-17 18:50:00
         59547       2003-08-12 10:45:00
         59547       2001-02-20 17:02:00
         59547       2002-11-05 23:01:00
         60861       2003-10-27 14:40:00
         63457       2001-12-05 04:16:00
         65048       2002-12-16 10:18:00
         65048       2003-12-29 17:52:00         
         65048       2005-02-20 16:58:00         2005-02-20 12:44:22          2005-02-23 06:07:45  
         65048            NA                     2006-06-21 17:31:45          2006-06-24 01:42:41 
         67037       2004-01-01 18:18:00
         67037       2006-06-22 01:04:00
         67037       2004-07-31 18:30:00
         67037       2004-08-04 14:09:00
         67037       2005-04-20 18:06:00
         67037       2006-06-15 16:55:00
         111993        NA                        2006-06-21 17:31:45          2006-06-24 01:42:41

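I tried foverlaps, but as far as I can tell it does not handle date/time values, only date values, so that is ruled out. I also tried a JOIN from the sqldf library, but that only gave me INNER JOINs, not OUTER JOINs, so that is ruled out as well. I am not sure how to do this. Any help or suggestions would be much appreciated.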

Answer

One possible solution uses data.table. The approach joins df1 and df2 twice, once as a right join and once as a left join, and then merges the two results.

library(data.table)
setDT(df1)   # convert to data.table by reference
setDT(df2)

# Non-equi join that keeps every row of df2 (i); df2 rows without a match
# get NA for Application_SubmittedDate. The x. and i. prefixes pick the
# original columns of df1 (x) and df2 (i).
rhs_join <- df1[df2,
                .(ID = i.ID,
                  Application_SubmittedDate = x.Application_SubmittedDate,
                  Application_ProcessStartDate = i.Application_ProcessStartDate,
                  Application_ProcessEndDate = i.Application_ProcessEndDate),
                on = .(ID = ID, Application_SubmittedDate >= Application_ProcessStartDate,
                       Application_SubmittedDate <= Application_ProcessEndDate)]

# Non-equi join that keeps every row of df1 (i); df1 rows without a match
# get NA for the two process dates.
lhs_join <- df2[df1,
                .(ID = i.ID,
                  Application_SubmittedDate = i.Application_SubmittedDate,
                  Application_ProcessStartDate = x.Application_ProcessStartDate,
                  Application_ProcessEndDate = x.Application_ProcessEndDate),
                on = .(ID = ID, Application_ProcessStartDate <= Application_SubmittedDate,
                       Application_ProcessEndDate >= Application_SubmittedDate)]

# Merge both results on all shared columns, keeping unmatched rows from each
merge(rhs_join, lhs_join, all = TRUE)

Result

ID Application_SubmittedDate Application_ProcessStartDate Application_ProcessEndDate
 1:   6972       2001-05-25 17:14:00                         <NA>                       <NA>
 2:   6972       2001-05-30 16:57:00                         <NA>                       <NA>
 3:   6972       2002-06-04 18:11:00                         <NA>                       <NA>
.....
.....
.....
23:  63457       2001-12-05 04:16:00                         <NA>                       <NA>
24:  65048                      <NA>          2006-06-21 17:31:45        2006-06-24 01:42:41
25:  65048       2002-12-16 10:18:00                         <NA>                       <NA>
26:  65048       2003-12-29 17:52:00                         <NA>                       <NA>
27:  65048       2005-02-20 16:58:00          2005-02-20 12:44:22        2005-02-23 06:07:45
28:  67037       2004-01-01 18:18:00                         <NA>                       <NA>
29:  67037       2004-07-31 18:30:00                         <NA>                       <NA>
30:  67037       2004-08-04 14:09:00                         <NA>                       <NA>
31:  67037       2005-04-20 18:06:00                         <NA>                       <NA>
32:  67037       2006-06-15 16:55:00                         <NA>                       <NA>
33:  67037       2006-06-22 01:04:00                         <NA>                       <NA>
34: 111993                      <NA>          2006-06-21 17:31:45        2006-06-24 01:42:41
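
For readers who would rather not depend on data.table, the same two-sided logic (keep every row of both tables, and pair rows only where the submission falls inside a window) can be sketched in base R. This is only an illustration on made-up toy data: the names subs, procs, s_row and p_row are hypothetical and not part of the answer above, and because the sketch builds every same-ID pair first, it may be slow on large inputs.

```r
# Toy data with the same column layout as df1/df2 (values are made up)
subs <- data.frame(
  ID = c(1L, 1L, 2L),
  Application_SubmittedDate = as.POSIXct(
    c("2005-02-20 16:58:00", "2003-12-29 17:52:00", "2004-01-01 18:18:00"),
    tz = "UTC")
)
procs <- data.frame(
  ID = c(1L, 1L, 3L),
  Application_ProcessStartDate = as.POSIXct(
    c("2005-02-20 12:44:22", "2006-06-21 17:31:45", "2006-06-21 17:31:45"),
    tz = "UTC"),
  Application_ProcessEndDate = as.POSIXct(
    c("2005-02-23 06:07:45", "2006-06-24 01:42:41", "2006-06-24 01:42:41"),
    tz = "UTC")
)

# Tag rows so that matched and unmatched rows can be told apart later
subs$s_row  <- seq_len(nrow(subs))
procs$p_row <- seq_len(nrow(procs))

# All same-ID pairs, then keep the pairs where the submission is in the window
pairs <- merge(subs, procs, by = "ID")
hits  <- pairs[pairs$Application_SubmittedDate >= pairs$Application_ProcessStartDate &
               pairs$Application_SubmittedDate <= pairs$Application_ProcessEndDate, ]

cols <- c("ID", "Application_SubmittedDate",
          "Application_ProcessStartDate", "Application_ProcessEndDate")

# Every submission, with window dates where a match exists (NA otherwise)
left <- merge(subs,
              hits[, c("s_row", "Application_ProcessStartDate",
                       "Application_ProcessEndDate")],
              by = "s_row", all.x = TRUE)

# Windows that matched no submission (submission date becomes NA)
right <- merge(procs, hits[, c("p_row", "Application_SubmittedDate")],
               by = "p_row", all.x = TRUE)
right_only <- right[!right$p_row %in% hits$p_row, ]

result <- rbind(left[, cols], right_only[, cols])
print(result)
```

On the toy data this keeps all three submissions and adds the two windows that matched nothing, which mirrors the shape of the result shown above.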

Another answer

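The description in the question is not entirely clear, but perhaps you want one of these left joins. For the data shown in the question, they produce 32 rows and 3 rows respectively.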

library(sqldf)

sqldf("select a.*, 
              b.Application_ProcessStartDate,
              b.Application_ProcessEndDate
       from df1 a left join df2 b
       on a.ID = b.ID and 
          a.Application_SubmittedDate between 
              b.Application_ProcessStartDate and
              b.Application_ProcessEndDate")

# ID is taken from df2 (b) so that df2 rows without a match keep their ID
sqldf("select b.ID,
              a.Application_SubmittedDate,
              b.Application_ProcessStartDate,
              b.Application_ProcessEndDate
       from df2 b left join df1 a
       on a.ID = b.ID and 
          a.Application_SubmittedDate between 
              b.Application_ProcessStartDate and
              b.Application_ProcessEndDate")

Or perhaps you are looking for the union of the two:

# In the second select, ID is taken from df2 (b) so that df2 rows without
# a match keep their ID instead of getting NA
sqldf("select a.*, 
              b.Application_ProcessStartDate,
              b.Application_ProcessEndDate
       from df1 a left join df2 b
       on a.ID = b.ID and 
          a.Application_SubmittedDate between 
              b.Application_ProcessStartDate and
              b.Application_ProcessEndDate

union

select b.ID,
              a.Application_SubmittedDate,
              b.Application_ProcessStartDate,
              b.Application_ProcessEndDate
       from df2 b left join df1 a
       on a.ID = b.ID and 
          a.Application_SubmittedDate between 
              b.Application_ProcessStartDate and
              b.Application_ProcessEndDate")
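
The row counts quoted in this answer can be cross-checked without running any join: a left join returns one row per match plus one row for every left-hand row that has no match. The base R sketch below does that arithmetic on made-up toy data. The names subs, procs, n_windows and n_subs are hypothetical, and the union count assumes the inputs contain no duplicate rows.

```r
# Toy data with the same column layout as df1/df2 (values are made up)
subs <- data.frame(
  ID = c(1L, 1L, 2L),
  Application_SubmittedDate = as.POSIXct(
    c("2005-02-20 16:58:00", "2003-12-29 17:52:00", "2004-01-01 18:18:00"),
    tz = "UTC")
)
procs <- data.frame(
  ID = c(1L, 1L, 3L),
  Application_ProcessStartDate = as.POSIXct(
    c("2005-02-20 12:44:22", "2006-06-21 17:31:45", "2006-06-21 17:31:45"),
    tz = "UTC"),
  Application_ProcessEndDate = as.POSIXct(
    c("2005-02-23 06:07:45", "2006-06-24 01:42:41", "2006-06-24 01:42:41"),
    tz = "UTC")
)

# For each submission: number of same-ID windows that contain it
n_windows <- vapply(seq_len(nrow(subs)), function(k) {
  sum(procs$ID == subs$ID[k] &
      procs$Application_ProcessStartDate <= subs$Application_SubmittedDate[k] &
      procs$Application_ProcessEndDate   >= subs$Application_SubmittedDate[k])
}, integer(1))

# For each window: number of same-ID submissions that fall inside it
n_subs <- vapply(seq_len(nrow(procs)), function(k) {
  sum(subs$ID == procs$ID[k] &
      subs$Application_SubmittedDate >= procs$Application_ProcessStartDate[k] &
      subs$Application_SubmittedDate <= procs$Application_ProcessEndDate[k])
}, integer(1))

rows_left  <- sum(pmax(n_windows, 1L))        # subs left join procs
rows_right <- sum(pmax(n_subs, 1L))           # procs left join subs
rows_union <- rows_left + sum(n_subs == 0L)   # left join plus unmatched windows

print(c(left = rows_left, right = rows_right, union = rows_union))
```

Applied to the question's data, where only one submission (ID 65048, 2005-02-20 16:58:00) falls inside a window, the same arithmetic gives 32 and 3 rows for the two left joins and 34 for their union.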
