如何从两个合并的数据帧中选择完成之前和之后的特定时间间隔?
Posted
技术标签:
【中文标题】如何从两个合并的数据帧中选择完成之前和之后的特定时间间隔?【英文标题】:How do I select certain time interval before and after complete.cases from two merged dataframes? 【发布时间】:2021-11-27 10:17:36 【问题描述】:我已经按时间合并了两个数据框:
Mergeboth <- merge(test, deeptest, by.x="date.time", by.y="Bottom.Start", all=TRUE)
test
数据帧非常大(超过 100 万个数据点),而 deeptest
只有大约 100 个。使用complete.cases()
检查它正确合并了两个数据帧。现在数据框已合并,我想在 complete.cases 之前和之后选择某个时间间隔(假设为 2 分钟以便于输出 - 但我的实际目标是 12 小时),为所有这些分配一个新数字新选择的行并将其输出到新的数据框中:df_new
。
所以我当前的数据框如下所示:
MergeBoth()
Bottom.Start Dive Max.Depth Depth
1: 2015-12-22 01:07:00 NA NA 311.0
2: 2015-12-22 01:07:10 NA NA 308.5
3: 2015-12-22 01:07:20 NA NA 307.0
4: 2015-12-22 01:07:30 NA NA 306.5
5: 2015-12-22 01:07:40 NA NA 305.5
6: 2015-12-22 01:07:50 NA NA 308.5
7: 2015-12-22 01:08:00 NA NA 318.5
8: 2015-12-22 01:08:10 NA NA 331.0
9: 2015-12-22 01:08:20 NA NA 345.5
10: 2015-12-22 01:08:30 NA NA 361.0
11: 2015-12-22 01:08:40 NA NA 376.5
12: 2015-12-22 01:08:50 NA NA 392.0
13: 2015-12-22 01:09:00 NA NA 408.5
14: 2015-12-22 01:09:10 NA NA 425.5
15: 2015-12-22 01:09:20 NA NA 442.5
16: 2015-12-22 01:09:30 NA NA 459.0
17: 2015-12-22 01:09:40 NA NA 475.0
18: 2015-12-22 01:09:50 NA NA 491.0
19: 2015-12-22 01:10:00 1238 727 508.0
20: 2015-12-22 01:10:10 NA NA 523.5
21: 2015-12-22 01:10:20 NA NA 540.5
22: 2015-12-22 01:10:30 NA NA 556.5
23: 2015-12-22 01:10:40 NA NA 572.5
24: 2015-12-22 01:10:50 NA NA 581.0
25: 2015-12-22 01:11:00 NA NA 583.5
26: 2015-12-22 01:11:10 NA NA 582.0
27: 2015-12-22 01:11:20 NA NA 581.0
28: 2015-12-22 01:11:30 NA NA 581.0
29: 2015-12-22 01:11:40 NA NA 581.0
30: 2015-12-22 01:11:50 NA NA 581.0
31: 2015-12-22 01:12:00 NA NA 582.0
32: 2015-12-22 01:12:10 NA NA 581.5
33: 2015-12-22 01:12:20 NA NA 581.5
34: 2015-12-22 01:12:30 NA NA 592.5
35: 2015-12-22 01:12:40 NA NA 606.5
36: 2015-12-22 01:12:50 NA NA 621.5
37: 2015-12-22 01:13:00 NA NA 637.5
38: 2015-12-22 01:13:10 NA NA 655.0
39: 2015-12-23 07:17:00 NA NA 863.0
40: 2015-12-23 07:17:10 NA NA 863.5
41: 2015-12-23 07:17:20 NA NA 865.0
42: 2015-12-23 07:17:30 NA NA 866.0
43: 2015-12-23 07:17:40 NA NA 867.0
44: 2015-12-23 07:17:50 NA NA 867.5
45: 2015-12-23 07:18:00 NA NA 868.5
46: 2015-12-23 07:18:10 NA NA 870.0
47: 2015-12-23 07:18:20 NA NA 870.5
48: 2015-12-23 07:18:30 NA NA 871.0
49: 2015-12-23 07:18:40 NA NA 872.0
50: 2015-12-23 07:18:50 NA NA 872.0
51: 2015-12-23 07:19:00 1267 970 874.0
52: 2015-12-23 07:19:10 NA NA 875.0
53: 2015-12-23 07:19:20 NA NA 875.0
54: 2015-12-23 07:19:30 NA NA 876.0
55: 2015-12-23 07:19:40 NA NA 876.5
56: 2015-12-23 07:19:50 NA NA 876.0
57: 2015-12-23 07:20:00 NA NA 878.0
58: 2015-12-23 07:20:10 NA NA 876.0
59: 2015-12-23 07:20:20 NA NA 875.5
60: 2015-12-23 07:20:30 NA NA 875.5
61: 2015-12-23 07:20:40 NA NA 874.0
62: 2015-12-23 07:20:50 NA NA 872.5
63: 2015-12-23 07:21:00 NA NA 870.5
64: 2015-12-23 07:21:10 NA NA 867.0
65: 2015-12-23 07:21:20 NA NA 863.5
66: 2015-12-23 07:21:30 NA NA 860.5
67: 2015-12-23 07:21:40 NA NA 859.0
68: 2015-12-23 07:21:50 NA NA 861.0
69: 2015-12-23 07:22:00 NA NA 864.5
70: 2015-12-23 07:22:10 NA NA 868.5
71: 2015-12-23 07:22:20 NA NA 874.5
72: 2015-12-23 07:22:30 NA NA 882.0
73: 2015-12-23 07:22:40 NA NA 894.0
74: 2015-12-23 07:22:50 NA NA 907.0
75: 2015-12-23 07:23:00 NA NA 922.5
我正在努力做到这一点:
df_new()
Bottom.Start Dive Max.Depth Depth DiveNumber
1: 2015-12-22 01:08:00 NA NA 318.5 1
2: 2015-12-22 01:08:10 NA NA 331.0 1
3: 2015-12-22 01:08:20 NA NA 345.5 1
4: 2015-12-22 01:08:30 NA NA 361.0 1
5: 2015-12-22 01:08:40 NA NA 376.5 1
6: 2015-12-22 01:08:50 NA NA 392.0 1
7: 2015-12-22 01:09:00 NA NA 408.5 1
8: 2015-12-22 01:09:10 NA NA 425.5 1
9: 2015-12-22 01:09:20 NA NA 442.5 1
10: 2015-12-22 01:09:30 NA NA 459.0 1
11: 2015-12-22 01:09:40 NA NA 475.0 1
12: 2015-12-22 01:09:50 NA NA 491.0 1
13: 2015-12-22 01:10:00 1238 727 508.0 1
14: 2015-12-22 01:10:10 NA NA 523.5 1
15: 2015-12-22 01:10:20 NA NA 540.5 1
16: 2015-12-22 01:10:30 NA NA 556.5 1
17: 2015-12-22 01:10:40 NA NA 572.5 1
18: 2015-12-22 01:10:50 NA NA 581.0 1
19: 2015-12-22 01:11:00 NA NA 583.5 1
20: 2015-12-22 01:11:10 NA NA 582.0 1
21: 2015-12-22 01:11:20 NA NA 581.0 1
22: 2015-12-22 01:11:30 NA NA 581.0 1
23: 2015-12-22 01:11:40 NA NA 581.0 1
24: 2015-12-22 01:11:50 NA NA 581.0 1
25: 2015-12-22 01:12:00 NA NA 582.0 1
26: 2015-12-23 07:17:00 NA NA 863.0 2
27: 2015-12-23 07:17:10 NA NA 863.5 2
28: 2015-12-23 07:17:20 NA NA 865.0 2
29: 2015-12-23 07:17:30 NA NA 866.0 2
30: 2015-12-23 07:17:40 NA NA 867.0 2
31: 2015-12-23 07:17:50 NA NA 867.5 2
32: 2015-12-23 07:18:00 NA NA 868.5 2
33: 2015-12-23 07:18:10 NA NA 870.0 2
34: 2015-12-23 07:18:20 NA NA 870.5 2
35: 2015-12-23 07:18:30 NA NA 871.0 2
36: 2015-12-23 07:18:40 NA NA 872.0 2
37: 2015-12-23 07:18:50 NA NA 872.0 2
38: 2015-12-23 07:19:00 1267 970 874.0 2
39: 2015-12-23 07:19:10 NA NA 875.0 2
40: 2015-12-23 07:19:20 NA NA 875.0 2
41: 2015-12-23 07:19:30 NA NA 876.0 2
42: 2015-12-23 07:19:40 NA NA 876.5 2
43: 2015-12-23 07:19:50 NA NA 876.0 2
44: 2015-12-23 07:20:00 NA NA 878.0 2
45: 2015-12-23 07:20:10 NA NA 876.0 2
46: 2015-12-23 07:20:20 NA NA 875.5 2
47: 2015-12-23 07:20:30 NA NA 875.5 2
48: 2015-12-23 07:20:40 NA NA 874.0 2
49: 2015-12-23 07:20:50 NA NA 872.5 2
50: 2015-12-23 07:21:00 NA NA 870.5 2
这是我的 dput()
用于数据框:
structure(list(Bottom.Start = structure(c(1450746420, 1450746430,
1450746440, 1450746450, 1450746460, 1450746470, 1450746480, 1450746490,
1450746500, 1450746510, 1450746520, 1450746530, 1450746540, 1450746550,
1450746560, 1450746570, 1450746580, 1450746590, 1450746600, 1450746610,
1450746620, 1450746630, 1450746640, 1450746650, 1450746660, 1450746670,
1450746680, 1450746690, 1450746700, 1450746710, 1450746720, 1450746730,
1450746740, 1450746750, 1450746760, 1450746770, 1450746780, 1450746790,
1450855020, 1450855030, 1450855040, 1450855050, 1450855060, 1450855070,
1450855080, 1450855090, 1450855100, 1450855110, 1450855120, 1450855130,
1450855140, 1450855150, 1450855160, 1450855170, 1450855180, 1450855190,
1450855200, 1450855210, 1450855220, 1450855230, 1450855240, 1450855250,
1450855260, 1450855270, 1450855280, 1450855290, 1450855300, 1450855310,
1450855320, 1450855330, 1450855340, 1450855350, 1450855360, 1450855370,
1450855380), tzone = "UTC", class = c("POSIXct", "POSIXt")),
Dive = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, 1238L, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, 1267L, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA), Max.Depth = c(NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 727, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
970, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Depth = c(311,
308.5, 307, 306.5, 305.5, 308.5, 318.5, 331, 345.5, 361,
376.5, 392, 408.5, 425.5, 442.5, 459, 475, 491, 508, 523.5,
540.5, 556.5, 572.5, 581, 583.5, 582, 581, 581, 581, 581,
582, 581.5, 581.5, 592.5, 606.5, 621.5, 637.5, 655, 863,
863.5, 865, 866, 867, 867.5, 868.5, 870, 870.5, 871, 872,
872, 874, 875, 875, 876, 876.5, 876, 878, 876, 875.5, 875.5,
874, 872.5, 870.5, 867, 863.5, 860.5, 859, 861, 864.5, 868.5,
874.5, 882, 894, 907, 922.5)), row.names = c(NA, -75L), class = c("data.table",
"data.frame"))
我尝试了lag()
和tail()
以及包括case_when()
和mutate()
在内的dplyr 参数的组合,但似乎无法获得我想要的输出。我对 R 比较陌生,感谢您的帮助!
【问题讨论】:
【参考方案1】:我不确定合并前您的原始 data.frames 是什么样的,但这是需要考虑的事情(使用 data.table
可能相对更快)。
首先,您可以创建(或者您可能已经拥有?)一个包含完整案例的小型 data.frame。您可以枚举潜水并确定您对这些数据行感兴趣的时间范围。
然后,您可以执行非等连接并选择在确定的时间范围内具有Bottom.Start
次的数据行。
library(data.table)
library(lubridate)
df_c <- df[complete.cases(df), ]
df_c[, `:=` (DiveNumber = seq_len(.N), Start = Bottom.Start - minutes(2), End = Bottom.Start + minutes(2))]
df[df_c, .(Bottom.Start = x.Bottom.Start, Dive, Max.Depth, Depth, DiveNumber), on = .(Bottom.Start >= Start, Bottom.Start <= End)]
输出
Bottom.Start Dive Max.Depth Depth DiveNumber
1: 2015-12-22 01:08:00 NA NA 318.5 1
2: 2015-12-22 01:08:10 NA NA 331.0 1
3: 2015-12-22 01:08:20 NA NA 345.5 1
4: 2015-12-22 01:08:30 NA NA 361.0 1
5: 2015-12-22 01:08:40 NA NA 376.5 1
6: 2015-12-22 01:08:50 NA NA 392.0 1
7: 2015-12-22 01:09:00 NA NA 408.5 1
8: 2015-12-22 01:09:10 NA NA 425.5 1
9: 2015-12-22 01:09:20 NA NA 442.5 1
10: 2015-12-22 01:09:30 NA NA 459.0 1
11: 2015-12-22 01:09:40 NA NA 475.0 1
12: 2015-12-22 01:09:50 NA NA 491.0 1
13: 2015-12-22 01:10:00 1238 727 508.0 1
14: 2015-12-22 01:10:10 NA NA 523.5 1
15: 2015-12-22 01:10:20 NA NA 540.5 1
16: 2015-12-22 01:10:30 NA NA 556.5 1
17: 2015-12-22 01:10:40 NA NA 572.5 1
18: 2015-12-22 01:10:50 NA NA 581.0 1
19: 2015-12-22 01:11:00 NA NA 583.5 1
20: 2015-12-22 01:11:10 NA NA 582.0 1
21: 2015-12-22 01:11:20 NA NA 581.0 1
22: 2015-12-22 01:11:30 NA NA 581.0 1
23: 2015-12-22 01:11:40 NA NA 581.0 1
24: 2015-12-22 01:11:50 NA NA 581.0 1
25: 2015-12-22 01:12:00 NA NA 582.0 1
26: 2015-12-23 07:17:00 NA NA 863.0 2
27: 2015-12-23 07:17:10 NA NA 863.5 2
28: 2015-12-23 07:17:20 NA NA 865.0 2
29: 2015-12-23 07:17:30 NA NA 866.0 2
30: 2015-12-23 07:17:40 NA NA 867.0 2
31: 2015-12-23 07:17:50 NA NA 867.5 2
32: 2015-12-23 07:18:00 NA NA 868.5 2
33: 2015-12-23 07:18:10 NA NA 870.0 2
34: 2015-12-23 07:18:20 NA NA 870.5 2
35: 2015-12-23 07:18:30 NA NA 871.0 2
36: 2015-12-23 07:18:40 NA NA 872.0 2
37: 2015-12-23 07:18:50 NA NA 872.0 2
38: 2015-12-23 07:19:00 1267 970 874.0 2
39: 2015-12-23 07:19:10 NA NA 875.0 2
40: 2015-12-23 07:19:20 NA NA 875.0 2
41: 2015-12-23 07:19:30 NA NA 876.0 2
42: 2015-12-23 07:19:40 NA NA 876.5 2
43: 2015-12-23 07:19:50 NA NA 876.0 2
44: 2015-12-23 07:20:00 NA NA 878.0 2
45: 2015-12-23 07:20:10 NA NA 876.0 2
46: 2015-12-23 07:20:20 NA NA 875.5 2
47: 2015-12-23 07:20:30 NA NA 875.5 2
48: 2015-12-23 07:20:40 NA NA 874.0 2
49: 2015-12-23 07:20:50 NA NA 872.5 2
50: 2015-12-23 07:21:00 NA NA 870.5 2
Bottom.Start Dive Max.Depth Depth DiveNumber
【讨论】:
本,非常感谢!这非常有效:D 您介意用外行的术语解释最后两行代码的含义吗?尤其是 ':=' 和 '.' 的作用是什么?为什么最后一行有“x.Bottom.Start”?很想了解以备将来使用:):=
是data.table
中的引用赋值,没有复制数据。有关详细信息,请参阅this。在上述情况下,我只是添加了 3 个新列(一个用于 DiveNumber
计数,另外 2 个用于开始和结束时间)...至于 x.
添加到 Bottom.Start
,这可以让您参考到连接中第一个data.table的特定列,否则它们被另一个屏蔽...data.table
语法中的句点.
只是list
...
关于非等连接中使用的x.
前缀的更多细节可以在here找到。【参考方案2】:
您可以尝试使用difftime
lag
和cumsum
-
library(dplyr)
df %>%
mutate(DiveNumber = cumsum(c(TRUE, difftime(Bottom.Start,
lag(Bottom.Start), units = 'mins')[-1] > 2)))
# Bottom.Start Dive Max.Depth Depth DiveNumber
# 1: 2015-12-22 01:07:00 NA NA 311.0 1
# 2: 2015-12-22 01:07:10 NA NA 308.5 1
# 3: 2015-12-22 01:07:20 NA NA 307.0 1
# 4: 2015-12-22 01:07:30 NA NA 306.5 1
# 5: 2015-12-22 01:07:40 NA NA 305.5 1
# 6: 2015-12-22 01:07:50 NA NA 308.5 1
# 7: 2015-12-22 01:08:00 NA NA 318.5 1
# 8: 2015-12-22 01:08:10 NA NA 331.0 1
#...
#...
#37: 2015-12-22 01:13:00 NA NA 637.5 1
#38: 2015-12-22 01:13:10 NA NA 655.0 1
#39: 2015-12-23 07:17:00 NA NA 863.0 2
#40: 2015-12-23 07:17:10 NA NA 863.5 2
#41: 2015-12-23 07:17:20 NA NA 865.0 2
#42: 2015-12-23 07:17:30 NA NA 866.0 2
#43: 2015-12-23 07:17:40 NA NA 867.0 2
#...
#...
#74: 2015-12-23 07:22:50 NA NA 907.0 2
#75: 2015-12-23 07:23:00 NA NA 922.5 2
对于 12 小时的实际目标,您可以将 units = 'mins'
更改为 units = 'hours'
并将 > 2
更改为 > 12
。
【讨论】:
以上是关于如何从两个合并的数据帧中选择完成之前和之后的特定时间间隔?的主要内容,如果未能解决你的问题,请参考以下文章
spark:合并两个数据帧,如果两个数据帧中的ID重复,则df1中的行覆盖df2中的行
如何在 Scala 中连接两个数据帧并通过索引从数据帧中选择几列?