Filter grouped dataframe by value in another dataframe with same group IDs in R
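[Posted]: 2022-01-22 04:15:51
[Question]: I'm hoping someone can help with a problem I'm having with a large dataset in R. I have a data frame containing thousands of tree IDs, the measurement year, and the diameter recorded in each measurement year. I want to filter this data frame so that it keeps only the rows from before a neighbouring tree died. I have a second data frame containing the tree ID and the year the neighbouring tree died, which supplies the cutoff year for filtering.
A small subset of the original df (only 4 trees):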
tree_id year diameter
1 T1 1978 48.2
2 T1 1990 48.6
3 T1 1995 49.0
4 T1 2002 49.6
5 T1 2008 50.3
6 T1 2012 50.4
7 T1 2017 50.6
8 T2 1978 76.3
9 T2 1984 76.8
10 T2 1990 77.3
11 T2 1995 78.7
12 T2 2002 79.5
13 T2 2008 80.6
14 T2 2012 81.1
15 T2 2017 81.6
16 T3 1978 15.7
17 T3 1984 16.5
18 T3 1990 17.7
19 T3 1995 18.3
20 T3 2002 19.3
21 T3 2008 20.0
22 T3 2012 20.0
23 T3 2017 20.2
24 T4 1978 50.5
25 T4 1984 51.2
26 T4 1990 51.9
27 T4 1995 52.5
28 T4 2002 53.2
29 T4 2008 54.8
30 T4 2012 53.7
31 T4 2017 54.0
This is the data frame I would like to filter by:
tree_id neb_death
1 T1 2002
2 T2 2008
3 T3 1995
4 T4 2012
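For example, for tree_id = T1 I only want to keep the rows of the larger data frame whose measurement year is before 2002. I would greatly appreciate any help using base R or dplyr. Thank you!
[Question comments]:
What is the role of the second data.frame? In your example, isn't this what you want: df_example
[Answer 1]: You can use data.table to join on matching tree_id and year < neb_death. If the first table is df and the second is df2: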
library(data.table)
setDT(df)
setDT(df2)
df[df2, on = .(tree_id, year < neb_death)]
#> tree_id year diameter
#> <char> <int> <num>
#> 1: T1 2002 48.2
#> 2: T1 2002 48.6
#> 3: T1 2002 49.0
#> 4: T2 2008 76.3
#> 5: T2 2008 76.8
#> 6: T2 2008 77.3
#> 7: T2 2008 78.7
#> 8: T2 2008 79.5
#> 9: T3 1995 15.7
#> 10: T3 1995 16.5
#> 11: T3 1995 17.7
#> 12: T4 2012 50.5
#> 13: T4 2012 51.2
#> 14: T4 2012 51.9
#> 15: T4 2012 52.5
#> 16: T4 2012 53.2
#> 17: T4 2012 54.8
Created on 2021-12-20 by the reprex package (v2.0.1)
Data used:
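Note that in the output above the year column shows the neb_death cutoff rather than the original measurement year; data.table reports the value from the second table for a non-equi join column. The following is a minimal sketch of one way to keep the original years, using the `x.` prefix in `j` to refer to the first table's own column. The names `dt`, `dt2`, and `res` are illustrative and not part of the original answer, and `nomatch = NULL` is an added assumption so that trees with no qualifying rows are dropped instead of producing NA rows.

```r
library(data.table)

# Rebuild the example data compactly (same values as in the question).
years <- c(1978L, 1984L, 1990L, 1995L, 2002L, 2008L, 2012L, 2017L)
df <- data.frame(
  tree_id  = rep(c("T1", "T2", "T3", "T4"), times = c(7, 8, 8, 8)),
  year     = c(years[-2], rep(years, 3)),   # T1 has no 1984 measurement
  diameter = c(48.2, 48.6, 49, 49.6, 50.3, 50.4, 50.6,
               76.3, 76.8, 77.3, 78.7, 79.5, 80.6, 81.1, 81.6,
               15.7, 16.5, 17.7, 18.3, 19.3, 20, 20, 20.2,
               50.5, 51.2, 51.9, 52.5, 53.2, 54.8, 53.7, 54)
)
df2 <- data.frame(tree_id   = c("T1", "T2", "T3", "T4"),
                  neb_death = c(2002L, 2008L, 1995L, 2012L))

# Work on copies so the original data frames are left untouched.
dt  <- as.data.table(df)
dt2 <- as.data.table(df2)

# Non-equi join; x.year is the measurement year from dt, not the cutoff.
res <- dt[dt2,
          on = .(tree_id, year < neb_death),
          .(tree_id, year = x.year, diameter),
          nomatch = NULL]
res
```

With the example data this should return the same 17 rows as the join above, but with the measured years in the year column.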
df <- structure(list(tree_id = c("T1", "T1", "T1", "T1", "T1", "T1",
"T1", "T2", "T2", "T2", "T2", "T2", "T2", "T2", "T2", "T3", "T3",
"T3", "T3", "T3", "T3", "T3", "T3", "T4", "T4", "T4", "T4", "T4",
"T4", "T4", "T4"), year = c(1978L, 1990L, 1995L, 2002L, 2008L,
2012L, 2017L, 1978L, 1984L, 1990L, 1995L, 2002L, 2008L, 2012L,
2017L, 1978L, 1984L, 1990L, 1995L, 2002L, 2008L, 2012L, 2017L,
1978L, 1984L, 1990L, 1995L, 2002L, 2008L, 2012L, 2017L), diameter = c(48.2,
48.6, 49, 49.6, 50.3, 50.4, 50.6, 76.3, 76.8, 77.3, 78.7, 79.5,
80.6, 81.1, 81.6, 15.7, 16.5, 17.7, 18.3, 19.3, 20, 20, 20.2,
50.5, 51.2, 51.9, 52.5, 53.2, 54.8, 53.7, 54)), row.names = c(NA,
-31L), class = "data.frame")
df2 <- structure(list(tree_id = c("T1", "T2", "T3", "T4"), neb_death = c(2002L,
2008L, 1995L, 2012L)), row.names = c(NA, -4L), class = "data.frame")
[Comments]:
Thanks @IceCreamToucan! That's perfect! @TarJae, your solution works too! I had a real brain fart on this one. You both rock :)
[Answer 2]: We can first left_join by tree_id and then filter:
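library(dplyr)
# df holds the measurements and df2 the neighbour-death years (names as in Answer 1).
# Attach each tree's cutoff year, keep earlier measurements, drop the helper column.
left_join(df, df2, by = "tree_id") %>%
  filter(year < neb_death) %>%
  select(-neb_death)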
Output:
tree_id year diameter
<chr> <int> <dbl>
1 T1 1978 48.2
2 T1 1990 48.6
3 T1 1995 49
4 T2 1978 76.3
5 T2 1984 76.8
6 T2 1990 77.3
7 T2 1995 78.7
8 T2 2002 79.5
9 T3 1978 15.7
10 T3 1984 16.5
11 T3 1990 17.7
12 T4 1978 50.5
13 T4 1984 51.2
14 T4 1990 51.9
15 T4 1995 52.5
16 T4 2002 53.2
17 T4 2008 54.8
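As a possible alternative to joining and then filtering, newer dplyr versions can express the inequality directly in the join. This is a sketch that assumes dplyr 1.1.0 or later, where `join_by()` is available; the variable names `df`, `df2`, and `res` are illustrative.

```r
library(dplyr)

# Rebuild the example data compactly (same values as in the question).
years <- c(1978L, 1984L, 1990L, 1995L, 2002L, 2008L, 2012L, 2017L)
df <- data.frame(
  tree_id  = rep(c("T1", "T2", "T3", "T4"), times = c(7, 8, 8, 8)),
  year     = c(years[-2], rep(years, 3)),   # T1 has no 1984 measurement
  diameter = c(48.2, 48.6, 49, 49.6, 50.3, 50.4, 50.6,
               76.3, 76.8, 77.3, 78.7, 79.5, 80.6, 81.1, 81.6,
               15.7, 16.5, 17.7, 18.3, 19.3, 20, 20, 20.2,
               50.5, 51.2, 51.9, 52.5, 53.2, 54.8, 53.7, 54)
)
df2 <- data.frame(tree_id   = c("T1", "T2", "T3", "T4"),
                  neb_death = c(2002L, 2008L, 1995L, 2012L))

# Inequality join: match on tree_id and keep rows with year < neb_death,
# then drop the cutoff column that the join carries along.
res <- inner_join(df, df2, by = join_by(tree_id, year < neb_death)) %>%
  select(-neb_death)
res
```

Because this is an inner join, trees that are missing from df2 are dropped, which is the same result the left_join plus filter approach gives (their neb_death would be NA and fail the filter).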
[Comments]:
Ah. Clear. Thanks very much @IceCream Toucan.
[Answer 3]: Using base R, we can loop over df2 row by row, identify the unwanted observations in df1, and remove them.
df1[-unlist(apply(df2, 1, \(x) which(df1$tree_id == x[1] & df1$year >= x[2]))), ]
# tree_id year diameter
# 1 T1 1978 48.2
# 2 T1 1990 48.6
# 3 T1 1995 49.0
# 8 T2 1978 76.3
# 9 T2 1984 76.8
# 10 T2 1990 77.3
# 11 T2 1995 78.7
# 12 T2 2002 79.5
# 16 T3 1978 15.7
# 17 T3 1984 16.5
# 18 T3 1990 17.7
# 24 T4 1978 50.5
# 25 T4 1984 51.2
# 26 T4 1990 51.9
# 27 T4 1995 52.5
# 28 T4 2002 53.2
# 29 T4 2008 54.8
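Two caveats about the one-liner above: `apply()` first coerces df2 to a character matrix, so the year comparison is done on strings (which happens to work for four-digit years), and if no rows were flagged for removal, negative indexing with an empty vector would return zero rows. A vectorised base R sketch that avoids both issues looks up each row's cutoff with `match()`. The names `cutoff`, `keep`, and `result` are illustrative, and dropping trees that do not appear in df2 is an assumption about the desired behaviour.

```r
# Rebuild the example data compactly (same values as in the question).
years <- c(1978L, 1984L, 1990L, 1995L, 2002L, 2008L, 2012L, 2017L)
df1 <- data.frame(
  tree_id  = rep(c("T1", "T2", "T3", "T4"), times = c(7, 8, 8, 8)),
  year     = c(years[-2], rep(years, 3)),   # T1 has no 1984 measurement
  diameter = c(48.2, 48.6, 49, 49.6, 50.3, 50.4, 50.6,
               76.3, 76.8, 77.3, 78.7, 79.5, 80.6, 81.1, 81.6,
               15.7, 16.5, 17.7, 18.3, 19.3, 20, 20, 20.2,
               50.5, 51.2, 51.9, 52.5, 53.2, 54.8, 53.7, 54)
)
df2 <- data.frame(tree_id   = c("T1", "T2", "T3", "T4"),
                  neb_death = c(2002L, 2008L, 1995L, 2012L))

# For every row of df1, look up the neighbour-death year of its tree.
cutoff <- df2$neb_death[match(df1$tree_id, df2$tree_id)]

# Keep rows measured strictly before the cutoff; rows whose tree is
# absent from df2 get an NA cutoff and are dropped (assumed behaviour).
keep <- !is.na(cutoff) & df1$year < cutoff
result <- df1[keep, ]
result
```

This keeps the integer comparison intact and preserves the original row names, so the result should match the rows listed above.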
Data:
df1 <- structure(list(tree_id = c("T1", "T1", "T1", "T1", "T1", "T1",
"T1", "T2", "T2", "T2", "T2", "T2", "T2", "T2", "T2", "T3", "T3",
"T3", "T3", "T3", "T3", "T3", "T3", "T4", "T4", "T4", "T4", "T4",
"T4", "T4", "T4"), year = c(1978L, 1990L, 1995L, 2002L, 2008L,
2012L, 2017L, 1978L, 1984L, 1990L, 1995L, 2002L, 2008L, 2012L,
2017L, 1978L, 1984L, 1990L, 1995L, 2002L, 2008L, 2012L, 2017L,
1978L, 1984L, 1990L, 1995L, 2002L, 2008L, 2012L, 2017L), diameter = c(48.2,
48.6, 49, 49.6, 50.3, 50.4, 50.6, 76.3, 76.8, 77.3, 78.7, 79.5,
80.6, 81.1, 81.6, 15.7, 16.5, 17.7, 18.3, 19.3, 20, 20, 20.2,
50.5, 51.2, 51.9, 52.5, 53.2, 54.8, 53.7, 54)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24",
"25", "26", "27", "28", "29", "30", "31"))
df2 <- structure(list(tree_id = c("T1", "T2", "T3", "T4"), neb_death = c(2002L,
2008L, 1995L, 2012L)), class = "data.frame", row.names = c("1",
"2", "3", "4"))