Filter grouped dataframe by value in another dataframe with same group IDs in R

Posted: 2022-01-22 04:15:51

Question:

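I'm hoping someone can help me with a problem I have working with a large dataset in R. I have a data frame with thousands of tree IDs, the measurement years, and the diameter at each measurement year. I want to filter this data frame so that only the rows from before a neighboring tree died are kept. I have a second data frame that contains the tree ID and the year the neighboring tree died, which supplies the cutoff year for the filter.

A small portion of the original df (only 4 trees):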

   tree_id year diameter
1       T1 1978     48.2
2       T1 1990     48.6
3       T1 1995     49.0
4       T1 2002     49.6
5       T1 2008     50.3
6       T1 2012     50.4
7       T1 2017     50.6
8       T2 1978     76.3
9       T2 1984     76.8
10      T2 1990     77.3
11      T2 1995     78.7
12      T2 2002     79.5
13      T2 2008     80.6
14      T2 2012     81.1
15      T2 2017     81.6
16      T3 1978     15.7
17      T3 1984     16.5
18      T3 1990     17.7
19      T3 1995     18.3
20      T3 2002     19.3
21      T3 2008     20.0
22      T3 2012     20.0
23      T3 2017     20.2
24      T4 1978     50.5
25      T4 1984     51.2
26      T4 1990     51.9
27      T4 1995     52.5
28      T4 2002     53.2
29      T4 2008     54.8
30      T4 2012     53.7
31      T4 2017     54.0

Here is the data frame I would like to filter by:

  tree_id neb_death
1      T1      2002
2      T2      2008
3      T3      1995
4      T4      2012

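For example, for tree_id = T1 I only want to keep the rows in the larger data frame whose measurement year is before 2002. I would greatly appreciate any help using base R or a dplyr approach. Thanks!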

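Comments:

What is the role of the second data.frame? In your example, isn't this what you want: df_example

Answer 1:

You can use a data.table non-equi join that matches on tree_id and on year < neb_death. If the first table is df and the second is df2: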

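library(data.table)
setDT(df)   # convert both data frames to data.tables by reference
setDT(df2)

# Non-equi join: for each row of df2, take the rows of df with the same
# tree_id and a year earlier than neb_death.
# Note: in the result, the year column shows the neb_death value used in
# the join condition, not the original measurement year.
df[df2, on = .(tree_id, year < neb_death)]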
#>     tree_id  year diameter
#>      <char> <int>    <num>
#>  1:      T1  2002     48.2
#>  2:      T1  2002     48.6
#>  3:      T1  2002     49.0
#>  4:      T2  2008     76.3
#>  5:      T2  2008     76.8
#>  6:      T2  2008     77.3
#>  7:      T2  2008     78.7
#>  8:      T2  2008     79.5
#>  9:      T3  1995     15.7
#> 10:      T3  1995     16.5
#> 11:      T3  1995     17.7
#> 12:      T4  2012     50.5
#> 13:      T4  2012     51.2
#> 14:      T4  2012     51.9
#> 15:      T4  2012     52.5
#> 16:      T4  2012     53.2
#> 17:      T4  2012     54.8

Created on 2021-12-20 by the reprex package (v2.0.1)

Data used:

df <- structure(list(tree_id = c("T1", "T1", "T1", "T1", "T1", "T1", 
"T1", "T2", "T2", "T2", "T2", "T2", "T2", "T2", "T2", "T3", "T3", 
"T3", "T3", "T3", "T3", "T3", "T3", "T4", "T4", "T4", "T4", "T4", 
"T4", "T4", "T4"), year = c(1978L, 1990L, 1995L, 2002L, 2008L, 
2012L, 2017L, 1978L, 1984L, 1990L, 1995L, 2002L, 2008L, 2012L, 
2017L, 1978L, 1984L, 1990L, 1995L, 2002L, 2008L, 2012L, 2017L, 
1978L, 1984L, 1990L, 1995L, 2002L, 2008L, 2012L, 2017L), diameter = c(48.2, 
48.6, 49, 49.6, 50.3, 50.4, 50.6, 76.3, 76.8, 77.3, 78.7, 79.5, 
80.6, 81.1, 81.6, 15.7, 16.5, 17.7, 18.3, 19.3, 20, 20, 20.2, 
50.5, 51.2, 51.9, 52.5, 53.2, 54.8, 53.7, 54)), row.names = c(NA, 
-31L), class = "data.frame")

df2 <- structure(list(tree_id = c("T1", "T2", "T3", "T4"), neb_death = c(2002L, 
2008L, 1995L, 2012L)), row.names = c(NA, -4L), class = "data.frame")
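
The join output above carries the cutoff year in the year column. If the original measurement years are needed, one possible variation is to select the columns explicitly in j and use the x. prefix, which data.table provides for referring to columns of the outer table in a join. This is a sketch rather than part of the original answer; it rebuilds a two-tree subset of the example data inline so that it runs on its own, and the name res is illustrative.

```r
library(data.table)

# Two-tree subset of the example data (T1 and T3), rebuilt inline
df <- data.table(
  tree_id  = rep(c("T1", "T3"), times = c(7, 8)),
  year     = c(1978L, 1990L, 1995L, 2002L, 2008L, 2012L, 2017L,
               1978L, 1984L, 1990L, 1995L, 2002L, 2008L, 2012L, 2017L),
  diameter = c(48.2, 48.6, 49.0, 49.6, 50.3, 50.4, 50.6,
               15.7, 16.5, 17.7, 18.3, 19.3, 20.0, 20.0, 20.2)
)
df2 <- data.table(tree_id = c("T1", "T3"), neb_death = c(2002L, 1995L))

# x.year is the year column of df (the "x" table in x[i, j]),
# so the measurement year is returned instead of the cutoff year.
res <- df[df2, on = .(tree_id, year < neb_death),
          .(tree_id, year = x.year, diameter)]
print(res)
```

Selecting columns in j also makes it explicit which columns the result contains, at the cost of listing them by hand.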

Comments:

Thanks @IceCreamToucan! That's perfect! @TarJae, your solution also gets it done! I had a real brain fart on this one. You both rock :)

Answer 2:

We can first left_join by tree_id and then filter:

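library(dplyr)

# df holds the measurements; df1 here is the lookup table of neb_death
# years (named df2 in the other answers).
left_join(df, df1, by = "tree_id") %>%
  filter(year < neb_death) %>%   # keep rows before the neighbor died
  select(-neb_death)             # drop the helper column again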

Output:

   tree_id  year diameter
   <chr>   <int>    <dbl>
 1 T1       1978     48.2
 2 T1       1990     48.6
 3 T1       1995     49  
 4 T2       1978     76.3
 5 T2       1984     76.8
 6 T2       1990     77.3
 7 T2       1995     78.7
 8 T2       2002     79.5
 9 T3       1978     15.7
10 T3       1984     16.5
11 T3       1990     17.7
12 T4       1978     50.5
13 T4       1984     51.2
14 T4       1990     51.9
15 T4       1995     52.5
16 T4       2002     53.2
17 T4       2008     54.8
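
For readers who want to stay in base R, the same join-then-filter idea can be sketched with merge(). This is not part of the original answer; the example data is rebuilt inline so the block runs on its own, and the names merged and kept are illustrative.

```r
# Example data from the question, rebuilt inline
df <- data.frame(
  tree_id  = rep(c("T1", "T2", "T3", "T4"), times = c(7, 8, 8, 8)),
  year     = c(1978, 1990, 1995, 2002, 2008, 2012, 2017,
               rep(c(1978, 1984, 1990, 1995, 2002, 2008, 2012, 2017), 3)),
  diameter = c(48.2, 48.6, 49.0, 49.6, 50.3, 50.4, 50.6,
               76.3, 76.8, 77.3, 78.7, 79.5, 80.6, 81.1, 81.6,
               15.7, 16.5, 17.7, 18.3, 19.3, 20.0, 20.0, 20.2,
               50.5, 51.2, 51.9, 52.5, 53.2, 54.8, 53.7, 54.0)
)
df2 <- data.frame(tree_id   = c("T1", "T2", "T3", "T4"),
                  neb_death = c(2002, 2008, 1995, 2012))

# Attach each tree's cutoff year to its measurement rows
merged <- merge(df, df2, by = "tree_id")

# Keep only measurements taken before the neighbor died,
# then drop the helper column and restore a predictable row order
kept <- merged[merged$year < merged$neb_death, c("tree_id", "year", "diameter")]
kept <- kept[order(kept$tree_id, kept$year), ]
rownames(kept) <- NULL
print(kept)
```

merge() performs an inner join by default, so trees that have no entry in df2 would be dropped; passing all.x = TRUE would keep them, with NA as their cutoff.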

Comments:

Ah. Clear. Thanks a lot @IceCream Toucan.

Answer 3:

Using base R, we can loop over df2 row by row, identify the unwanted observations in df1, and remove them.

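# For each row of df2 (x[1] = tree_id, x[2] = neb_death), find the rows of
# df1 for that tree with year >= neb_death, then drop all of them at once.
# The \(x) lambda shorthand requires R >= 4.1.
df1[-unlist(apply(df2, 1, \(x) which(df1$tree_id == x[1] & df1$year >= x[2]))), ]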
#    tree_id year diameter
# 1       T1 1978     48.2
# 2       T1 1990     48.6
# 3       T1 1995     49.0
# 8       T2 1978     76.3
# 9       T2 1984     76.8
# 10      T2 1990     77.3
# 11      T2 1995     78.7
# 12      T2 2002     79.5
# 16      T3 1978     15.7
# 17      T3 1984     16.5
# 18      T3 1990     17.7
# 24      T4 1978     50.5
# 25      T4 1984     51.2
# 26      T4 1990     51.9
# 27      T4 1995     52.5
# 28      T4 2002     53.2
# 29      T4 2008     54.8
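
The row-by-row loop can also be written as a single vectorized lookup with match(), which may be easier to scale to thousands of trees. This is a sketch under assumptions, not the answerer's method: it uses a two-tree subset of the example data rebuilt inline, the names cutoff and kept are illustrative, and keeping trees that have no entry in df2 is a choice made here to mirror the loop, which only drops rows it finds a cutoff for. It also sidesteps two properties of the loop worth knowing: apply() coerces df2 to a character matrix, so the year comparison happens on strings, and an empty negative index may return zero rows when nothing needs dropping.

```r
# Two-tree subset of the example data (T1 and T3), rebuilt inline
df1 <- data.frame(
  tree_id  = rep(c("T1", "T3"), times = c(7, 8)),
  year     = c(1978L, 1990L, 1995L, 2002L, 2008L, 2012L, 2017L,
               1978L, 1984L, 1990L, 1995L, 2002L, 2008L, 2012L, 2017L),
  diameter = c(48.2, 48.6, 49.0, 49.6, 50.3, 50.4, 50.6,
               15.7, 16.5, 17.7, 18.3, 19.3, 20.0, 20.0, 20.2)
)
df2 <- data.frame(tree_id = c("T1", "T3"), neb_death = c(2002L, 1995L))

# Look up each measurement row's cutoff year by tree_id
# (NA for trees that do not appear in df2)
cutoff <- df2$neb_death[match(df1$tree_id, df2$tree_id)]

# Keep rows measured before the cutoff; rows without a cutoff are kept too
kept <- df1[is.na(cutoff) | df1$year < cutoff, ]
print(kept)
```

Because the filter is a logical vector rather than a negative index, it behaves the same whether zero, some, or all rows are dropped.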

Data:

df1 <- structure(list(tree_id = c("T1", "T1", "T1", "T1", "T1", "T1", 
"T1", "T2", "T2", "T2", "T2", "T2", "T2", "T2", "T2", "T3", "T3", 
"T3", "T3", "T3", "T3", "T3", "T3", "T4", "T4", "T4", "T4", "T4", 
"T4", "T4", "T4"), year = c(1978L, 1990L, 1995L, 2002L, 2008L, 
2012L, 2017L, 1978L, 1984L, 1990L, 1995L, 2002L, 2008L, 2012L, 
2017L, 1978L, 1984L, 1990L, 1995L, 2002L, 2008L, 2012L, 2017L, 
1978L, 1984L, 1990L, 1995L, 2002L, 2008L, 2012L, 2017L), diameter = c(48.2, 
48.6, 49, 49.6, 50.3, 50.4, 50.6, 76.3, 76.8, 77.3, 78.7, 79.5, 
80.6, 81.1, 81.6, 15.7, 16.5, 17.7, 18.3, 19.3, 20, 20, 20.2, 
50.5, 51.2, 51.9, 52.5, 53.2, 54.8, 53.7, 54)), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", 
"14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", 
"25", "26", "27", "28", "29", "30", "31"))

df2 <- structure(list(tree_id = c("T1", "T2", "T3", "T4"), neb_death = c(2002L, 
2008L, 1995L, 2012L)), class = "data.frame", row.names = c("1", 
"2", "3", "4"))
