data.table equivalent of dplyr::filter_at

Consider the following data:

library(data.table)
library(magrittr)

vec1 <- c("Iron", "Copper")

vec2 <- c("Defective", "Passed", "Error")

set.seed(123)
a1 <- sample(x = vec1, size = 20, replace = TRUE)
b1 <- sample(x = vec2, size = 20, replace = TRUE)

set.seed(1234)
a2 <- sample(x = vec1, size = 20, replace = TRUE)
b2 <- sample(x = vec2, size = 20, replace = TRUE)

DT <- data.table(
  c(1:20), a1, b1, a2, b2
) %>% .[order(V1)]

names(DT) <- c("id", "prod_name_1", "test_1", "prod_name_2", "test_2")

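I need to keep the rows where test_1 or test_2 has the value "Passed". In other words, if neither of these columns contains the specified value, the row is dropped. With dplyr, we can use the filter_at() verb: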

> # dplyr solution...
> 
> cols <- grep(x = names(DT), pattern = "test", value = T, ignore.case = T)
> 
> 
> DT %>% 
+   dplyr::filter_at(.vars = grep(x = names(DT), pattern = "test", value = T, ignore.case = T), 
+                    dplyr::any_vars(. == "Passed")) -> DT.2
> 
> DT.2
  id prod_name_1 test_1 prod_name_2    test_2
1  3        Iron Passed      Copper Defective
2  5      Copper Passed      Copper Defective
3  7      Copper Passed        Iron    Passed
4  8      Copper Passed        Iron     Error
5 11      Copper  Error      Copper    Passed
6 14      Copper  Error      Copper    Passed
7 16      Copper Passed      Copper     Error

Cool. Is there a similar way to do this in data.table?

This is the closest I have got:

> lapply(seq_along(cols), function(x){
+   
+   setkeyv(DT, cols[[x]])
+   
+   DT["Passed"]
+   
+ }) %>% 
+   do.call(rbind,.) %>% 
+   unique -> DT.3
> 
> DT.3
   id prod_name_1 test_1 prod_name_2    test_2
1:  3        Iron Passed      Copper Defective
2:  5      Copper Passed      Copper Defective
3:  8      Copper Passed        Iron     Error
4: 16      Copper Passed      Copper     Error
5:  7      Copper Passed        Iron    Passed
6: 11      Copper  Error      Copper    Passed
7: 14      Copper  Error      Copper    Passed
> 
> identical(data.table(DT.2)[order(id)], DT.3[order(id)])
[1] TRUE

Does anyone have a more elegant solution? Preferably one wrapped in a verb like dplyr::filter_at().

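Answer

We can specify 'cols' in .SDcols, loop over the Subset of the data.table (.SD) to test whether each value equals "Passed", use Reduce with | to collapse the resulting list of logical vectors into a single logical vector, and subset the rows with it.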

res2 <- DT[DT[,  Reduce(`|`, lapply(.SD, `==`, "Passed")), .SDcols = cols]]

Compare with the dplyr output from the OP's post:

# res1 is assumed to hold the dplyr result (DT.2 in the question)
identical(as.data.table(res1), res2)
#[1] TRUE
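
Since the question asks for something verb-like, the Reduce() idiom above can be wrapped in a small helper. The following is only a minimal sketch: the name filter_any() is hypothetical (it is not part of data.table), and the toy table is made up for illustration.

```r
library(data.table)

# Hypothetical helper (not part of data.table): keep the rows where at
# least one of the columns named in `cols` equals `value`.
filter_any <- function(dt, cols, value) {
  # One logical vector per column, OR-ed together element-wise
  keep <- dt[, Reduce(`|`, lapply(.SD, `==`, value)), .SDcols = cols]
  dt[keep]
}

# Toy data, made up for illustration
toy <- data.table(
  id     = 1:4,
  test_1 = c("Passed", "Error", "Defective", "Error"),
  test_2 = c("Error", "Passed", "Defective", "Passed")
)

cols <- grep("test", names(toy), value = TRUE, ignore.case = TRUE)
res <- filter_any(toy, cols, "Passed")
print(res$id)
```

On the toy table only row 3 has no "Passed" in either test column, so it is the one that gets dropped. The sketch assumes the table has no column literally named `value`, which would shadow the function argument inside `j`.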
Another answer

I would reshape the data...

# store the data in long form...

m = melt(DT, id.vars = "id", 
  measure.vars = patterns("prod_name", "test"), 
  value.name = c("prod_name", "test"), variable.name = "prod_num")

setorder(m, id, prod_num)      

# store binary test variable as logical...

testmap = data.table(
  old = c("Defective", "Passed", "Error"), 
  new = c(FALSE, TRUE, NA))
m[testmap, on=.(test = old), passed := i.new]

m[, test := NULL]
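
The recoding step above is an update join: rows of the lookup table are matched to rows of the main table with `on`, and `:=` writes the matched `i.new` values into a new column by reference. A minimal sketch on a made-up three-row table (the name m_toy is hypothetical):

```r
library(data.table)

# Toy long-form table, made up for illustration
m_toy <- data.table(id = 1:3, test = c("Passed", "Error", "Defective"))

# Same lookup table as in the answer: test label -> logical outcome
testmap <- data.table(
  old = c("Defective", "Passed", "Error"),
  new = c(FALSE, TRUE, NA)
)

# Update join: match m_toy$test against testmap$old, then add `passed`
# to m_toy by reference, filled from testmap$new (referred to as i.new)
m_toy[testmap, on = .(test = old), passed := i.new]
print(m_toy)
```

Here "Error" is deliberately mapped to NA, so an errored test counts as neither passed nor failed.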

So the data now looks like this:

    id prod_num prod_name passed
 1:  1        1      Iron     NA
 2:  1        2      Iron  FALSE
 3:  2        1    Copper     NA
 4:  2        2    Copper  FALSE
 5:  3        1      Iron   TRUE
 6:  3        2    Copper  FALSE
 7:  4        1    Copper     NA
 8:  4        2    Copper  FALSE
 9:  5        1    Copper   TRUE
10:  5        2    Copper  FALSE
11:  6        1      Iron     NA
12:  6        2    Copper     NA
13:  7        1    Copper   TRUE
14:  7        2      Iron   TRUE
15:  8        1    Copper   TRUE
16:  8        2      Iron     NA
17:  9        1    Copper  FALSE
18:  9        2    Copper     NA
19: 10        1      Iron  FALSE
20: 10        2    Copper  FALSE
21: 11        1    Copper     NA
22: 11        2    Copper   TRUE
23: 12        1      Iron     NA
24: 12        2    Copper  FALSE
25: 13        1    Copper     NA
26: 13        2      Iron  FALSE
27: 14        1    Copper     NA
28: 14        2    Copper   TRUE
29: 15        1      Iron  FALSE
30: 15        2      Iron  FALSE
31: 16        1    Copper   TRUE
32: 16        2    Copper     NA
33: 17        1      Iron     NA
34: 17        2      Iron  FALSE
35: 18        1      Iron  FALSE
36: 18        2      Iron  FALSE
37: 19        1      Iron  FALSE
38: 19        2      Iron     NA
39: 20        1    Copper  FALSE
40: 20        2      Iron     NA
    id prod_num prod_name passed

Then you can filter to the ids that have at least one product that passed...

res = m[, if(isTRUE(any(passed))) .SD, by=id]

    id prod_num prod_name passed
 1:  3        1      Iron   TRUE
 2:  3        2    Copper  FALSE
 3:  5        1    Copper   TRUE
 4:  5        2    Copper  FALSE
 5:  7        1    Copper   TRUE
 6:  7        2      Iron   TRUE
 7:  8        1    Copper   TRUE
 8:  8        2      Iron     NA
 9: 11        1    Copper     NA
10: 11        2    Copper   TRUE
11: 14        1    Copper     NA
12: 14        2    Copper   TRUE
13: 16        1    Copper   TRUE
14: 16        2    Copper     NA
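
The isTRUE() wrapper in the filter matters because `passed` holds NA for "Error" results. A short sketch of the three-valued logic involved, using made-up vectors:

```r
# any() is TRUE as soon as one element is TRUE, even if NAs are present
a <- any(c(NA, TRUE))

# With no TRUE and at least one NA, any() returns NA rather than FALSE
b <- any(c(NA, FALSE))

# isTRUE() turns that NA into FALSE, so if() receives a usable condition
print(c(isTRUE(a), isTRUE(b)))
```

Without isTRUE(), a group whose values are only NA and FALSE would make `if (NA)` raise an error instead of being dropped.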

For easier browsing, cast back to wide form...

dcast(res, id ~ prod_num, value.var = c("prod_name", "passed"))

   id prod_name_1 prod_name_2 passed_1 passed_2
1:  3        Iron      Copper     TRUE    FALSE
2:  5      Copper      Copper     TRUE    FALSE
3:  7      Copper        Iron     TRUE     TRUE
4:  8      Copper        Iron     TRUE       NA
5: 11      Copper      Copper       NA     TRUE
6: 14      Copper      Copper       NA     TRUE
7: 16      Copper      Copper     TRUE       NA
