data.table equivalent of dplyr::filter_at
Consider the following data:
library(data.table)
library(magrittr)
vec1 <- c("Iron", "Copper")
vec2 <- c("Defective", "Passed", "Error")
set.seed(123)
a1 <- sample(x = vec1, size = 20, replace = T)
b1 <- sample(x = vec2, size = 20, replace = T)
set.seed(1234)
a2 <- sample(x = vec1, size = 20, replace = T)
b2 <- sample(x = vec2, size = 20, replace = T)
DT <- data.table(
c(1:20), a1, b1, a2, b2
) %>% .[order(V1)]
names(DT) <- c("id", "prod_name_1", "test_1", "prod_name_2", "test_2")
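I need to keep the rows where test_1 or test_2 has the value "Passed". In other words, if neither of these columns has that value, the row is dropped. With dplyr, we can use the filter_at() verb: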
> # dplyr solution...
>
> cols <- grep(x = names(DT), pattern = "test", value = T, ignore.case = T)
>
>
> DT %>%
+ dplyr::filter_at(.vars = grep(x = names(DT), pattern = "test", value = T, ignore.case = T),
+ dplyr::any_vars(. == "Passed")) -> DT.2
>
> DT.2
id prod_name_1 test_1 prod_name_2 test_2
1 3 Iron Passed Copper Defective
2 5 Copper Passed Copper Defective
3 7 Copper Passed Iron Passed
4 8 Copper Passed Iron Error
5 11 Copper Error Copper Passed
6 14 Copper Error Copper Passed
7 16 Copper Passed Copper Error
Cool. Is there a similar way to do this in data.table?
This is the closest I have got:
> lapply(seq_along(cols), function(x){
+
+ setkeyv(DT, cols[[x]])
+
+ DT["Passed"]
+
+ }) %>%
+ do.call(rbind,.) %>%
+ unique -> DT.3
>
> DT.3
id prod_name_1 test_1 prod_name_2 test_2
1: 3 Iron Passed Copper Defective
2: 5 Copper Passed Copper Defective
3: 8 Copper Passed Iron Error
4: 16 Copper Passed Copper Error
5: 7 Copper Passed Iron Passed
6: 11 Copper Error Copper Passed
7: 14 Copper Error Copper Passed
>
> identical(data.table(DT.2)[order(id)], DT.3[order(id)])
[1] TRUE
Does anyone have a more elegant solution? Preferably one wrapped in a verb like dplyr::filter_at().
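Answer
We can specify 'cols' in .SDcols, loop through the Subset of Data.table (.SD) to check whether each value equals "Passed", Reduce the resulting list of logical vectors to a single logical vector with |, and use that vector to subset the rows: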
res2 <- DT[DT[, Reduce(`|`, lapply(.SD, `==`, "Passed")), .SDcols = cols]]
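Comparing with the dplyr output from the OP's post (assumed here to be stored as res1; it is called DT.2 in the question):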
identical(as.data.table(res1), res2)
#[1] TRUE
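Since the question asks for something verb-like, the Reduce idiom above could be wrapped in a small helper. This is only a sketch: filter_any is a hypothetical name, not a data.table function, and the toy table is made up for illustration. It also assumes the table has no column called target, because names used inside j are looked up among the columns first.

```r
library(data.table)

# Hypothetical helper (not part of data.table): keep the rows where
# at least one of the columns in `cols` equals `target`.
filter_any <- function(dt, cols, target) {
  # one logical vector per column, combined element-wise with `|`
  keep <- dt[, Reduce(`|`, lapply(.SD, `==`, target)), .SDcols = cols]
  # which() drops NA results, so rows with missing values and no match are removed
  dt[which(keep)]
}

# made-up toy data, not the data from the question
toy <- data.table(
  id     = 1:4,
  test_1 = c("Passed", "Error", "Defective", "Error"),
  test_2 = c("Error", "Passed", "Defective", NA)
)

test_cols <- grep("test", names(toy), value = TRUE, ignore.case = TRUE)
out <- filter_any(toy, test_cols, "Passed")
```

Here out keeps ids 1 and 2. Row 4 is dropped because FALSE | NA is NA, which the helper treats as "no match".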
Another answer
I would reshape the data...
# store the data in long form...
m = melt(DT, id.vars = "id",
measure.vars = patterns("prod_name", "test"),
value.name = c("prod_name", "test"), variable.name = "prod_num")
setorder(m, id, prod_num)
# store binary test variable as logical...
testmap = data.table(
old = c("Defective", "Passed", "Error"),
new = c(FALSE, TRUE, NA))
m[testmap, on=.(test = old), passed := i.new]
m[, test := NULL]
So the data now looks like
id prod_num prod_name passed
1: 1 1 Iron NA
2: 1 2 Iron FALSE
3: 2 1 Copper NA
4: 2 2 Copper FALSE
5: 3 1 Iron TRUE
6: 3 2 Copper FALSE
7: 4 1 Copper NA
8: 4 2 Copper FALSE
9: 5 1 Copper TRUE
10: 5 2 Copper FALSE
11: 6 1 Iron NA
12: 6 2 Copper NA
13: 7 1 Copper TRUE
14: 7 2 Iron TRUE
15: 8 1 Copper TRUE
16: 8 2 Iron NA
17: 9 1 Copper FALSE
18: 9 2 Copper NA
19: 10 1 Iron FALSE
20: 10 2 Copper FALSE
21: 11 1 Copper NA
22: 11 2 Copper TRUE
23: 12 1 Iron NA
24: 12 2 Copper FALSE
25: 13 1 Copper NA
26: 13 2 Iron FALSE
27: 14 1 Copper NA
28: 14 2 Copper TRUE
29: 15 1 Iron FALSE
30: 15 2 Iron FALSE
31: 16 1 Copper TRUE
32: 16 2 Copper NA
33: 17 1 Iron NA
34: 17 2 Iron FALSE
35: 18 1 Iron FALSE
36: 18 2 Iron FALSE
37: 19 1 Iron FALSE
38: 19 2 Iron NA
39: 20 1 Copper FALSE
40: 20 2 Iron NA
id prod_num prod_name passed
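The recoding step above is an update join: each row of the lookup table is matched against the data, and the replacement value is written into a new column by reference. A minimal sketch on a made-up table (toy is not the data from the question):

```r
library(data.table)

# made-up toy data for illustration
toy <- data.table(id = 1:4, test = c("Passed", "Error", "Defective", "Passed"))

# lookup table: one row per old value, together with its replacement
testmap <- data.table(
  old = c("Defective", "Passed", "Error"),
  new = c(FALSE, TRUE, NA)
)

# for the rows of toy whose `test` matches `old` in testmap,
# write testmap's `new` (referred to as i.new) into a new column `passed`
toy[testmap, on = .(test = old), passed := i.new]
```

The i. prefix refers to a column of the table given in i (here testmap), which keeps the reference unambiguous even when both tables share column names.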
Then you can filter down to the ids that have a passed product...
res = m[, if(isTRUE(any(passed))) .SD, by=id]
id prod_num prod_name passed
1: 3 1 Iron TRUE
2: 3 2 Copper FALSE
3: 5 1 Copper TRUE
4: 5 2 Copper FALSE
5: 7 1 Copper TRUE
6: 7 2 Iron TRUE
7: 8 1 Copper TRUE
8: 8 2 Iron NA
9: 11 1 Copper NA
10: 11 2 Copper TRUE
11: 14 1 Copper NA
12: 14 2 Copper TRUE
13: 16 1 Copper TRUE
14: 16 2 Copper NA
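The isTRUE() wrapper in the filter above matters because any() returns NA, not FALSE, when a group contains no TRUE but at least one NA. A small sketch on made-up data (toy_long is hypothetical) shows the difference:

```r
library(data.table)

# made-up long-form data: id 1 has a pass, id 2 has only NA/FALSE, id 3 only FALSE
toy_long <- data.table(
  id     = c(1L, 1L, 2L, 2L, 3L, 3L),
  passed = c(NA, TRUE, NA, FALSE, FALSE, FALSE)
)

# any() by group: TRUE for id 1, NA for id 2, FALSE for id 3
any_by_id <- toy_long[, .(any_passed = any(passed)), by = id]

# isTRUE() turns the NA into FALSE, so only groups with a real TRUE return .SD;
# groups whose j evaluates to NULL are dropped from the result
kept <- toy_long[, if (isTRUE(any(passed))) .SD, by = id]
```

Without isTRUE(), the if() condition for id 2 would be NA and R would raise an error.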
For easier browsing...
dcast(res, id ~ prod_num, value.var = c("prod_name", "passed"))
id prod_name_1 prod_name_2 passed_1 passed_2
1: 3 Iron Copper TRUE FALSE
2: 5 Copper Copper TRUE FALSE
3: 7 Copper Iron TRUE TRUE
4: 8 Copper Iron TRUE NA
5: 11 Copper Copper NA TRUE
6: 14 Copper Copper NA TRUE
7: 16 Copper Copper TRUE NA