Calculating ALL nested level aggregations of a specific column (SUM, AVG, STDEV) in a dataframe
【Posted】: 2020-05-09 21:09:54
【Question】: I have a table that looks like this (simplified):
col_A col_B col_C
A 37 2
B 28 7
C 10 5
D 11 5
E 99 4
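I want to get a table that contains all nested combinations of the levels of col_A, with the average computed within each subgroup. For example, the choose-any-2 table would look like this (10 unique level combinations):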
Grp_2 AVG (col_B/col_C)
A,B 7.76
A,C 6.61
A,D 7.55
… …
D,E 12.99
The choose-any-4 table would look like this (5 unique level combinations):
Grp_4 AVG (col_B/col_C)
A,B,C,D 7.84
A,B,C,E 6.68
A,C,D,E 7.63
… …
B,C,D,E 13.12
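The counts quoted for the two tables (10 and 5) are binomial coefficients, which is also why the number of subgroups grows quickly with the number of levels. A minimal Python sketch, assuming the five levels A to E from the sample table (the variable names are illustrative only):

```python
from math import comb

n_levels = 5  # levels A..E in the sample table

# Number of subgroups in each choose-any-k table, for k = 2..n_levels
counts = {k: comb(n_levels, k) for k in range(2, n_levels + 1)}

print(counts)                # subgroups per table size
print(sum(counts.values()))  # subgroups across all table sizes
```

Summed over all sizes from 2 to n, the count equals 2^n - n - 1, so any approach that enumerates every level of nesting is exponential in the number of levels.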
Accepted languages, in order of preference: R, SQL (Postgres, ANSI), Python.
My current solution in R (below) does not scale well as the number of levels of col_A grows:
library(tidyverse)

nested_subgroup_agg <- function(choice = 2, mydf = NULL) {
  # All combinations of `choice` levels, one combination per row (columns V1, V2, ...)
  dfx <-
    combn(c("A", "B", "C", "D", "E"), choice) %>%
    t() %>%
    as_tibble()
  # One hard-coded branch per value of `choice`: this is the part that does not scale
  try(if (choice <= 1)
    stop("Can't Choose less than 2 levels at a time")
  else
    if (choice == 2)
      val <- map_dbl(1:nrow(dfx), function(i)
        (mydf$col_B[mydf$col_A == dfx$V1[i]] + mydf$col_B[mydf$col_A == dfx$V2[i]]) /
          (mydf$col_C[mydf$col_A == dfx$V1[i]] + mydf$col_C[mydf$col_A == dfx$V2[i]])
      )
  else
    if (choice == 3)
      val <- map_dbl(1:nrow(dfx), function(i)
        (mydf$col_B[mydf$col_A == dfx$V1[i]] + mydf$col_B[mydf$col_A == dfx$V2[i]] + mydf$col_B[mydf$col_A == dfx$V3[i]]) /
          (mydf$col_C[mydf$col_A == dfx$V1[i]] + mydf$col_C[mydf$col_A == dfx$V2[i]] + mydf$col_C[mydf$col_A == dfx$V3[i]])
      )
  else
    if (choice == 4)
      val <- map_dbl(1:nrow(dfx), function(i)
        (mydf$col_B[mydf$col_A == dfx$V1[i]] + mydf$col_B[mydf$col_A == dfx$V2[i]] + mydf$col_B[mydf$col_A == dfx$V3[i]] + mydf$col_B[mydf$col_A == dfx$V4[i]]) /
          (mydf$col_C[mydf$col_A == dfx$V1[i]] + mydf$col_C[mydf$col_A == dfx$V2[i]] + mydf$col_C[mydf$col_A == dfx$V3[i]] + mydf$col_C[mydf$col_A == dfx$V4[i]])
      )
  )
  dfx$val <- val
  dfx
}

## Example
df <-
  tibble(
    col_A = c("A", "B", "C", "D", "E"),
    col_B = c(37, 28, 10, 11, 99),
    col_C = c(2, 7, 5, 5, 4)
  )
nested_subgroup_agg(choice = 4, mydf = df)
Can you help me improve it?
【Comments】:
I removed the SQL tag because your question is about data frames in R.
【Answer 1】: An option using data.table:
nested_subgroup_agg <- function(choice=2, mydf) {
    # Join mydf to a lookup of (group id g, col_A level) pairs: `choice` rows per combination,
    # then aggregate within each group id
    ans <- setDT(mydf)[.(g=rep(seq(choose(.N, choice)), each=choice), col_A=c(combn(col_A, choice))), on=.(col_A)][,
        .(toString(col_A), sum(col_B) / sum(col_C)), g]
    setnames(ans, names(ans)[-1L], c(paste0("Grp_", choice), "val"))[]
}

nested_subgroup_agg(3, DT)
Output:
g Grp_3 val
1: 1 A, B, C 5.357143
2: 2 A, B, D 5.428571
3: 3 A, B, E 12.615385
4: 4 A, C, D 4.833333
5: 5 A, C, E 13.272727
6: 6 A, D, E 13.363636
7: 7 B, C, D 2.882353
8: 8 B, C, E 8.562500
9: 9 B, D, E 8.625000
10: 10 C, D, E 8.571429
Data:
library(data.table)
DT <- fread("col_A col_B col_C
A 37 2
B 28 7
C 10 5
D 11 5
E 99 4")
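For readers who prefer Python (the third language the question accepts), the same sum(col_B) / sum(col_C) aggregation can be sketched with itertools.combinations from the standard library. This is an illustrative sketch, not a translation of the data.table join above: the dictionary layout and the reuse of the name nested_subgroup_agg are assumptions made for the example.

```python
from itertools import combinations

# Sample data from the question: col_A level -> (col_B, col_C)
rows = {"A": (37, 2), "B": (28, 7), "C": (10, 5), "D": (11, 5), "E": (99, 4)}


def nested_subgroup_agg(rows, choice=2):
    """Return (group label, sum(col_B) / sum(col_C)) for every combination of `choice` levels."""
    if choice < 2:
        raise ValueError("Can't choose fewer than 2 levels at a time")
    result = []
    # combinations() walks the dict keys in insertion order: A, B, C, D, E
    for combo in combinations(rows, choice):
        total_b = sum(rows[level][0] for level in combo)
        total_c = sum(rows[level][1] for level in combo)
        result.append((", ".join(combo), total_b / total_c))
    return result


for label, val in nested_subgroup_agg(rows, 3):
    print(f"{label}\t{val:.6f}")
```

Because the group size is a parameter rather than a hard-coded branch, the same function covers every choose-any-k table. The number of combinations still grows combinatorially with the number of levels.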
【Answer 2】: One idea is to use combn to get all combinations of rows (given that each row holds 1 letter), and then simply aggregate every 2 rows, i.e.
#get a df with all combination of rows
new_d <- dd[c(combn(nrow(dd), 2)),]
#Aggregate
#You can use `aggregate` or `lapply(split())`
lapply(split(new_d, rep(seq((nrow(new_d)) / 2), each = 2)), function(i)sum(i$col_C))
Data:
dput(dd)
structure(list(col_A = structure(1:5, .Label = c("A", "B", "C",
"D", "E"), class = "factor"), col_B = c(37L, 28L, 10L, 11L, 99L
), col_C = c(2L, 7L, 5L, 5L, 4L)), class = "data.frame", row.names = c(NA,
-5L))
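The expand-then-group idea above (repeat the rows once per combination, then aggregate each consecutive block of rows) could look roughly like this in Python. It is a sketch under stated assumptions: the function name aggregate_every_k_rows and the output keys are hypothetical, and the ratio column is added here because the question asks for it, whereas the R snippet above only sums col_C.

```python
from itertools import combinations
from statistics import mean, stdev

# Sample data from the question, stored column-wise
col_a = ["A", "B", "C", "D", "E"]
col_b = [37, 28, 10, 11, 99]
col_c = [2, 7, 5, 5, 4]


def aggregate_every_k_rows(k=2):
    """Expand the rows by all k-combinations, then aggregate each block of k rows."""
    n = len(col_a)
    # Flattened row indices, one combination after another
    # (analogous to c(combn(nrow(dd), k)) in R, but 0-based)
    flat = [i for combo in combinations(range(n), k) for i in combo]
    out = []
    # Walk the expanded row list in consecutive chunks of k rows
    for start in range(0, len(flat), k):
        idx = flat[start:start + k]
        b = [col_b[i] for i in idx]
        c = [col_c[i] for i in idx]
        out.append({
            "group": ",".join(col_a[i] for i in idx),
            "sum_c": sum(c),
            "avg_c": mean(c),
            "sd_c": stdev(c),  # sample standard deviation (n - 1 denominator)
            "ratio": sum(b) / sum(c),
        })
    return out


for row in aggregate_every_k_rows(2)[:3]:
    print(row)
```

Materialising the expanded index list mirrors the R approach but costs k * choose(n, k) entries in memory. Iterating over combinations() directly avoids that when only the aggregates are needed.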