Conditional sums across data frames in R

I am trying to replicate the SUMIFS function in R. I have two data frames.

Data frame 1

allReported

ID       employeeGroup
1093     Bargaining Unit
1093     Management
1093     Non-Union
55       Bargaining Unit
55       Management
55       Non-Union

Data frame 2

employeeCompSummary

ID       employeeGroup      statBenefits    regularWages
1093     Management         500.00          10000.00
1093     Management         200.00          60000.00
1093     Bargaining Unit    100.00          20000.00
1093     Bargaining Unit    150.00          30000.00
1093     Non-Union          500.00          60000.00
55       Bargaining Unit    750.00          65000.00
55       Bargaining Unit    500.00          75000.00
55       Management         250.00          45000.00
55       Management         850.00          90000.00

I am trying to sum statBenefits (and later regularWages) to create a new table that gives the following result:

ID       employeeGroup          statBenefits
1093     Bargaining Unit        250.00
1093     Management             700.00
1093     Non-Union              500.00
55       Bargaining Unit        1250.00
55       Management             1100.00
55       Non-Union              0.00

I have tried the following:

library(data.table)
setDT(allReported)[, list(total=sum(statbenefits)), list(employeeCompSummary, employeeGroup)]

and got the following error:

Error in `[.data.table`(setDT(allReported), , list(total = sum(statbenefits)),  :   column or expression 1 of 'by' or 'keyby' is type list. Do not quote column names. Usage: DT[,sum(colC),by=list(colA,month(colB))]

I also tried:

sumTest <- aggregate(allReported, by = list(employeeCompSummary), sum)

and got the following error:

Error in aggregate.data.frame(allReported, by = list(employeeCompSummary),  :   arguments must have same length

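Any help would be greatly appreciated. I have looked at other questions that seem related to this one but could not find an answer that works. I will need to do this for several quantities, so I would like to know whether there is a simple technique for it. As always, thanks to the great community on Stack Overflow.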

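Edit: dput() of the two example tables (note that here allReported holds the nine detail rows and employeeCompSummary holds the six ID/employeeGroup combinations, the reverse of the labels used above):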

allReported <- structure(list(ID = c(1093, 1093, 1093, 1093, 1093, 55, 55, 55,55), employeeGroup = c("Management", "Management", "Bargaining Unit","Bargaining Unit", "Non-Union", "Bargaining Unit", "Bargaining Unit","Management", "Management"), statBenefits = c(500, 200, 100,150, 500, 750, 500, 250, 850), regularWages = c(10000, 60000,20000, 30000, 60000, 65000, 75000, 45000, 90000)), row.names = c(NA,-9L), class = c("tbl_df", "tbl", "data.frame"))

employeeCompSummary <- structure(list(ID = c(1093, 1093, 1093, 55, 55, 55), employeeGroup =c("Bargaining Unit","Management", "Non-Union", "Bargaining Unit", "Management", "Non-Union")), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))

Answer

You can do this with the dplyr and magrittr (for %>%) packages:

library(dplyr)
library(magrittr)

df1 <- structure(list(ID = c(1093, 1093, 1093, 55, 55, 55), employeeGroup =c("Bargaining Unit","Management", "Non-Union", "Bargaining Unit", "Management", "Non-Union")), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))

df2 <- structure(list(ID = c(1093, 1093, 1093, 1093, 1093, 55, 55, 55,55), employeeGroup = c("Management", "Management", "Bargaining Unit","Bargaining Unit", "Non-Union", "Bargaining Unit", "Bargaining Unit","Management", "Management"), statBenefits = c(500, 200, 100,150, 500, 750, 500, 250, 850), regularWages = c(10000, 60000,20000, 30000, 60000, 65000, 75000, 45000, 90000)), row.names = c(NA,-9L), class = c("tbl_df", "tbl", "data.frame"))

result <- left_join(df1, df2, by = c("ID", "employeeGroup")) %>%
  group_by(ID, employeeGroup) %>%
  summarize(
    # na.rm = TRUE turns groups with no match in df2 (all NA) into 0
    statBenefits = sum(statBenefits, na.rm = TRUE),
    regularWages = sum(regularWages, na.rm = TRUE)
  )
result
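
For readers who would rather avoid extra packages, the same left-join-then-sum idea can be sketched in base R. This is only an illustration and not part of the original answer: it rebuilds df1 and df2 as plain data frames, and it fills unmatched rows with 0 before summing, because the formula interface of aggregate() drops rows containing NA by default.

```r
# Lookup table: every ID / employeeGroup combination that must appear in the result
df1 <- data.frame(
  ID = c(1093, 1093, 1093, 55, 55, 55),
  employeeGroup = rep(c("Bargaining Unit", "Management", "Non-Union"), 2)
)

# Detail rows to be summed
df2 <- data.frame(
  ID = c(1093, 1093, 1093, 1093, 1093, 55, 55, 55, 55),
  employeeGroup = c("Management", "Management", "Bargaining Unit",
                    "Bargaining Unit", "Non-Union", "Bargaining Unit",
                    "Bargaining Unit", "Management", "Management"),
  statBenefits = c(500, 200, 100, 150, 500, 750, 500, 250, 850),
  regularWages = c(10000, 60000, 20000, 30000, 60000, 65000, 75000, 45000, 90000)
)

# Left join: all.x = TRUE keeps combinations that have no detail rows
joined <- merge(df1, df2, by = c("ID", "employeeGroup"), all.x = TRUE)

# Unmatched combinations come back as NA; treat them as 0 so they survive aggregate()
joined$statBenefits[is.na(joined$statBenefits)] <- 0
joined$regularWages[is.na(joined$regularWages)] <- 0

# Sum both value columns within each ID / employeeGroup pair
result_base <- aggregate(cbind(statBenefits, regularWages) ~ ID + employeeGroup,
                         data = joined, FUN = sum)
result_base <- result_base[order(result_base$ID, result_base$employeeGroup), ]
result_base
```

The name result_base is hypothetical; it is used only to keep this sketch separate from result above.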
Another answer

Edit based on your comment: one approach is to use data.table like this:

library(data.table)
dt1 <- data.table(structure(list(ID = c(1093, 1093, 1093, 1093, 1093, 55, 55, 55,55), 
               employeeGroup = c("Management", "Management", "Bargaining Unit","Bargaining Unit", "Non-Union", "Bargaining Unit", "Bargaining Unit","Management", "Management"), statBenefits = c(500, 200, 100,150, 500, 750, 500, 250, 850), regularWages = c(10000, 60000,20000, 30000, 60000, 65000, 75000, 45000, 90000)), 
          row.names = c(NA,-9L), class = c("tbl_df", "tbl", "data.frame")), key = c("ID", "employeeGroup"))

dt2 <- data.table(structure(list(ID = c(1093, 1093, 1093, 55, 55, 55), employeeGroup =c("Bargaining Unit","Management", "Non-Union", "Bargaining Unit", "Management", "Non-Union")), 
          row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame")), key = c("ID", "employeeGroup"))



dt1[dt2][, lapply(.SD, sum), .SDcols = c("statBenefits", "regularWages"), by = c("ID", "employeeGroup")]

This gives:

     ID   employeeGroup statBenefits regularWages
1:   55 Bargaining Unit         1250       140000
2:   55      Management         1100       135000
3:   55       Non-Union           NA           NA
4: 1093 Bargaining Unit          250        50000
5: 1093      Management          700        70000
6: 1093       Non-Union          500        60000

You can replace the NA values with 0 afterwards.
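
One possible way to do that replacement, sketched with data.table's set(). The table res below is a hypothetical stand-in for the joined and summed result, reduced to two rows so the example stays short.

```r
library(data.table)

# Hypothetical stand-in for the aggregated result, including the unmatched group
res <- data.table(
  ID = c(55, 55),
  employeeGroup = c("Management", "Non-Union"),
  statBenefits = c(1100, NA),
  regularWages = c(135000, NA)
)

# Overwrite NA with 0 in place, one value column at a time
for (col in c("statBenefits", "regularWages")) {
  set(res, i = which(is.na(res[[col]])), j = col, value = 0)
}
res
```

An alternative with the same effect is to pass na.rm = TRUE to sum() in the aggregation step, so the NA values never appear.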

Another answer

I would do...

library(data.table)

# don't use setDT, since who knows if it works on tibbeldies
ar = data.table(allReported)
ecs = data.table(employeeCompSummary)

ecs[, total := ar[.SD, on=.(ID, employeeGroup), sum(x.statBenefits), by=.EACHI][, V1]]

     ID   employeeGroup total
1: 1093 Bargaining Unit   250
2: 1093      Management   700
3: 1093       Non-Union   500
4:   55 Bargaining Unit  1250
5:   55      Management  1100
6:   55       Non-Union    NA

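Even though the OP asked for a new table, this code adds a column to ecs. The new table and ecs would have the same set of rows, so carrying both around seems like a waste of mental energy. Removing the column later is simple.

If you want to understand how this "update join" works, try working backwards...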
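
As a small sketch of that last point (the two-row ecs here is a made-up stand-in, and new_table is a hypothetical name), a separate table can be taken with copy() before the added column is dropped by reference:

```r
library(data.table)

# Made-up stand-in for ecs after the update join has added `total`
ecs <- data.table(
  ID = c(1093, 55),
  employeeGroup = c("Management", "Non-Union"),
  total = c(700, NA)
)

# copy() gives an independent table; plain assignment would share the same data
new_table <- copy(ecs)

# Drop the added column from ecs by reference
ecs[, total := NULL]
```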

ar[ecs, on=.(ID, employeeGroup), sum(x.statBenefits), by=.EACHI]

# or

ar[ecs, on=.(ID, employeeGroup)]

Note that .SD refers to ecs in the original code. See ?.SD
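
If 0 is wanted instead of NA for combinations with no matching rows, one variation (a sketch, not part of the original answer) is to pass na.rm = TRUE to sum() inside the join. The tables are rebuilt here as plain data.tables, following the dput() naming, so that the example runs on its own.

```r
library(data.table)

# Detail rows (allReported in the dput() output)
ar <- data.table(
  ID = c(1093, 1093, 1093, 1093, 1093, 55, 55, 55, 55),
  employeeGroup = c("Management", "Management", "Bargaining Unit",
                    "Bargaining Unit", "Non-Union", "Bargaining Unit",
                    "Bargaining Unit", "Management", "Management"),
  statBenefits = c(500, 200, 100, 150, 500, 750, 500, 250, 850)
)

# One row per ID / employeeGroup combination (employeeCompSummary in the dput() output)
ecs <- data.table(
  ID = c(1093, 1093, 1093, 55, 55, 55),
  employeeGroup = rep(c("Bargaining Unit", "Management", "Non-Union"), 2)
)

# For each row of ecs, sum the matching rows of ar; with na.rm = TRUE an
# unmatched row (a single NA) sums to 0 rather than NA
ecs[, total := ar[.SD, on = .(ID, employeeGroup),
                  sum(x.statBenefits, na.rm = TRUE), by = .EACHI][, V1]]
ecs
```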
