Summarizing with dplyr while keeping the same variable names


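I've found that data.table and dplyr give different results when asked to do the same thing. I'd like to use dplyr syntax but have it compute the way data.table does. The use case is adding a subtotal row to a table. To do that, I need to aggregate each variable while keeping the same variable name for the transformed version. data.table lets me aggregate a variable and keep the same name, then run another aggregation with that same variable; it keeps using the untransformed version. dplyr, however, uses the transformed version.

The summarise documentation says: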

# Note that with data frames, newly created summaries immediately
# overwrite existing variables
mtcars %>%
  group_by(cyl) %>%
  summarise(disp = mean(disp), sd = sd(disp))

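That is basically the problem I'm running into, but I'd like to know whether there is a good workaround. One thing I found is to give the transformed variable a different name and rename it at the end, but that doesn't feel very clean to me. If there is a good way to do subtotals, that would be great too. I looked around this site and didn't see this exact situation. Any help would be greatly appreciated!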
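The overwriting behaviour from the documentation snippet can be checked directly. The sketch below is an illustration, not part of the original question; the object names `overwritten` and `reordered` are made up for the example. It relies on `sd()` returning `NA` for a single value, which is what each group is reduced to once `disp` has been replaced by its mean.

```r
library(dplyr)

# disp is replaced by its group mean first, so sd() then sees
# one value per group and returns NA
overwritten <- mtcars %>%
  group_by(cyl) %>%
  summarise(disp = mean(disp), sd = sd(disp))

# Computing sd before overwriting disp uses the original values
reordered <- mtcars %>%
  group_by(cyl) %>%
  summarise(sd = sd(disp), disp = mean(disp))
```

Comparing `overwritten$sd` with `reordered$sd` shows the effect of argument order.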

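Here is a simple example, once with the data.table result and once with the dplyr result. I want to take this simple table and append a subtotal row that holds the weighted average of the column of interest (Total).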

library(data.table)
library(dplyr)

dt <- data.table(Group = LETTERS[1:5],
                 Count = c(1000, 1500, 1200, 2000, 5000),
                 Total = c(50, 300, 600, 400, 1000))
dt[, Count_Dist := Count/sum(Count)]
dt[, .(Count_Dist = sum(Count_Dist), Weighted_Total = sum(Count_Dist*Total))]

dt <- rbind(dt[, .(Group, Count_Dist, Total)],
      dt[, .(Group = "All", Count_Dist = sum(Count_Dist), Total = sum(Count_Dist*Total))])
setnames(dt, "Total", "Weighted_Avg_Total")

dt

df <- data.frame(Group = LETTERS[1:5],
                 Count = c(1000, 1500, 1200, 2000, 5000),
                 Total = c(50, 300, 600, 400, 1000))

df %>%
  mutate(Count_Dist = Count/sum(Count)) %>%
  summarize(Count_Dist = sum(Count_Dist),
            Weighted_Total = sum(Count_Dist*Total))

df %>% 
  mutate(Count_Dist = Count/sum(Count)) %>%
  select(Group, Count_Dist, Total) %>% 
  rbind(df %>%
          mutate(Count_Dist = Count/sum(Count)) %>%
          summarize(Group = "All",
                    Count_Dist = sum(Count_Dist),
                    Total = sum(Count_Dist*Total))) %>% 
  rename(Weighted_Avg_Total = Total)
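To make the discrepancy concrete, the following sketch compares what the dplyr pipeline above computes with the intended weighted average. The names `dplyr_result` and `expected` are introduced only for this illustration, and `stringsAsFactors = FALSE` is added so the example behaves the same on older R versions.

```r
library(dplyr)

df <- data.frame(Group = LETTERS[1:5],
                 Count = c(1000, 1500, 1200, 2000, 5000),
                 Total = c(50, 300, 600, 400, 1000),
                 stringsAsFactors = FALSE)

# summarize() evaluates its arguments in order, so by the time
# Weighted_Total is computed, Count_Dist has already collapsed to 1
dplyr_result <- df %>%
  mutate(Count_Dist = Count/sum(Count)) %>%
  summarize(Count_Dist = sum(Count_Dist),
            Weighted_Total = sum(Count_Dist*Total))

# The intended value: weights taken from the untransformed shares
expected <- sum(df$Count / sum(df$Count) * df$Total)
```

Here `dplyr_result$Weighted_Total` ends up as the plain sum of `Total`, while `expected` is the weighted average that the data.table version returns.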

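Thanks again for any help!

Answer

One possible solution is to skip the mutate step: use transmute in place of the first mutate/select step, and compute the required variables directly from the original variables, without creating an intermediate variable for the second mutate step: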

df %>% 
  transmute(Group, Count_Dist = Count/sum(Count), Weighted_Avg_Total = Total) %>% 
  bind_rows(df %>%
              summarize(Group = "All",
                        Count_Dist = sum(Count/sum(Count)),
                        Weighted_Avg_Total = sum((Count/sum(Count))*Total)))

This gives:

  Group Count_Dist Weighted_Avg_Total
1     A 0.09345794            50.0000
2     B 0.14018692           300.0000
3     C 0.11214953           600.0000
4     D 0.18691589           400.0000
5     E 0.46728972          1000.0000
6   All 1.00000000           656.0748
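Because the subtotal is a weighted average of Total with Count as the weights, base R's `weighted.mean()` is another way to avoid the intermediate variable. This is a variation on the answer above rather than what the answer itself proposes; the name `with_subtotal` is hypothetical.

```r
library(dplyr)

df <- data.frame(Group = LETTERS[1:5],
                 Count = c(1000, 1500, 1200, 2000, 5000),
                 Total = c(50, 300, 600, 400, 1000),
                 stringsAsFactors = FALSE)

with_subtotal <- df %>%
  transmute(Group, Count_Dist = Count/sum(Count), Weighted_Avg_Total = Total) %>%
  bind_rows(df %>%
              summarize(Group = "All",
                        # the shares of all groups add up to 1
                        Count_Dist = sum(Count/sum(Count)),
                        # weighted.mean() normalises the weights itself
                        Weighted_Avg_Total = weighted.mean(Total, Count)))
```

Since neither summary reuses a name that an earlier argument has overwritten, the order of the arguments inside `summarize()` does not matter in this version.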

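Another possible solution is to change the order in which the new variables are computed in dplyr, and then use select to put the columns back in the order you originally wanted: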

df %>% 
  mutate(Count_Dist = Count/sum(Count)) %>%
  select(Group, Count_Dist, Weighted_Avg_Total = Total) %>% 
  bind_rows(df %>%
              mutate(Count_Dist = Count/sum(Count)) %>%
              summarize(Group = "All",
                        Weighted_Avg_Total = sum(Count_Dist*Total),
                        Count_Dist = sum(Count_Dist)) %>% 
              select(Group, Count_Dist, Weighted_Avg_Total))

If you also want to include the Count column, you can do this (as per my comment below):

df %>% 
  transmute(Group = Group, Count_Dist = Count/sum(Count), Weighted_Avg_Total = Total, Count) %>% 
  bind_rows(df %>%
              summarize(Group = "All",
                        Count_Dist = sum(Count/sum(Count)),
                        Weighted_Avg_Total = sum((Count/sum(Count))*Total),
                        Count = sum(Count)))

Another answer

An alternative could be to use mutate twice, computing Weighted_Total as a column as well, and then sum that column in summarize:

df %>%
  mutate(Count_Dist = Count/sum(Count)) %>%
  mutate(Weighted_Total = Count_Dist*Total) %>%
  summarize(Count_Dist = sum(Count_Dist),
            Weighted_Total = sum(Weighted_Total))
Result:
  Count_Dist Weighted_Total
1          1     656.074766

And:

    df %>% 
      mutate(Count_Dist = Count/sum(Count)) %>%
      select(Group, Count_Dist, Total) %>% 
      rbind(df %>%
              mutate(Count_Dist = Count/sum(Count)) %>%
              mutate(Weighted_Total = Count_Dist*Total) %>%
              summarize(Group = "All",
                        Count_Dist = sum(Count_Dist),
                        Total = sum(Weighted_Total))) %>% 
      rename(Weighted_Avg_Total = Total)

Result:

      Group   Count_Dist Weighted_Avg_Total
    1     A 0.0934579439          50.000000
    2     B 0.1401869159         300.000000
    3     C 0.1121495327         600.000000
    4     D 0.1869158879         400.000000
    5     E 0.4672897196        1000.000000
    6   All 1.0000000000         656.074766
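If the subtotal row is needed in more than one place, the pattern above could be wrapped in a small function. The helper `add_weighted_subtotal()` below is hypothetical, not something from the original answers or from dplyr, and it assumes the input has columns named Group, Count and Total.

```r
library(dplyr)

# Hypothetical helper: appends an "All" row holding the weighted average
add_weighted_subtotal <- function(data) {
  detail <- data %>%
    mutate(Count_Dist = Count/sum(Count),
           Weighted_Total = Count_Dist*Total)

  # Weighted_Total has its own name, so nothing is overwritten mid-summary
  subtotal <- detail %>%
    summarize(Group = "All",
              Count_Dist = sum(Count_Dist),
              Weighted_Avg_Total = sum(Weighted_Total))

  detail %>%
    select(Group, Count_Dist, Weighted_Avg_Total = Total) %>%
    bind_rows(subtotal)  # bind_rows() matches columns by name
}

df <- data.frame(Group = LETTERS[1:5],
                 Count = c(1000, 1500, 1200, 2000, 5000),
                 Total = c(50, 300, 600, 400, 1000),
                 stringsAsFactors = FALSE)

result <- add_weighted_subtotal(df)
```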
