Using summarize across with multiple functions when there are missing values

Posted 2023-03-24

【Asked】: 2021-08-12 09:43:21

【Question】:

If I want to get the mean and sum of every numeric column in the mtcars dataset, I use the following code:

library(dplyr)
mtcars %>% 
  group_by(gear) %>% 
  summarise(across(where(is.numeric), list(mean = mean, sum = sum)))

But how do I account for missing values in some of the columns? Here is a reproducible example:

test.df1 <- data.frame("Year" = sample(2018:2020, 20, replace = TRUE), 
                       "Firm" = head(LETTERS, 5), 
                       "Exporter"= sample(c("Yes", "No"), 20, replace = TRUE), 
                       "Revenue" = sample(100:200, 20, replace = TRUE),
                         stringsAsFactors =  FALSE)

test.df1 <- rbind(test.df1, 
                    data.frame("Year" = c(2018, 2018),
                               "Firm" = c("Y", "Z"),
                               "Exporter" = c("Yes", "No"),
                               "Revenue" = c(NA, NA)))

test.df1 <- test.df1 %>% mutate(Profit = Revenue - sample(20:30, 22, replace = TRUE ))

test.df_summarized <- test.df1 %>% group_by(Firm) %>% summarize(across(where(is.numeric)), list(mean = mean, sum = sum)))

If I just summarize each variable individually, I can use the following:

test.df1 %>% group_by(Firm) %>% summarize(Revenue_mean = mean(Revenue, na.rm = TRUE),
                                          Profit_mean = mean(Profit, na.rm = TRUE))

But I would like to figure out how to adapt the code I wrote above for mtcars to the example dataset I have provided here.


【Answer 1】:

Since both of your functions take an na.rm argument, you can pass it along through the ... argument of across():

test.df1 %>% summarize(across(where(is.numeric), list(mean = mean, sum = sum), na.rm = TRUE))
#   Year_mean Year_sum Revenue_mean Revenue_sum Profit_mean Profit_sum
# 1  2019.045    44419       162.35        3247      138.25       2765

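(I left out group_by because it wasn't specified correctly in your code, and the example illustrates the point well enough without it. Also make sure your functions are inside across().)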
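For completeness, a grouped version (with the group_by that was left out above) might look like the sketch below. It runs on a small hand-written data frame, `toy`, which is hypothetical and not the question's random sample, so the result is reproducible. As far as I know, passing extra arguments through `...` in across() has been deprecated since dplyr 1.1.0, so this sketch wraps each function in an anonymous function instead of forwarding na.rm.

```r
library(dplyr)

# Hypothetical toy data (not the question's random sample),
# with NAs in both numeric columns.
toy <- data.frame(
  Firm    = c("A", "A", "B", "B"),
  Revenue = c(100, NA, 150, 170),
  Profit  = c(80, NA, NA, 140),
  stringsAsFactors = FALSE
)

toy_summary <- toy %>%
  group_by(Firm) %>%
  summarise(
    across(
      where(is.numeric),
      list(
        # Each wrapper sets na.rm itself, so nothing is forwarded via `...`.
        mean = function(x) mean(x, na.rm = TRUE),
        sum  = function(x) sum(x, na.rm = TRUE)
      )
    ),
    .groups = "drop"
  )

toy_summary
```

The result has one row per Firm, with columns named `<column>_<function>`, for example Revenue_mean and Profit_sum.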

【Comments】:

Ah! That's what I was missing. Great. Thank you so much!

【Answer 2】:

For the record, you can also do it this way (which works even when the different functions take different arguments):

test.df1 %>% 
  summarise(across(where(is.numeric), 
                   list(
                     mean = ~ mean(.x, na.rm = TRUE), 
                     sum = ~ sum(.x, na.rm = TRUE)
                   )
  ))
#    Year_mean Year_sum Revenue_mean Revenue_sum Profit_mean Profit_sum
#  1  2019.045    44419       144.05        2881       119.3       2386
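To illustrate the "different arguments" point, here is a minimal sketch on hypothetical data (`toy`, not the question's data frame). The trimmed mean takes an extra `trim` argument, while the missing-value count takes no na.rm at all, so a single shared na.rm = TRUE could not be forwarded to every function. The names `trim_mean` and `n_missing` are chosen just for this example.

```r
library(dplyr)

# Hypothetical toy data with one NA per column and one outlier in Revenue.
toy <- data.frame(
  Revenue = c(100, 120, 140, 1000, NA),
  Profit  = c(10, 20, NA, 40, 50)
)

toy_stats <- toy %>%
  summarise(
    across(
      where(is.numeric),
      list(
        mean      = ~ mean(.x, na.rm = TRUE),               # plain mean, NAs dropped
        trim_mean = ~ mean(.x, trim = 0.25, na.rm = TRUE),  # drops 25% from each tail
        n_missing = ~ sum(is.na(.x))                        # count of NAs, no na.rm needed
      )
    )
  )

toy_stats
```

Each lambda carries its own arguments, so the output columns (Revenue_mean, Revenue_trim_mean, Revenue_n_missing, and the same for Profit) can mix functions with unrelated signatures.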
