Compare Means (ANOVA) by groups in R using dplyr

Posted 2023-03-16

Asked: 2020-10-10 22:52:09

Question:

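I have aggregated summary results (N, mean, sd) of survey questions for different subgroups (e.g. by course, age group, gender). I would like to identify the subgroups in which statistically significant differences exist, so that I can investigate those results further. Ideally, this should all work as part of preparing the data for a report in R Markdown using tidyverse / dplyr.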

My data looks like this:

> head(demo, 11)
# A tibble: 11 x 7
# Groups:   qid, subgroup [3]
     qid question subgroup name       N  mean    sd
   <int> <chr>    <chr>    <chr>  <dbl> <dbl> <dbl>
 1     1 noise     NA       total   214  3.65 1.03
 2     1 noise     course   A       11  4     0.77
 3     1 noise     course   B       47  3.55  1.16
 4     1 noise     course   C       31  3.29  1.24
 5     1 noise     course   D       40  3.8   0.85
 6     1 noise     course   E       16  3.38  1.09
 7     1 noise     course   F       11  3.55  1.13
 8     1 noise     course   G       25  4.12  0.73
 9     1 noise     course   H       25  3.68  0.85
10     1 noise     gender   f       120 3.65  1.07
11     1 noise     gender   m       93  3.67  0.98

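What I would like is a new column that is TRUE if there is a statistically significant difference within a subgroup for a given question, and FALSE otherwise. Like sigdiff below: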

     qid question subgroup name       N  mean    sd     sigdiff     
   <int> <chr>    <chr>    <chr>  <dbl> <dbl> <dbl>       <lgl>
 2     1 noise     course   A       11  4     0.77        FALSE
 3     1 noise     course   B       47  3.55  1.16        FALSE 
 4     1 noise     course   C       31  3.29  1.24        FALSE 
 5     1 noise     course   D       40  3.8   0.85        FALSE 
 6     1 noise     course   E       16  3.38  1.09        FALSE 
 7     1 noise     course   F       11  3.55  1.13        FALSE 
 8     1 noise     course   G       25  4.12  0.73        FALSE 
 9     1 noise     course   H       25  3.68  0.85        FALSE

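A very neat way to solve this seems to be to adapt this approach, which is based on the rpsychi package, to determine whether there is a significant difference between any of the groups.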

However, I failed to adapt it to work with my grouped tibble. My (failing) approach was to simply call a function that performs the ANOVA via dplyr's new group_map:

if(!require(rpsychi))install.packages("rpsychi")
library(rpsychi)
if(!require(tidyverse))install.packages("tidyverse")
library(tidyverse)

#' function establishing significant difference
#' between survey answers within subgroups

anovagrptest <- function(grpsum) {

      anovaresult <- ind.oneway.second(grpsum$mean, grpsum$sd, grpsum$N, sig.level = 0.05)

      # compare the observed F value with the critical F value
      fcrit <- qf(.95, anovaresult$anova.table$df[1], anovaresult$anova.table$df[2])
      if (anovaresult$anova.table$F[1] > fcrit) {
        return(TRUE)
      } else {
        return(FALSE)
      }
}

#' pass the subset of the data for the group to the function which 
#' "returns a list of results from calling .f on each group"

relquestions <- demo %>% 
  group_by(qid, subgroup) %>% 
  group_map(~ anovagrptest(.x))

The code aborts with "Error in delta.upper + dfb : non-numeric argument to binary operator". I would greatly appreciate your ideas.

Answer 1:

I think the row with NA is causing your problem. First of all: I don't think you need that mapping function (but honestly, I'm not 100% sure).

demo %>% 
  select(-id) %>%
  group_by(qid, subgroup) %>%
  mutate(new_column = ind.oneway.second(mean, sd, N, sig.level = 0.05) %>%
           {qf(.95, .[["anova.table"]][["df"]][1], .[["anova.table"]][["df"]][2]) < .[["anova.table"]][["F"]][1]})

results in

Error: Problem with `mutate()` input `new_column`.
x non-numeric argument for binary operator
i Input `new_column` is ``%>%`(...)`.
i The error occured in group 3: qid = 1, subgroup = NA.
Run `rlang::last_error()` to see where the error occurred.
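
A possible reading of this error, offered as an assumption rather than something checked against the rpsychi source: the `total` row has `subgroup = NA` and therefore forms a group of its own containing a single mean. A one-way ANOVA needs at least two groups; with only one, the between-group degrees of freedom are zero and the F statistic is undefined. A minimal base R sketch of that arithmetic (the variable names are illustrative only):

```r
# Illustrative only: the arithmetic for a "group" that contains a single mean
k <- 1                                  # number of means in the NA group (the total row)
ss_between <- 0                         # a single mean equals the grand mean
df_between <- k - 1                     # 0 degrees of freedom between groups
ms_between <- ss_between / df_between   # 0 / 0
is.nan(ms_between)                      # the mean square, and hence F, is undefined
```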

When I remove the rows containing NA

demo %>% 
  select(-id) %>%
  group_by(qid, subgroup) %>%
  drop_na() %>%
  mutate(new_column = ind.oneway.second(mean, sd, N, sig.level = 0.05) %>%
           {qf(.95, .[["anova.table"]][["df"]][1], .[["anova.table"]][["df"]][2]) < .[["anova.table"]][["F"]][1]})

I get

# A tibble: 10 x 8
# Groups:   qid, subgroup [2]
     qid question subgroup name      N  mean    sd new_column
   <dbl> <chr>    <chr>    <chr> <dbl> <dbl> <dbl> <lgl>  
 1     1 noise    course   A        11  4     0.77 FALSE  
 2     1 noise    course   B        47  3.55  1.16 FALSE  
 3     1 noise    course   C        31  3.29  1.24 FALSE  
 4     1 noise    course   D        40  3.8   0.85 FALSE  
 5     1 noise    course   E        16  3.38  1.09 FALSE  
 6     1 noise    course   F        11  3.55  1.13 FALSE  
 7     1 noise    course   G        25  4.12  0.73 FALSE  
 8     1 noise    course   H        25  3.68  0.85 FALSE  
 9     1 noise    gender   f       120  3.65  1.07 FALSE  
10     1 noise    gender   m        93  3.67  0.98 FALSE
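
As a cross-check that does not depend on rpsychi, the same decision can be sketched in base R from the textbook one-way ANOVA formulas. The helper `sig_from_summary` below is hypothetical (it is not part of any package), and it assumes that `sd` is the sample standard deviation (n - 1 denominator). It returns NA for a group with fewer than two rows instead of raising an error.

```r
# Hypothetical helper: one-way ANOVA F test from per-group n, mean and sd.
# Assumes s holds sample standard deviations (n - 1 denominator).
sig_from_summary <- function(n, m, s, alpha = 0.05) {
  k <- length(n)
  if (k < 2) return(NA)                  # a single row cannot be compared
  N <- sum(n)
  grand <- sum(n * m) / N                # weighted grand mean
  ss_between <- sum(n * (m - grand)^2)   # between-group sum of squares
  ss_within  <- sum((n - 1) * s^2)       # within-group sum of squares
  df1 <- k - 1
  df2 <- N - k
  f <- (ss_between / df1) / (ss_within / df2)
  f > qf(1 - alpha, df1, df2)            # TRUE if F exceeds the critical value
}

# The "course" rows from the question
course <- data.frame(
  N    = c(11, 47, 31, 40, 16, 11, 25, 25),
  mean = c(4, 3.55, 3.29, 3.8, 3.38, 3.55, 4.12, 3.68),
  sd   = c(0.77, 1.16, 1.24, 0.85, 1.09, 1.13, 0.73, 0.85)
)
sig_from_summary(course$N, course$mean, course$sd)
```

In a dplyr pipeline, a helper like this could be called after `group_by(qid, subgroup)` as `mutate(sigdiff = sig_from_summary(N, mean, sd))`; the single logical result is recycled to every row of the group, so the `total` row would receive NA rather than stopping the pipeline.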
