Using aggregate/group_by in R to group data and give a count for each factor variable?

Posted: 2022-01-12 23:56:04

Question:

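I have a data frame that looks like this. For simplicity I show only the first 6 rows, but it has 8236 rows in total. The grade ranges from 0 to 2; the example below only shows grades 0 and 1: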

 Telangiectasia_time      grade
  <chr>                    <int>
1 telangiectasia_tumour_0      0
2 telangiectasia_tumour_1      0
3 telangiectasia_tumour_12     0
4 telangiectasia_tumour_24     0
5 telangiectasia_tumour_0      1
6 telangiectasia_tumour_1      1

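I want to group by Telangiectasia_time (the first column) and then count how many times each grade occurs in each group. Taking the first 6 rows as an example, the result should look like this: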

       Telangiectasia_time grade0 grade1 grade2
1  telangiectasia_tumour_0      1      1      0
2  telangiectasia_tumour_1      1      1      0
3 telangiectasia_tumour_12      1      0      0
4 telangiectasia_tumour_24      1      0      0

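So there should be three columns at the end, one per grade, each holding the count of that grade for each Telangiectasia_time value. I tried the aggregate function:

aggregate(grade ~ Telangiectasia_time, telangiectasia_tumour_data, sum)

But I am not sure what to pass as the last argument so that it returns a count for each grade. With sum it simply adds the grade values together instead of treating 0, 1 and 2 as separate categories. On my full dataset this gives the wrong output: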

      Telangiectasia_time grade
1  telangiectasia_tumour_0    18
2  telangiectasia_tumour_1    11
3 telangiectasia_tumour_12    38
4 telangiectasia_tumour_24    87

I also tried group_by(), but that only gives me a total row count per group:

telangiectasia_tumour_data %>% group_by(Telangiectasia_time) %>% summarize(count = n())
  Telangiectasia_time      count
* <chr>                    <int>
1 telangiectasia_tumour_0   2059
2 telangiectasia_tumour_1   2059
3 telangiectasia_tumour_12  2059
4 telangiectasia_tumour_24  2059

Answer 1:

Using dplyr::count and tidyr::pivot_wider you could do:

library(dplyr)
library(tidyr)

telangiectasia_tumour_data %>% 
  count(Telangiectasia_time, grade) %>% 
  pivot_wider(names_from = grade, values_from = n, names_prefix = "grade", values_fill = 0)
#> # A tibble: 4 × 3
#>   Telangiectasia_time      grade0 grade1
#>   <chr>                     <int>  <int>
#> 1 telangiectasia_tumour_0       1      1
#> 2 telangiectasia_tumour_1       1      1
#> 3 telangiectasia_tumour_12      1      0
#> 4 telangiectasia_tumour_24      1      0
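
Note that the output above has no grade2 column, because the six sample rows contain no grade 2. If a fixed set of columns is wanted no matter which grades actually occur, one option is to count each grade explicitly inside summarise(). This is only a sketch and not part of the original answer: it assumes the grades are exactly 0, 1 and 2, as stated in the question, and the result name grade_counts is made up for illustration.

```r
library(dplyr)

# Same six sample rows as in the question.
telangiectasia_tumour_data <- data.frame(
  Telangiectasia_time = c(
    "telangiectasia_tumour_0", "telangiectasia_tumour_1",
    "telangiectasia_tumour_12", "telangiectasia_tumour_24",
    "telangiectasia_tumour_0", "telangiectasia_tumour_1"
  ),
  grade = c(0L, 0L, 0L, 0L, 1L, 1L)
)

# sum() over a logical vector counts the TRUE values, so each column
# holds the number of rows with that grade in the group.
# Assumption: the only possible grades are 0, 1 and 2.
grade_counts <- telangiectasia_tumour_data %>%
  group_by(Telangiectasia_time) %>%
  summarise(
    grade0 = sum(grade == 0),
    grade1 = sum(grade == 1),
    grade2 = sum(grade == 2)
  )

print(grade_counts)
```

This also shows why sum on its own went wrong in the question: sum(grade) adds up the grade values, whereas sum(grade == 1) counts the rows that have grade 1.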

Data

telangiectasia_tumour_data <- structure(list(Telangiectasia_time = c(
  "telangiectasia_tumour_0",
  "telangiectasia_tumour_1", "telangiectasia_tumour_12", "telangiectasia_tumour_24",
  "telangiectasia_tumour_0", "telangiectasia_tumour_1"
), grade = c(
  0L,
  0L, 0L, 0L, 1L, 1L
)), class = "data.frame", row.names = c(
  "1",
  "2", "3", "4", "5", "6"
))
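
For anyone who prefers to stay in base R, as the aggregate() attempt in the question suggests, a cross-tabulation with table() gives the same counts. The following is a rough sketch rather than the answerer's method: it assumes grades 0 to 2 and declares them as factor levels so that grade2 appears even when no row has that grade. The names grade_f, counts and wide are made up for illustration.

```r
# Same six sample rows as in the question.
telangiectasia_tumour_data <- data.frame(
  Telangiectasia_time = c(
    "telangiectasia_tumour_0", "telangiectasia_tumour_1",
    "telangiectasia_tumour_12", "telangiectasia_tumour_24",
    "telangiectasia_tumour_0", "telangiectasia_tumour_1"
  ),
  grade = c(0L, 0L, 0L, 0L, 1L, 1L)
)

# Assumption: grades are 0, 1 and 2. Declaring all three as factor levels
# makes table() emit a column for grade 2 even if it never occurs.
grade_f <- factor(telangiectasia_tumour_data$grade, levels = 0:2)

# Two-way contingency table: one row per time point, one column per grade.
counts <- table(telangiectasia_tumour_data$Telangiectasia_time, grade_f)

# Turn the table into a wide data frame with grade0/grade1/grade2 columns.
wide <- as.data.frame.matrix(counts)
names(wide) <- paste0("grade", names(wide))
wide$Telangiectasia_time <- rownames(wide)
rownames(wide) <- NULL
wide <- wide[, c("Telangiectasia_time", "grade0", "grade1", "grade2")]

print(wide)
```

On the sample data this yields four rows, with grade2 equal to 0 everywhere.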
