Apply a group-by across a list of data frames
I have a list containing two data frames.
library(tidyverse)
dat <- list("seniors" = data.frame(NAME = c("Cletus", "Agnes", "Hank", "Sue", "Maude"),
                                   COOL = c(0, 1, 1, 0, 1),
                                   GENDER = c("Male", "Female", "Male", "Female", "Female"),
                                   RACE = c("B", "B", "W", "W", "B")),
            "juniors" = data.frame(NAME = c("Chester", "Chuck", "Bruce", "Carmen", "Cleo"),
                                   COOL = c(1, 1, 1, 0, 1),
                                   GENDER = c("Male", "Male", "Male", "Female", "Female"),
                                   RACE = c("W", "W", "B", "W", "W")))
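If I want counts for a particular grouping variable across both data frames, for example gender, grouped by whether the individual is cool, I can use the following code: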
results <- lapply(names(dat), function(x)
  dat[[x]] %>%
    group_by(COOL, GENDER) %>%
    summarise(TOTAL = n()) %>%
    mutate(COHORT = x) %>%
    select(COHORT, everything())
)
do.call(rbind, results)
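However, I would like to get counts for n grouping variables without repeating the code n times, and have all the results in a single table. Note that I always want to group by COOL; it is the second grouping variable that changes.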
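My desired output is below (note that the TOTAL figures do not reflect the sample data; I am mainly trying to show the desired table structure). I also recognize that this table structure does not follow tidy principles; I just need it this way so that I can eventually use VLOOKUP on it in Excel.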
COHORT COOL GROUP_VAR GROUP_VAL TOTAL
SENIORS 0 GENDER MALE 3
SENIORS 1 GENDER MALE 5
SENIORS 0 GENDER FEMALE 7
SENIORS 1 GENDER FEMALE 2
SENIORS 0 RACE B 2
SENIORS 1 RACE B 3
SENIORS 0 RACE W 7
SENIORS 1 RACE W 9
JUNIORS 0 GENDER MALE 3
JUNIORS 1 GENDER MALE 5
JUNIORS 0 GENDER FEMALE 3
JUNIORS 1 GENDER FEMALE 1
JUNIORS 0 RACE B 2
JUNIORS 1 RACE B 7
JUNIORS 0 RACE W 3
JUNIORS 1 RACE W 2
I tried wrapping the code above in another lapply over a list of column names (see below), but this does not work:
group_names <- list("GENDER", "RACE")
lapply(names(dat), function(x)
  lapply(names(group_names), function (y)
    dat[[x]] %>%
      group_by(COOL, y) %>%
      summarise(TOTAL = n()) %>%
      mutate(COHORT = x,
             GROUP = y) %>%
      select(COHORT, everything())
  )
)
Does anyone know an elegant and efficient way to do this?
Thanks!
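A note on the attempt above, offered as a sketch rather than a definitive diagnosis: group_names is an unnamed list, so names(group_names) returns NULL and the inner lapply has nothing to iterate over. Separately, group_by(COOL, y) does not look up the column whose name is stored in y. One possible repair, assuming the .data pronoun from dplyr, is shown below. The column names GROUP_VAR and GROUP_VAL are taken from the desired output, and the object name combined is only illustrative.

```r
library(dplyr)

# Same example data as in the question.
dat <- list(
  seniors = data.frame(NAME = c("Cletus", "Agnes", "Hank", "Sue", "Maude"),
                       COOL = c(0, 1, 1, 0, 1),
                       GENDER = c("Male", "Female", "Male", "Female", "Female"),
                       RACE = c("B", "B", "W", "W", "B")),
  juniors = data.frame(NAME = c("Chester", "Chuck", "Bruce", "Carmen", "Cleo"),
                       COOL = c(1, 1, 1, 0, 1),
                       GENDER = c("Male", "Male", "Male", "Female", "Female"),
                       RACE = c("W", "W", "B", "W", "W"))
)

# A plain character vector, iterated over directly (no names() call).
group_names <- c("GENDER", "RACE")

results <- lapply(names(dat), function(x) {
  lapply(group_names, function(y) {
    dat[[x]] %>%
      # .data[[y]] refers to the column whose name is held in y.
      count(COOL, GROUP_VAL = as.character(.data[[y]]), name = "TOTAL") %>%
      mutate(COHORT = x, GROUP_VAR = y) %>%
      select(COHORT, COOL, GROUP_VAR, GROUP_VAL, TOTAL)
  })
})

# Flatten the two-level list, then stack the pieces into one table.
combined <- bind_rows(unlist(results, recursive = FALSE))
print(combined)
```

Adding another grouping variable then only requires extending group_names.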
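Answer
You can use tibble::enframe() to convert the list of data frames into a single data frame, to which the grouping can then be applied. Specify the grouping variables by name in dplyr::count():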
library(dplyr)
library(tidyr)
library(tibble)
dat %>%
enframe("COHORT", "data") %>%
unnest(data) %>%
count(COHORT, COOL, GENDER, name="TOTAL")
# A tibble: 7 x 4
COHORT COOL GENDER TOTAL
<chr> <dbl> <fct> <int>
1 juniors 0 Female 1
2 juniors 1 Female 1
3 juniors 1 Male 3
4 seniors 0 Female 1
5 seniors 0 Male 1
6 seniors 1 Female 2
7 seniors 1 Male 1
Does this answer your question?
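As an aside, the stacking step can probably also be done with dplyr::bind_rows() and its .id argument, which stores the list names in a new column. This is a minimal sketch and not part of the original answer. stringsAsFactors = FALSE is an assumption added here so the text columns stay character regardless of R version, and the names stacked and totals are illustrative.

```r
library(dplyr)

# Same example data as in the question; stringsAsFactors = FALSE is an
# added assumption to keep text columns as character in older R versions.
dat <- list(
  seniors = data.frame(NAME = c("Cletus", "Agnes", "Hank", "Sue", "Maude"),
                       COOL = c(0, 1, 1, 0, 1),
                       GENDER = c("Male", "Female", "Male", "Female", "Female"),
                       RACE = c("B", "B", "W", "W", "B"),
                       stringsAsFactors = FALSE),
  juniors = data.frame(NAME = c("Chester", "Chuck", "Bruce", "Carmen", "Cleo"),
                       COOL = c(1, 1, 1, 0, 1),
                       GENDER = c("Male", "Male", "Male", "Female", "Female"),
                       RACE = c("W", "W", "B", "W", "W"),
                       stringsAsFactors = FALSE)
)

# Stack the data frames; the list names end up in the COHORT column.
stacked <- bind_rows(dat, .id = "COHORT")

# count() is shorthand for group_by() followed by summarise(n()).
totals <- count(stacked, COHORT, COOL, GENDER, name = "TOTAL")
print(totals)
```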
=========================================>
Based on @DJC's comment, here is a more suitable solution:
dat %>%
enframe("COHORT", "data") %>%
unnest(data) %>%
gather(GROUP_VAR, GROUP_VAL, GENDER, RACE) %>%
count(COHORT, COOL, GROUP_VAR, GROUP_VAL, name="TOTAL")
# A tibble: 14 x 5
COHORT COOL GROUP_VAR GROUP_VAL TOTAL
<chr> <dbl> <chr> <chr> <int>
1 juniors 0 GENDER Female 1
2 juniors 0 RACE W 1
3 juniors 1 GENDER Female 1
4 juniors 1 GENDER Male 3
5 juniors 1 RACE B 1
6 juniors 1 RACE W 3
7 seniors 0 GENDER Female 1
8 seniors 0 GENDER Male 1
9 seniors 0 RACE B 1
10 seniors 0 RACE W 1
11 seniors 1 GENDER Female 2
12 seniors 1 GENDER Male 1
13 seniors 1 RACE B 2
14 seniors 1 RACE W 1
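Since tidyr now documents gather() as superseded by pivot_longer(), the same reshaping can be sketched with the newer function. This is a variant under assumptions, not the answerer's code: the vector group_vars and the result name long_totals are hypothetical, and stringsAsFactors = FALSE is added so that GENDER and RACE share a character type when combined into one column.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Same example data as in the question; stringsAsFactors = FALSE is an
# added assumption so GENDER and RACE can share one character column.
dat <- list(
  seniors = data.frame(NAME = c("Cletus", "Agnes", "Hank", "Sue", "Maude"),
                       COOL = c(0, 1, 1, 0, 1),
                       GENDER = c("Male", "Female", "Male", "Female", "Female"),
                       RACE = c("B", "B", "W", "W", "B"),
                       stringsAsFactors = FALSE),
  juniors = data.frame(NAME = c("Chester", "Chuck", "Bruce", "Carmen", "Cleo"),
                       COOL = c(1, 1, 1, 0, 1),
                       GENDER = c("Male", "Male", "Male", "Female", "Female"),
                       RACE = c("W", "W", "B", "W", "W"),
                       stringsAsFactors = FALSE)
)

# Hypothetical vector of the grouping columns to tabulate.
group_vars <- c("GENDER", "RACE")

long_totals <- dat %>%
  enframe("COHORT", "data") %>%
  unnest(data) %>%
  # One row per person per grouping variable.
  pivot_longer(all_of(group_vars),
               names_to = "GROUP_VAR", values_to = "GROUP_VAL") %>%
  count(COHORT, COOL, GROUP_VAR, GROUP_VAL, name = "TOTAL")
print(long_totals)
```

Keeping the column names in a vector means another grouping variable can be added in one place.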
Another answer
Here are two ways to solve the problem: