Get percentage values across multiple columns based on factors given a group by function in R
【Posted】: 2021-12-03 06:40:00
【Question】: I have a df that looks at only one ID and its respective assets:
ID | Asset | CONF_1   | CONF_2   | CONF_3   |
1  | A     | PERFECT  | HIGH     | LOW      |
1  | B     | PERFECT  | LOW      | LOW      |
1  | C     | LOW      | HIGH     | VERY LOW |
1  | D     | NA       | MEDIUM   | MEDIUM   |
1  | E     | MEDIUM   | MEDIUM   | PERFECT  |
1  | F     | MEDIUM   | VERY LOW | NA       |
1  | G     | VERY LOW | VERY LOW | VERY LOW |
1  | H     | NA       | PERFECT  | HIGH     |
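The goal is to reorganize the df so that, for each ID and each of the three CONF fields, I get the percentage breakdown of every confidence level (PERFECT, HIGH, MEDIUM, etc.).
Desired output: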
ID | CONFIDENCE | CONF_1 % | CONF_2 % | CONF_3 %
1  | PERFECT    | 25 %     | 12.5 %   | 12.5 %
1  | HIGH       | 0 %      | 25 %     | 12.5 %
1  | MEDIUM     | 25 %     | 25 %     | 12.5 %
1  | LOW        | 12.5 %   | 12.5 %   | 25 %
1  | VERY LOW   | 12.5 %   | 25 %     | 25 %
1  | NA         | 25 %     | 0 %      | 12.5 %
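【Comments】:
I believe so, @akrun: the denominator is 8. I simply counted how many times each conf level occurs out of the 8 rows.
Please see the solution I posted.
【Answer 1】:
Group by "ID", then summarise across the "CONF" columns: convert each column to a factor with its levels given in the desired order, get the frequency counts with table, and turn them into proportions.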
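library(dplyr)

# Note: dplyr >= 1.1.0 warns when summarise() returns several rows per group;
# reframe() is the replacement there (it takes no .groups argument).
df1 %>%
  group_by(ID) %>%
  # lvls fixes the row order; the trailing NA labels the missing-value row
  summarise(lvls = c("PERFECT", "HIGH", "MEDIUM", "LOW", "VERY LOW", NA),
            across(starts_with("CONF"),
                   # count every level (zero counts and NA included), then convert to percent
                   ~ 100 * proportions(table(factor(., levels = na.omit(lvls)),
                                             useNA = "always"))),
            .groups = 'drop') %>%
  rename(CONFIDENCE = lvls)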
Output:
# A tibble: 6 × 5
     ID CONFIDENCE CONF_1  CONF_2  CONF_3
  <int> <chr>      <table> <table> <table>
1     1 PERFECT    25.0    12.5    12.5
2     1 HIGH        0.0    25.0    12.5
3     1 MEDIUM     25.0    25.0    12.5
4     1 LOW        12.5    12.5    25.0
5     1 VERY LOW   12.5    25.0    25.0
6     1 <NA>       25.0     0.0    12.5
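To see what the expression inside across() does, it may help to run it on a single column outside of dplyr. The following is a minimal base R sketch, assuming R >= 4.0.1 (where proportions() is available; prop.table() is the older equivalent). The names conf_1 and pct are illustrative and not part of the original answer.

```r
# CONF_1 from the question's data, retyped here so the sketch is self-contained
conf_1 <- c("PERFECT", "PERFECT", "LOW", NA, "MEDIUM", "MEDIUM", "VERY LOW", NA)

# Desired order of the confidence levels (the NA bucket is added by useNA below)
lvls <- c("PERFECT", "HIGH", "MEDIUM", "LOW", "VERY LOW")

# factor() fixes the order and keeps levels that never occur (HIGH here);
# table(..., useNA = "always") appends an NA bucket;
# proportions() divides each count by the total number of rows (8)
pct <- 100 * proportions(table(factor(conf_1, levels = lvls), useNA = "always"))
print(pct)
```

The six values come out as 25, 0, 25, 12.5, 12.5 and 25, which matches the CONF_1 column of the output above.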
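Another option is to reshape to "long" format with pivot_longer, get the counts with count, convert them to percentages within each ID and column, and reshape back to "wide" format with pivot_wider.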
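library(dplyr)
library(tidyr)

df1 %>%
  select(-Asset) %>%
  # long format: one row per ID, CONF column and value; the column name lands in `name`
  pivot_longer(cols = starts_with("CONF"), values_to = 'CONFIDENCE') %>%
  count(ID, name, CONFIDENCE) %>%
  group_by(ID, name) %>%
  # convert the counts to percentages within each ID and CONF column
  mutate(n = 100 * n / sum(n)) %>%
  ungroup() %>%
  # back to wide format; combinations that never occur are filled with 0
  pivot_wider(names_from = name, values_from = n, values_fill = 0)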
Output:
# A tibble: 6 × 5
     ID CONFIDENCE CONF_1 CONF_2 CONF_3
  <int> <chr>       <dbl>  <dbl>  <dbl>
1     1 LOW          12.5   12.5   25
2     1 MEDIUM       25     25     12.5
3     1 PERFECT      25     12.5   12.5
4     1 VERY LOW     12.5   25     25
5     1 <NA>         25      0     12.5
6     1 HIGH          0     25     12.5
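The rows above come back in the order in which count() and pivot_wider() produced them, not in the order of the confidence scale. One way to get a fixed order within each ID is to sort on a factor with explicit levels. The following is a sketch under that assumption; the names lvls and res are illustrative, and df1 is retyped from the data section so the block runs on its own.

```r
library(dplyr)
library(tidyr)

# Same content as df1 in the data section, retyped so the sketch is self-contained
df1 <- data.frame(
  ID     = rep(1L, 8),
  Asset  = LETTERS[1:8],
  CONF_1 = c("PERFECT", "PERFECT", "LOW", NA, "MEDIUM", "MEDIUM", "VERY LOW", NA),
  CONF_2 = c("HIGH", "LOW", "HIGH", "MEDIUM", "MEDIUM", "VERY LOW", "VERY LOW", "PERFECT"),
  CONF_3 = c("LOW", "LOW", "VERY LOW", "MEDIUM", "PERFECT", NA, "VERY LOW", "HIGH"),
  stringsAsFactors = FALSE
)

# Desired order of the confidence levels
lvls <- c("PERFECT", "HIGH", "MEDIUM", "LOW", "VERY LOW")

res <- df1 %>%
  select(-Asset) %>%
  pivot_longer(cols = starts_with("CONF"), values_to = "CONFIDENCE") %>%
  count(ID, name, CONFIDENCE) %>%
  group_by(ID, name) %>%
  mutate(n = 100 * n / sum(n)) %>%
  ungroup() %>%
  pivot_wider(names_from = name, values_from = n, values_fill = 0) %>%
  # sort by ID, then by position in lvls; arrange() always puts NA last
  arrange(ID, factor(CONFIDENCE, levels = lvls))

print(res)
```

Because the sort key starts with ID, the same level order restarts for every ID when the data holds more than one.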
Data
df1 <- structure(list(
  ID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
  Asset = c("A", "B", "C", "D", "E", "F", "G", "H"),
  CONF_1 = c("PERFECT", "PERFECT", "LOW", NA, "MEDIUM", "MEDIUM", "VERY LOW", NA),
  CONF_2 = c("HIGH", "LOW", "HIGH", "MEDIUM", "MEDIUM", "VERY LOW", "VERY LOW", "PERFECT"),
  CONF_3 = c("LOW", "LOW", "VERY LOW", "MEDIUM", "PERFECT", NA, "VERY LOW", "HIGH")),
  class = "data.frame", row.names = c(NA, -8L))
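【Discussion】:
Awesome. I started out with pivot_wider, but this is great. Thanks.
Is there a way to order by ID and then by CONFIDENCE level, so that every ID has the same order of CONFIDENCE levels and the order restarts for each ID?
@Dinho You can add arrange(ID, factor(CONFIDENCE, levels = c("PERFECT", "HIGH", "MEDIUM", "LOW", "VERY LOW"))) in between.
【Answer 2】:
A solution based on reshape2: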
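library(dplyr)
library(reshape2)

# df1: the data frame defined in the data section of Answer 1
df1 %>%
  # long format: one row per ID, CONF column (X) and value
  melt(id.vars = "ID", measure.vars = paste0("CONF_", 1:3), variable.name = "X") %>%
  # count the occurrences of each confidence level per ID and CONF column
  dcast(ID + X ~ value, fun.aggregate = length) %>%
  melt(id.vars = c("ID", "X"), measure.vars = 3:ncol(.)) %>%
  # one row per confidence level, one column per CONF column
  dcast(ID + variable ~ X) %>%
  group_by(ID) %>%
  # convert the counts to percentages within each ID
  mutate(across(starts_with("CONF_"), ~ .x * 100 / sum(.x))) %>%
  rename(CONFIDENCE = variable) %>%
  arrange(ID, CONFIDENCE)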