Get percentage values across multiple columns based on factors given a group by function in R

【Posted】: 2021-12-03 06:40:00 【Question】:

I have a df that covers only one ID and its respective assets:

ID | Asset | CONF_1   | CONF_2   | CONF_3
1  | A     | PERFECT  | HIGH     | LOW
1  | B     | PERFECT  | LOW      | LOW
1  | C     | LOW      | HIGH     | VERY LOW
1  | D     | NA       | MEDIUM   | MEDIUM
1  | E     | MEDIUM   | MEDIUM   | PERFECT
1  | F     | MEDIUM   | VERY LOW | NA
1  | G     | VERY LOW | VERY LOW | VERY LOW
1  | H     | NA       | PERFECT  | HIGH

The goal is to reorganize the df so that, for each ID, I get the percentage of each confidence level (PERFECT, HIGH, MEDIUM, etc.) within each of the three CONF fields.

Desired output

ID | CONFIDENCE | CONF_1 % | CONF_2 % | CONF_3 %
1  | PERFECT    | 25 %     | 12.5 %   | 12.5 %
1  | HIGH       | 0 %      | 25 %     | 12.5 %
1  | MEDIUM     | 25 %     | 25 %     | 12.5 %
1  | LOW        | 12.5 %   | 12.5 %   | 25 %
1  | VERY LOW   | 12.5 %   | 25 %     | 25 %
1  | NA         | 25 %     | 0 %      | 12.5 %

【Comments】:

I believe so, @akrun - the denominator is 8. I just counted how many times each conf level occurs out of the 8 rows.

Please see the solution I posted.

【Answer 1】:

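Group by 'ID', then summarise across the 'CONF' columns: convert each column to a factor with the levels specified in the desired order, get the frequency counts with table (including NA), and convert the counts to percentages with proportions.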

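library(dplyr)

df1 %>% 
   group_by(ID) %>% 
   # lvls fixes the row order; the trailing NA labels the missing-value row
   summarise(lvls = c("PERFECT", "HIGH", "MEDIUM", "LOW", "VERY LOW", NA), 
    across(starts_with("CONF"), 
    # count every level (plus NA), then turn the counts into percentages
    ~ 100 * proportions(table(factor(., levels = na.omit(lvls)), 
        useNA = "always"))), .groups = 'drop') %>%
   rename(CONFIDENCE = lvls)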

-Output

# A tibble: 6 × 5
     ID CONFIDENCE CONF_1  CONF_2  CONF_3 
  <int> <chr>      <table> <table> <table>
1     1 PERFECT    25.0    12.5    12.5   
2     1 HIGH        0.0    25.0    12.5   
3     1 MEDIUM     25.0    25.0    12.5   
4     1 LOW        12.5    12.5    25.0   
5     1 VERY LOW   12.5    25.0    25.0   
6     1 <NA>       25.0     0.0    12.5   
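
With dplyr 1.1.0 or later, a summarise() call that returns several rows per group raises a deprecation warning, and reframe() is the documented replacement. The sketch below is one possible adaptation, not the answerer's code. It assumes dplyr >= 1.1.0 and R >= 4.0.1 (for proportions()). The helper pct() and the result name res_reframe are made up for illustration, lvls is kept as a standalone vector, and as.numeric() is added so that the CONF columns come back as plain doubles rather than table objects.

```r
library(dplyr)

# Example data from the question
df1 <- data.frame(
  ID = rep(1L, 8),
  Asset = LETTERS[1:8],
  CONF_1 = c("PERFECT", "PERFECT", "LOW", NA, "MEDIUM", "MEDIUM", "VERY LOW", NA),
  CONF_2 = c("HIGH", "LOW", "HIGH", "MEDIUM", "MEDIUM", "VERY LOW", "VERY LOW", "PERFECT"),
  CONF_3 = c("LOW", "LOW", "VERY LOW", "MEDIUM", "PERFECT", NA, "VERY LOW", "HIGH")
)

# Confidence levels in the desired display order (NA is appended separately)
lvls <- c("PERFECT", "HIGH", "MEDIUM", "LOW", "VERY LOW")

# Hypothetical helper: percentage of each level, with NA counted as the last entry
pct <- function(x) {
  as.numeric(100 * proportions(table(factor(x, levels = lvls), useNA = "always")))
}

res_reframe <- df1 %>%
  group_by(ID) %>%
  # reframe() may return any number of rows per group (here: 6 per ID)
  reframe(CONFIDENCE = c(lvls, NA),
          across(starts_with("CONF_"), pct))

print(res_reframe)
```

The selector is narrowed to starts_with("CONF_") so that it cannot pick up the newly created CONFIDENCE column.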

--

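Another option is to reshape to 'long' format with pivot_longer, do a count, convert the counts to percentages within each ID and CONF column, and reshape back to 'wide' format with pivot_wider.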

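library(tidyr)

df1 %>% 
   select(-Asset) %>% 
   # one row per (ID, CONF column, confidence value)
   pivot_longer(cols = starts_with("CONF"), values_to = 'CONFIDENCE') %>% 
   count(ID, name, CONFIDENCE) %>%
   group_by(ID, name) %>%
   # convert counts to percentages within each ID and CONF column
   mutate(n = 100 * n / sum(n)) %>%
   ungroup() %>%
   # levels that never occur in a column are filled with 0
   pivot_wider(names_from = name, values_from = n, values_fill = 0)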

-Output

# A tibble: 6 × 5
     ID CONFIDENCE CONF_1 CONF_2 CONF_3
  <int> <chr>       <dbl>  <dbl>  <dbl>
1     1 LOW          12.5   12.5   25  
2     1 MEDIUM       25     25     12.5
3     1 PERFECT      25     12.5   12.5
4     1 VERY LOW     12.5   25     25  
5     1 <NA>         25      0     12.5
6     1 HIGH          0     25     12.5
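
The rows above come back in order of first appearance rather than from PERFECT down to VERY LOW. One way to impose a fixed order within each ID is to sort on a factor built from the desired levels, as in the sketch below. The names lvls and res_sorted are made up for illustration. arrange() places NA last, so the missing-value row ends up at the bottom of each ID.

```r
library(dplyr)
library(tidyr)

# Example data from the question
df1 <- data.frame(
  ID = rep(1L, 8),
  Asset = LETTERS[1:8],
  CONF_1 = c("PERFECT", "PERFECT", "LOW", NA, "MEDIUM", "MEDIUM", "VERY LOW", NA),
  CONF_2 = c("HIGH", "LOW", "HIGH", "MEDIUM", "MEDIUM", "VERY LOW", "VERY LOW", "PERFECT"),
  CONF_3 = c("LOW", "LOW", "VERY LOW", "MEDIUM", "PERFECT", NA, "VERY LOW", "HIGH")
)

# Desired display order of the confidence levels
lvls <- c("PERFECT", "HIGH", "MEDIUM", "LOW", "VERY LOW")

res_sorted <- df1 %>%
  select(-Asset) %>%
  pivot_longer(cols = starts_with("CONF"), values_to = "CONFIDENCE") %>%
  count(ID, name, CONFIDENCE) %>%
  group_by(ID, name) %>%
  mutate(n = 100 * n / sum(n)) %>%
  ungroup() %>%
  pivot_wider(names_from = name, values_from = n, values_fill = 0) %>%
  # sort by ID, then by the position of CONFIDENCE in lvls (NA sorts last)
  arrange(ID, factor(CONFIDENCE, levels = lvls))

print(res_sorted)
```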

Data

df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), Asset = c("A", 
"B", "C", "D", "E", "F", "G", "H"), CONF_1 = c("PERFECT", "PERFECT", 
"LOW", NA, "MEDIUM", "MEDIUM", "VERY LOW", NA), CONF_2 = c("HIGH", 
"LOW", "HIGH", "MEDIUM", "MEDIUM", "VERY LOW", "VERY LOW", "PERFECT"
), CONF_3 = c("LOW", "LOW", "VERY LOW", "MEDIUM", "PERFECT", 
NA, "VERY LOW", "HIGH")), class = "data.frame", row.names = c(NA, 
-8L))

【Comments】:

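Great - I started out with pivot_wider, but this is awesome. Thanks. Is there a way to sort by ID and then by CONFIDENCE level, so that every ID has the same order of CONFIDENCE levels and the order restarts for each ID?

@Dinho You can specify arrange(ID, factor(CONFIDENCE, levels = c("PERFECT", "HIGH", "MEDIUM", "LOW", "VERY LOW"))) in between.

【Answer 2】: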

A solution based on reshape2:

library(dplyr)
library(reshape2)

# df1: the example data frame defined above
df1 %>% 
  melt(id.vars="ID", measure.vars=paste0("CONF_",1:3), variable.name="X") %>% 
  dcast(ID + X ~ value, fun.aggregate = length) %>% 
  melt(id.vars=c("ID","X"), measure.vars=3:ncol(.)  ) %>% 
  dcast(ID+variable ~ X) %>% 
  group_by(ID) %>% 
  mutate(across(starts_with("CONF_"), ~ .x*100 /sum(.x))) %>% 
  rename(CONFIDENCE=variable) %>% 
  arrange(ID,CONFIDENCE)

【Comments】:
