用 R 中的多列按组计算百分比

Posted

技术标签:

【中文标题】用 R 中的多列按组计算百分比【英文标题】:Calculate Percentage by Group with multiple columns in R 【发布时间】:2022-01-21 21:40:31 【问题描述】:

我有几个包含月度数据的数据框,我想找到每种产品和每个月的百分比分布。我有几个月的多列问题。目前只能按组获取1个月的百分比。

data <- data.frame(group = rep(LETTERS[1:3], each = 4),  
                   Product = letters[1:4],
                   January = sample(1:100,12),
                   February = sample(1:100,12))

data_new1 <- transform(data,                             
                       perc = ave(January,
                                  group,
                                  FUN = prop.table))
data_new1$perc<-round(data_new1$perc, 2)

> data_new1
   group Product January February perc
1      A       a      12       16 0.05
2      A       b      73       75 0.32
3      A       c      78       11 0.34
4      A       d      65       35 0.29
5      B       a      86       63 0.36
6      B       b      33       71 0.14
7      B       c      92       49 0.38
8      B       d      30       60 0.12
9      C       a      91       59 0.37
10     C       b      31       45 0.12
11     C       c      99        7 0.40
12     C       d      28       50 0.11

【问题讨论】:

首先将您的数据从宽转换为长。 【参考方案1】:

tidyverse

library(dplyr)
data %>% 
  group_by(group) %>% 
  mutate(across(c("January", "February"), proportions, .names = ".col_perc")) %>% 
  ungroup()

# A tibble: 12 x 6
   group Product January February January_perc February_perc
   <chr> <chr>     <int>    <int>        <dbl>         <dbl>
 1 A     a            49       40      0.426          0.252 
 2 A     b             1        3      0.00870        0.0189
 3 A     c            19       50      0.165          0.314 
 4 A     d            46       66      0.4            0.415 
 5 B     a            61       82      0.218          0.285 
 6 B     b            88       51      0.314          0.177 
 7 B     c            32       75      0.114          0.260 
 8 B     d            99       80      0.354          0.278 
 9 C     a             6       31      0.0397         0.373 
10 C     b             8        5      0.0530         0.0602
11 C     c            92       20      0.609          0.241 
12 C     d            45       27      0.298          0.325 

基础

data <- data.frame(group = rep(LETTERS[1:3], each = 4),  
                   Product = letters[1:4],
                   January = sample(1:100,12),
                   February = sample(1:100,12))

tmp <- sapply(c("January", "February"), function (x) ave(data[[x]], data$group, FUN = prop.table))

colnames(tmp) <- paste0(colnames(tmp), "_perc")

res <- cbind(data, tmp)

res
#>    group Product January February January_perc February_perc
#> 1      A       a      42       73   0.18260870   0.238562092
#> 2      A       b      67       92   0.29130435   0.300653595
#> 3      A       c      58       90   0.25217391   0.294117647
#> 4      A       d      63       51   0.27391304   0.166666667
#> 5      B       a      48       15   0.21621622   0.081521739
#> 6      B       b      16       82   0.07207207   0.445652174
#> 7      B       c      80       75   0.36036036   0.407608696
#> 8      B       d      78       12   0.35135135   0.065217391
#> 9      C       a      81       16   0.32793522   0.117647059
#> 10     C       b      83       81   0.33603239   0.595588235
#> 11     C       c      11        1   0.04453441   0.007352941
#> 12     C       d      72       38   0.29149798   0.279411765

由reprex package (v2.0.1) 于 2021 年 12 月 20 日创建

data.table

library(data.table)

COLS <- c("January", "February")
COLS_RES <- paste0(COLS, "_perc")
setDT(data)[, (COLS_RES) := lapply(.SD, proportions), by = group, .SDcol = COLS][]

【讨论】:

【参考方案2】:

如果您的数据以tidy 方式构建,这些计算会更容易。在您的情况下,JanuaryFebruary 可能应该是一个名为 month 或其他名称的变量。

示例:

在下面,我使用tidyr::pivot_longer()JanuaryFebruary 合并为一列。然后我使用包 dplyr 对数据框进行分组并计算perc。我没有使用prop.table(),但我相信你只是想要观察到的比例占该组和月份的总数。


library(dplyr)
library(tidyr)

# To make the sampling underneath reproducable
set.seed(1)

data <- data.frame(
  group = rep(LETTERS[1:3], each = 4),  
  Product = letters[1:4],
  January = sample(1:100,12),
  February = sample(1:100,12)
)

data %>% 
  pivot_longer(c(January, February), names_to = "month", values_to = "x") %>% 
  group_by(group, month) %>%
  mutate(
    perc = round(x/sum(x), 2)
  )

我希望这就是你要找的。​​p>

【讨论】:

【参考方案3】:

另一个dplyr解决方案:

library(dplyr)
data %>% 
  group_by(group) %>% 
  mutate(across(c(2:5), 
                ~./sum(.)*100, .names = ".col_pct"))
# A tibble: 12 × 10
# Groups:   group [3]
   group Product   Jan   Feb   Mar   May Jan_pct Feb_pct Mar_pct May_pct
   <chr> <chr>   <int> <int> <int> <int>   <dbl>   <dbl>   <dbl>   <dbl>
 1 A     a          14    14    95    50    8      18.4    44.4    20.9 
 2 A     b         100    33    28    32   57.1    43.4    13.1    13.4 
 3 A     c          11    16    13    95    6.29   21.1     6.07   39.7 
 4 A     d          50    13    78    62   28.6    17.1    36.4    25.9 
 5 B     a          29    42    72    13   22.0    33.9    20.3     7.07
 6 B     b           3     4    88    41    2.27    3.23   24.9    22.3 
 7 B     c          30    68    94    86   22.7    54.8    26.6    46.7 
 8 B     d          70    10   100    44   53.0     8.06   28.2    23.9 
 9 C     a           4    88    45    84    3.96   43.6    24.2    30.7 
10 C     b          52    12    26    55   51.5     5.94   14.0    20.1 
11 C     c          26    20    23    57   25.7     9.90   12.4    20.8 
12 C     d          19    82    92    78   18.8    40.6    49.5    28.5

数据:

data <- data.frame(group = rep(LETTERS[1:3], each = 4),  
                   Product = letters[1:4],
                   Jan = sample(1:100,12),
                   Feb = sample(1:100,12),
                   Mar = sample(1:100, 12),
                   May = sample(1:100, 12))

【讨论】:

为什么是across(c(2:5))?` 因为第一列group是分组变量,因此不计入

以上是关于用 R 中的多列按组计算百分比的主要内容,如果未能解决你的问题,请参考以下文章

有没有办法使用聚合命令按组计算不对称平均值(例如从百分位数 0.05 到 0.5)? R-工作室

使用 mutate_ 进行标准评估以按组计算百分比

按组占总数的百分比

ggplot barplot 按组给出百分比

Groupby 对象的百分比变化(按组)

从多列值计算百分比