Frequency of a variable by month

Posted 2023-02-14

【Title】: Frequency of a variable by month 【Posted】: 2022-01-14 11:05:25 【Question】:

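I am trying to find the frequency of colors by month. I want to make a line chart showing the percentage of each color for each month. Here is my data: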


ID    color1   color2  color3  date
55    red     blue     NA     2020-03-15
67    yellow  NA       NA     2020-05-02
83    blue    yellow   NA     2020-05-17
78    red     yellow   blue   2020-05-15  
43    green   NA       NA     2021-01-27
29    yellow  green    NA     2021-01-03
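
For anyone who wants to reproduce the problem, the table above can be entered as a data frame. This is only a sketch: the name `df` is assumed because the answer's code uses it, and the column types (numeric ID, character colors, Date) are a guess at how the original data is stored.

```r
# Sample data copied from the question; `df` is an assumed name.
df <- data.frame(
  ID     = c(55, 67, 83, 78, 43, 29),
  color1 = c("red", "yellow", "blue", "red", "green", "yellow"),
  color2 = c("blue", NA, "yellow", "yellow", NA, "green"),
  color3 = c(NA, NA, NA, "blue", NA, NA),
  date   = as.Date(c("2020-03-15", "2020-05-02", "2020-05-17",
                     "2020-05-15", "2021-01-27", "2021-01-03")),
  stringsAsFactors = FALSE
)

str(df)  # 6 rows, 5 columns; `date` is stored as a Date
```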

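I need something like this to plot the chart. The denominator should be the number of articles in that month, so if an ID has more than one color (for example, the ID in 03/2020 is both blue and red), the percentages for a month can add up to more than 100.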


Month     n  freq_blue freq_red  freq_yellow  freq_green %_blue %_red   %_yellow %_green
03-2020   1    1        1          0           0            100  100     0       0
04-2020   0    0        0          0           0            0     0      0       0
05-2020   3    2        1          3           0            66.7  33.3   100     0
06-2020   0    0        0          0           0            0     0      0       0
07-2020   0    0        0          0           0            0     0      0       0
08-2020   0    0        0          0           0            0     0      0       0
09-2020   0    0        0          0           0            0     0      0       0
10-2020   0    0        0          0           0            0     0      0       0
11-2020   0    0        0          0           0            0     0      0       0
12-2020   0    0        0          0           0            0     0      0       0
01-2021   2    0        0          1           2            0     0      50     100
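
The 05-2020 row of the table above can be checked by hand. The sketch below uses base R only and hard-codes the three May 2020 rows from the sample data; the names `may_2020`, `n_ids`, `freq` and `pct` are made up for the illustration.

```r
# Colors recorded for each ID dated May 2020 in the sample data.
may_2020 <- list(
  "67" = c("yellow"),
  "83" = c("blue", "yellow"),
  "78" = c("red", "yellow", "blue")
)

n_ids <- length(may_2020)                          # denominator: 3 IDs
freq  <- table(unlist(may_2020, use.names = FALSE)) # blue 2, red 1, yellow 3
pct   <- round(100 * freq / n_ids, 1)              # blue 66.7, red 33.3, yellow 100

# The percentages add up to about 200, not 100,
# because a single ID can carry several colors.
sum(pct)
```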

【Comments】:

What have you tried so far? Some code would make it clearer what exactly you want to do.

df$date <- ... %>% mutate(month = month(date), year = year(date)) df2 <- ... %>% group_by(month, year) %>% mutate(count = length(unique(PMID))) df2 <- ... %>% pivot_longer(cols = starts_with("color")) %>% filter(!is.na(value)) %>% group_by(month, year, value) %>% count() %>% group_by(month, year) %>% mutate(percent = n/count) %>% ungroup() %>% complete(year, month = 1:12, value = c("blue", "red", "yellow", "green"), fill = list(n = 0, percent = 0)) %>% pivot_wider(id_cols = c(month, year), names_from = value, values_from = c(n, percent)) This is what I have tried so far; too many characters for one line, sorry.

You can edit the question to include your code; that would make it easier to understand.

【Answer 1】:

As suggested, it would be helpful if you edited your original post/question to include your code rather than putting it in the comments below.

Based on what you have (and your previous question), this may help.

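Consider creating a new column, month_total, to use as the denominator in the percentage calculation. It looks like you want the number of distinct IDs in a given month (it is not clear whether a color can appear more than once in a single row).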

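Once the frequencies and percentages are computed, use complete to fill in the missing months and colors. You can also use fill to carry each month's total onto the added rows.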
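
To see what complete and fill each contribute, here is a toy sketch separate from the real data. It assumes dplyr and tidyr are installed; the tibble `observed` and the result `filled` are hypothetical names, and only two colors are used to keep it short.

```r
library(dplyr)
library(tidyr)

# Toy summary with a single observed (month, color) pair.
observed <- tibble(
  year = 2020, month = 3, value = "blue",
  month_total = 1, freq = 1, percent = 100
)

filled <- observed %>%
  # Add every month/color combination. New rows get freq = 0 and percent = 0;
  # month_total stays NA because it is not listed in `fill`.
  complete(year, month = 1:12, value = c("blue", "red"),
           fill = list(freq = 0, percent = 0)) %>%
  group_by(year, month) %>%
  # Within each month, copy the known total onto the rows complete() added.
  fill(month_total, .direction = "updown") %>%
  ungroup() %>%
  # Months with no data at all still have an NA total; set it to 0.
  replace_na(list(month_total = 0))

filled
```

The grouping before fill matters: without it, March's total would spill into the neighbouring empty months.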

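library(tidyverse)
library(lubridate)

# Make sure `date` is a Date so month() and year() work
df$date <- as.Date(df$date)

df %>%
  mutate(month = month(date), year = year(date)) %>%
  # One row per ID/color pair, dropping the empty color slots
  pivot_longer(cols = starts_with("color")) %>%
  filter(!is.na(value)) %>%
  # Denominator: number of distinct IDs in each month
  group_by(month, year) %>%
  mutate(month_total = n_distinct(ID)) %>%
  # Count and percentage per color within each month
  group_by(value, month_total, .add = TRUE) %>%
  summarise(freq = n(), percent = freq / month_total[1] * 100, .groups = "drop") %>%
  # Add the missing month/color combinations with zero counts
  complete(year, month = 1:12, value = c("blue", "red", "yellow", "green"), fill = list(freq = 0, percent = 0)) %>%
  # Copy each month's total onto the rows that complete() added
  group_by(year, month) %>%
  fill(month_total, .direction = "updown") %>%
  pivot_wider(id_cols = c(month, year, month_total), names_from = value, values_from = c(freq, percent)) %>%
  # Months with no data have no total; set it to 0
  replace_na(list(month_total = 0))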

Output

   month  year month_total freq_blue freq_green freq_red freq_yellow percent_blue percent_green percent_red percent_yellow
   <dbl> <dbl>       <dbl>     <dbl>      <dbl>    <dbl>       <dbl>        <dbl>         <dbl>       <dbl>          <dbl>
 1     1  2020           0         0          0        0           0          0               0         0                0
 2     2  2020           0         0          0        0           0          0               0         0                0
 3     3  2020           1         1          0        1           0        100               0       100                0
 4     4  2020           0         0          0        0           0          0               0         0                0
 5     5  2020           3         2          0        1           3         66.7             0        33.3            100
 6     6  2020           0         0          0        0           0          0               0         0                0
 7     7  2020           0         0          0        0           0          0               0         0                0
 8     8  2020           0         0          0        0           0          0               0         0                0
 9     9  2020           0         0          0        0           0          0               0         0                0
10    10  2020           0         0          0        0           0          0               0         0                0
# … with 14 more rows
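
The question's end goal is a line chart of the percentage per color per month, which the code above does not draw. A minimal sketch of that step, assuming ggplot2 is installed: to stay self-contained it rebuilds the percentages from the question's expected table (03-2020 to 01-2021) in long format rather than reusing the pipeline result, and the names `grid`, `known`, `plot_df` and `p` are hypothetical.

```r
library(ggplot2)

# Every month/color combination from 03-2020 to 01-2021.
months <- seq(as.Date("2020-03-01"), as.Date("2021-01-01"), by = "month")
grid <- expand.grid(month = months,
                    color = c("blue", "red", "yellow", "green"),
                    stringsAsFactors = FALSE)

# Non-zero percentages copied from the question's expected table.
known <- data.frame(
  month   = as.Date(c("2020-03-01", "2020-03-01", "2020-05-01", "2020-05-01",
                      "2020-05-01", "2021-01-01", "2021-01-01")),
  color   = c("blue", "red", "blue", "red", "yellow", "yellow", "green"),
  percent = c(100, 100, 66.7, 33.3, 100, 50, 100),
  stringsAsFactors = FALSE
)

# Left join on month and color; combinations without data become 0.
plot_df <- merge(grid, known, all.x = TRUE)
plot_df$percent[is.na(plot_df$percent)] <- 0

# One line per color, one point per month.
p <- ggplot(plot_df, aes(x = month, y = percent, colour = color, group = color)) +
  geom_line() +
  geom_point() +
  scale_x_date(date_breaks = "1 month", date_labels = "%m-%Y") +
  labs(x = "Month", y = "% of IDs", colour = "Color")

print(p)
```

With the real data, the wide result would first need pivot_longer on the percent_ columns and a date built from year and month to reach the same long shape.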
