Using the dplyr mutate function to create a new variable conditionally based on the current row

Posted 2023-03-24


【Title】: Using dplyr mutate function to create new variable conditionally based on current row 【Posted】: 2020-09-21 08:11:55 【Question】:

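I am creating conditional averages for a large dataset of weekly flu case counts spanning several years. The data are organized as follows:

What I want to do is create a new column that holds the average number of cases for the same week in previous years. For example, for the row where Week.Number is 1 and Flu.Year is 2017, I want the new column to give the average count over rows with Week.Number==1 & Flu.Year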

chcc %>%
  mutate(average = case_when(
    Flu.Year == 2016 ~ mean(chcc$count[chcc$Flu.Year == 2016]),
    Flu.Year == 2017 ~ mean(chcc$count[chcc$Flu.Year == 2017]),
    Flu.Year == 2018 ~ mean(chcc$count[chcc$Flu.Year == 2018]),
    Flu.Year == 2019 ~ mean(chcc$count[chcc$Flu.Year == 2019])
  ))

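However, with 4 years of data * 52 weeks, spelling out every condition this way would take a huge number of cases. Is there a way to code this elegantly in dplyr? The problem I keep running into is that I want to pull values from the count column in other rows, selected by their Week.Number and Flu.Year relative to the current row's Week.Number and Flu.Year, and I don't know how to do that. Please let me know if I can provide more information or details.

Thanks, Steven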

dat <- tibble(
  Flu.Year = rep(2016:2019, each = 52),
  Week.Number = rep(1:52, 4),
  count = sample(1000, size = 52 * 4, replace = TRUE)
)

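【Question comments】:

Please do not post images of code/data/errors: they cannot be copied or searched (SEO), they break screen readers, and they may not fit well on some mobile devices. Ref: meta.***.com/a/285557 (and xkcd.com/2116). Please include the code, console output, or data directly (e.g., dput(head(x)) or data.frame(...)). Some good references for writing a self-contained, reproducible question: ***.com/q/5963269, minimal reproducible example, and ***.com/tags/r/info.

【Answer 1】: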

It is bad form, and in some cases simply wrong, to use $-indexing inside dplyr verbs. I think a better way to get the average field is to group_by(Flu.Year) and compute it directly.

library(dplyr)
set.seed(42)
dat <- tibble(
  Flu.Year = sample(2016:2020, size=100, replace=TRUE),
  count = sample(1000, size=100, replace=TRUE)
)

dat %>%
  group_by(Flu.Year) %>%
  mutate(average = mean(count)) %>%
  # just to show a quick summary
  slice(1:3) %>%
  ungroup()
# # A tibble: 15 x 3
#    Flu.Year count average
#       <int> <int>   <dbl>
#  1     2016   734    578.
#  2     2016   356    578.
#  3     2016   411    578.
#  4     2017   217    436.
#  5     2017   453    436.
#  6     2017   920    436.
#  7     2018   963    558 
#  8     2018   609    558 
#  9     2018   536    558 
# 10     2019   943    543.
# 11     2019   740    543.
# 12     2019   536    543.
# 13     2020   627    494.
# 14     2020   218    494.
# 15     2020   389    494.
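
To illustrate the point about $-indexing, here is a small sketch on made-up data (`toy`, `g` and `x` are hypothetical names, not from the question). Inside a grouped mutate(), `toy$x` still refers to the entire column, whereas the bare column name `x` refers only to the rows of the current group.

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- tibble(g = c("a", "a", "b"), x = c(1, 3, 11))

res <- toy %>%
  group_by(g) %>%
  mutate(
    with_dollar = mean(toy$x),  # whole column: the grouping is ignored
    with_bare   = mean(x)       # current group only
  ) %>%
  ungroup()

res$with_dollar  # 5 5 5
res$with_bare    # 2 2 11
```

The `$` version returns the overall mean on every row, which is why the question's case_when() needed an explicit subset for each year.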

Another approach is to build a summary table (just one row per year) and join it back onto the original data.

dat %>%
  group_by(Flu.Year) %>%
  summarize(average = mean(count))
# # A tibble: 5 x 2
#   Flu.Year average
#      <int>   <dbl>
# 1     2016    578.
# 2     2017    436.
# 3     2018    558 
# 4     2019    543.
# 5     2020    494.

dat %>%
  group_by(Flu.Year) %>%
  summarize(average = mean(count)) %>%
  full_join(dat, by = "Flu.Year")
# # A tibble: 100 x 3
#    Flu.Year average count
#       <int>   <dbl> <int>
#  1     2016    578.   734
#  2     2016    578.   356
#  3     2016    578.   411
#  4     2016    578.   720
#  5     2016    578.   851
#  6     2016    578.   822
#  7     2016    578.   465
#  8     2016    578.   679
#  9     2016    578.    30
# 10     2016    578.   180
# # ... with 90 more rows

Result after the discussion in chat:

tibble( Flu.Year = rep(2016:2018,each = 3), Week.Number = rep(1:3,3), count = 1:9 )  %>%
  arrange(Flu.Year, Week.Number) %>%
  group_by(Week.Number) %>%
  mutate(year_week.average = lag(cumsum(count) / seq_along(count)))
# # A tibble: 9 x 4
# # Groups:   Week.Number [3]
#   Flu.Year Week.Number count year_week.average
#      <int>       <int> <int>             <dbl>
# 1     2016           1     1              NA  
# 2     2016           2     2              NA  
# 3     2016           3     3              NA  
# 4     2017           1     4               1  
# 5     2017           2     5               2  
# 6     2017           3     6               3  
# 7     2018           1     7               2.5
# 8     2018           2     8               3.5
# 9     2018           3     9               4.5
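
As a sanity check on the lagged running mean, the same quantity can be computed by brute force straight from its definition: for each row, the mean of count over rows with the same Week.Number and a strictly earlier Flu.Year. This is only a sketch; `small`, `fast`, `prev_mean` and `check` are hypothetical names, and the row-by-row version is meant for verification on small data rather than for the full dataset.

```r
library(dplyr)

small <- tibble(
  Flu.Year    = rep(2016:2018, each = 3),
  Week.Number = rep(1:3, 3),
  count       = 1:9
)

# Vectorised version: lagged running mean within each week
fast <- small %>%
  arrange(Flu.Year, Week.Number) %>%
  group_by(Week.Number) %>%
  mutate(year_week.average = lag(cumsum(count) / seq_along(count))) %>%
  ungroup()

# Brute force: apply the definition one row at a time
prev_mean <- function(week, year, data) {
  earlier <- data$count[data$Week.Number == week & data$Flu.Year < year]
  if (length(earlier) == 0) NA_real_ else mean(earlier)
}
check <- mapply(prev_mean, fast$Week.Number, fast$Flu.Year,
                MoreArgs = list(data = small))

# The two approaches should agree, including the NA rows for the first year
all.equal(fast$year_week.average, check)
```

The grouped version relies on the rows being sorted by Flu.Year before the cumulative sum is taken, which is what the arrange() call ensures.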

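【Comments】:

Thanks for the suggestion. However, I'm still unsure how to compute each week's average based on the same week in previous years (e.g., the average volume for week 1 in 2016, 2017, etc.).

Have you used group_by before? It may be as simple as changing this code to use group_by(Flu.Year, Week) to get the average for each week of a year. If not, it would be useful to include usable sample data in your question.

I considered using group_by, since a summary table would give me each week's average across the years, but that still isn't what I'm looking for. I need each week's average over the previous years, so each row would have a different value. Some condition is still needed, where Flu.Year

Okay. More to the point ... please provide sample data, and add your expected output given that sample data. Here's a way to set it up: dat

【Answer 2】: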

We can use aggregate from base R:

aggregate(count ~ Flu.Year, data = dat, FUN = mean)
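
aggregate() returns one row per year. If the goal is a new column on the original rows, as with group_by() plus mutate(), base R's ave() is one option. The following is a minimal sketch, assuming a data frame `dat` with Flu.Year, Week.Number and count columns like the question's sample data; the column names `average` and `prev_week_avg` and the helper `lagged_running_mean` are made up for illustration.

```r
set.seed(1)
dat <- data.frame(
  Flu.Year    = rep(2016:2019, each = 52),
  Week.Number = rep(1:52, 4),
  count       = sample(1000, size = 52 * 4, replace = TRUE)
)

# Per-year mean, repeated on every row of that year
dat$average <- ave(as.numeric(dat$count), dat$Flu.Year, FUN = mean)

# Mean of the same week over strictly earlier years:
# sort by year first, then lag the running mean within each week
dat <- dat[order(dat$Flu.Year, dat$Week.Number), ]
lagged_running_mean <- function(x) c(NA, head(cumsum(x) / seq_along(x), -1))
dat$prev_week_avg <- ave(as.numeric(dat$count), dat$Week.Number,
                         FUN = lagged_running_mean)

head(dat)
```

The first year has no earlier years to average over, so its `prev_week_avg` values are NA.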
