Pulling lagged data from a particular season but only for specific data sets as indicated by variable in R

【Posted】: 2022-01-17 07:20:54 【Question】:

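My original inquiry comes from this question: Pulling lagged data but only for a particular season in R

That answered my question for a single data frame; however, I now have a large aggregated data frame and need to add a line of code that accounts for each individual data set (Lake_name).

Here is my data: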

   SeasonYear       change   Lake_name
1  winter2020  0.007877245   AlanHenry
2  spring2020  0.058515310   AlanHenry
3  summer2020  0.013850687   AlanHenry
4    fall2020 -0.071774781   AlanHenry
5  winter2021 -0.040268206   AlanHenry
6  spring2021 -0.020803715   AlanHenry
7  summer2021  0.181610974   AlanHenry
8  winter2020 -0.029708916     Amistad
9  spring2020 -0.063310371     Amistad
10 summer2020 -0.054231575     Amistad
11   fall2020  0.016057252     Amistad
12 winter2021  0.011785717     Amistad
13 spring2021 -0.030677687     Amistad
14 summer2021 -0.015691720     Amistad
15 winter2020 -0.011974634 AmonGCarter
16 spring2020  0.168774234 AmonGCarter
17 summer2020 -0.041486735 AmonGCarter
18   fall2020 -0.095134974 AmonGCarter
19 winter2021 -0.030310177 AmonGCarter
20 spring2021  0.033528325 AmonGCarter

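I am trying to build a function that pulls the lag from the previous spring (see the previous post) but also takes each lake into account. I can do this if I split the data up lake by lake, but my data set is fairly large and that would take a long time. Here is the code I am trying to use (modified from the post I referenced):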

library(dplyr)
lag_spring <- function(x, y, n = 1) {
  # Only the columns x and season_year exist in this data frame
  data.frame(x = x, season_year = y) %>%
    group_by(Lake_name) %>%  # the attempted addition; this is the line that fails
    tidyr::extract(season_year, into = c("season", "year"), regex = "^(.+?)(\\d{4})$") %>%
    group_by(year) %>%
    mutate(springmean = x[season == "spring"]) %>%
    ungroup() %>%
    group_by(season) %>%
    mutate(lag = ifelse(!season %in% c("summer", "fall"), lag(springmean, n = n), lag(springmean, n = n - 1))) %>%
    ungroup() %>%
    pull(lag)
}

I tried adding group_by(Lake_name) to do this within each lake, but when I run the code:

data %>%  mutate(springlag = lag_spring(change, SeasonYear,n=1),
         springlag2 = lag_spring(change, SeasonYear,n=2),
         springlag3 = lag_spring(change, SeasonYear,n=3))

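I get this error:

Error: Problem with `mutate()` input `springlag`.
x Must group by variables found in `.data`.
* Column `Lake_name` is not found.
i Input `springlag` is `lag_spring(change, SeasonYear, n = 1)`.

Could someone help me modify the code I was given previously to get "springlag", but with a line in dplyr that does this only within each individual lake?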


【Answer 1】:

There is no need to change the function. You can get the result you want by calling group_by before the mutate that computes the lag:

library(tidyr)
library(dplyr)

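lag_spring <- function(x, y, n = 1) {
  data.frame(x = x, season_year = y) %>%
    # Split e.g. "spring2020" into season = "spring" and year = "2020"
    tidyr::extract(season_year, into = c("season", "year"), regex = "^(.+?)(\\d{4})$") %>%
    group_by(year) %>%
    # Spring value of each year, or NA when that year has no spring row
    mutate(springmean = if (any(season == "spring")) x[season == "spring"] else NA) %>%
    ungroup() %>%
    group_by(season) %>%
    # Summer and fall come after spring within the same year, so they lag by n - 1
    mutate(lag = ifelse(!season %in% c("summer", "fall"), lag(springmean, n = n), lag(springmean, n = n - 1))) %>%
    ungroup() %>%
    pull(lag)
}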


dd %>%
  group_by(Lake_name) %>%
  mutate(lag = lag_spring(change, SeasonYear))
#> # A tibble: 20 × 4
#> # Groups:   Lake_name [3]
#>    SeasonYear   change Lake_name       lag
#>    <chr>         <dbl> <chr>         <dbl>
#>  1 winter2020  0.00788 AlanHenry   NA     
#>  2 spring2020  0.0585  AlanHenry   NA     
#>  3 summer2020  0.0139  AlanHenry    0.0585
#>  4 fall2020   -0.0718  AlanHenry    0.0585
#>  5 winter2021 -0.0403  AlanHenry    0.0585
#>  6 spring2021 -0.0208  AlanHenry    0.0585
#>  7 summer2021  0.182   AlanHenry   -0.0208
#>  8 winter2020 -0.0297  Amistad     NA     
#>  9 spring2020 -0.0633  Amistad     NA     
#> 10 summer2020 -0.0542  Amistad     -0.0633
#> 11 fall2020    0.0161  Amistad     -0.0633
#> 12 winter2021  0.0118  Amistad     -0.0633
#> 13 spring2021 -0.0307  Amistad     -0.0633
#> 14 summer2021 -0.0157  Amistad     -0.0307
#> 15 winter2020 -0.0120  AmonGCarter NA     
#> 16 spring2020  0.169   AmonGCarter NA     
#> 17 summer2020 -0.0415  AmonGCarter  0.169 
#> 18 fall2020   -0.0951  AmonGCarter  0.169 
#> 19 winter2021 -0.0303  AmonGCarter  0.169 
#> 20 spring2021  0.0335  AmonGCarter  0.169
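
Why grouping outside the helper is enough: lag_spring() only receives the vectors x and y, so a column such as Lake_name is never available inside it, but a grouped mutate() evaluates its expressions once per group and hands the helper only that group's slice. The following is a minimal sketch of that behaviour; the toy data and the helper len_seen() are hypothetical and not part of the question.

```r
library(dplyr)

# Hypothetical toy data, not taken from the question
toy <- data.frame(
  lake  = c("A", "A", "A", "B", "B"),
  value = c(1, 2, 3, 10, 20)
)

# Hypothetical helper: reports the length of the vector it was handed
len_seen <- function(x) rep(length(x), length(x))

# Without grouping, the helper sees the whole column at once
ungrouped <- toy %>%
  mutate(n_seen = len_seen(value))

# With grouping, the helper is called once per lake with that lake's rows only
grouped <- toy %>%
  group_by(lake) %>%
  mutate(n_seen = len_seen(value), prev = dplyr::lag(value)) %>%
  ungroup()

ungrouped$n_seen  # 5 5 5 5 5
grouped$n_seen    # 3 3 3 2 2
grouped$prev      # NA 1 2 NA 10 (the lag does not cross from lake A into lake B)
```

The same mechanism is what keeps the spring lag of one lake from leaking into the next lake in the output above.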

Data

dd <- structure(list(SeasonYear = c(
  "winter2020", "spring2020", "summer2020",
  "fall2020", "winter2021", "spring2021", "summer2021", "winter2020",
  "spring2020", "summer2020", "fall2020", "winter2021", "spring2021",
  "summer2021", "winter2020", "spring2020", "summer2020", "fall2020",
  "winter2021", "spring2021"
), change = c(
  0.007877245, 0.05851531,
  0.013850687, -0.071774781, -0.040268206, -0.020803715, 0.181610974,
  -0.029708916, -0.063310371, -0.054231575, 0.016057252, 0.011785717,
  -0.030677687, -0.01569172, -0.011974634, 0.168774234, -0.041486735,
  -0.095134974, -0.030310177, 0.033528325
), Lake_name = c(
  "AlanHenry",
  "AlanHenry", "AlanHenry", "AlanHenry", "AlanHenry", "AlanHenry",
  "AlanHenry", "Amistad", "Amistad", "Amistad", "Amistad", "Amistad",
  "Amistad", "Amistad", "AmonGCarter", "AmonGCarter", "AmonGCarter",
  "AmonGCarter", "AmonGCarter", "AmonGCarter"
)), class = "data.frame", row.names = c(
  "1",
  "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
  "14", "15", "16", "17", "18", "19", "20"
))

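【Comments】:

This works on the dd data set, but when I try to apply it to my own data I get an error. Error: Problem with `mutate()` input `lag`. x Problem with `mutate()` input `springmean`. x Input `springmean` can't be recycled to size 2. i Input `springmean` is `x[season == "spring"]`. i Input `springmean` must be size 2 or 1, not 0. i The error occurred in group 1: year = "2009". i Input `lag` is `lag_spring(change, SeasonYear)`. i The error occurred in group 1: year = "2009". I tried building my data frame the way you built dd, with dd <- structure(list(SeasonYear = c(raw.WL.season$SeasonYear), change = c(raw.WL.season$change), Lake_name = c(raw.WL.season$Lake_name)), class = "data.frame", row.names = c(1:nrow(raw.WL.season))), but I still get that error. I was wondering if you could help me figure out how to avoid it.

Hi @DavidSmith. I just made an edit and changed the function slightly. One problem with my function was that it only worked when a "spring" row existed to lag from. When that is not the case, x[season == "spring"] returns nothing and causes the error you received. I am not sure this is actually the problem, but you could give it a try.

Works perfectly now! Thank you!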
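
The recycling error discussed in these comments can be reproduced on a small scale. The sketch below uses hypothetical toy data in which the year 2009 has no spring row, and compares the unguarded expression with the `if (any(season == "spring"))` guard from the edited answer. It assumes at most one spring row per year; several spring rows in one year would still fail to recycle.

```r
library(dplyr)

# Hypothetical toy data: the year 2009 has no spring observation
toy <- data.frame(
  season = c("summer", "fall", "winter", "spring"),
  year   = c("2009", "2009", "2010", "2010"),
  x      = c(0.1, 0.2, 0.3, 0.4)
)

# Unguarded version: for 2009 the subset has length 0, which mutate() cannot
# recycle to the group size of 2, so an error is raised
failed <- tryCatch({
  toy %>%
    group_by(year) %>%
    mutate(springmean = x[season == "spring"])
  FALSE
}, error = function(e) TRUE)

# Guarded version: fall back to NA when the group has no spring row
guarded <- toy %>%
  group_by(year) %>%
  mutate(springmean = if (any(season == "spring")) x[season == "spring"] else NA) %>%
  ungroup()

failed              # TRUE
guarded$springmean  # NA NA 0.4 0.4
```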
