Pulling lagged data from a particular season but only for specific data sets as indicated by variable in R
【Posted】: 2022-01-17 07:20:54
【Question】: My initial inquiry came from this question: Pulling lagged data but only for a particular season in R
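That answered my question for a single data frame; however, I now have one large aggregated data frame and need to add a line of code that accounts for each individual data set (Lake_name).
Here is my data: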
SeasonYear change Lake_name
1 winter2020 0.007877245 AlanHenry
2 spring2020 0.058515310 AlanHenry
3 summer2020 0.013850687 AlanHenry
4 fall2020 -0.071774781 AlanHenry
5 winter2021 -0.040268206 AlanHenry
6 spring2021 -0.020803715 AlanHenry
7 summer2021 0.181610974 AlanHenry
8 winter2020 -0.029708916 Amistad
9 spring2020 -0.063310371 Amistad
10 summer2020 -0.054231575 Amistad
11 fall2020 0.016057252 Amistad
12 winter2021 0.011785717 Amistad
13 spring2021 -0.030677687 Amistad
14 summer2021 -0.015691720 Amistad
15 winter2020 -0.011974634 AmonGCarter
16 spring2020 0.168774234 AmonGCarter
17 summer2020 -0.041486735 AmonGCarter
18 fall2020 -0.095134974 AmonGCarter
19 winter2021 -0.030310177 AmonGCarter
20 spring2021 0.033528325 AmonGCarter
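I am trying to build a function that pulls the lag from the previous spring (see the previous post) but also accounts for each lake. I can do this if I split the data out lake by lake, but I have a fairly large data set and that would take a long time. Here is the code I am trying to use (modified from the post I referenced):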
library(dplyr)

lag_spring <- function(x, y, n = 1) {
  data.frame(x = x, season_year = y) %>%
    group_by(Lake_name) %>%  # my attempted addition
    # split e.g. "winter2020" into season = "winter" and year = "2020"
    tidyr::extract(season_year, into = c("season", "year"), regex = "^(.+?)(\\d{4})$") %>%
    group_by(year) %>%
    mutate(springmean = x[season == "spring"]) %>%
    ungroup() %>%
    group_by(season) %>%
    mutate(lag = ifelse(!season %in% c("summer", "fall"), lag(springmean, n = n), lag(springmean, n = n - 1))) %>%
    ungroup() %>%
    pull(lag)
}
I tried adding group_by(Lake_name) to do this for each lake, but when I run the code:
data %>% mutate(springlag = lag_spring(change, SeasonYear,n=1),
springlag2 = lag_spring(change, SeasonYear,n=2),
springlag3 = lag_spring(change, SeasonYear,n=3))
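I get this error:
Error: Problem with `mutate()` input `springlag`. x Must group by variables found in `.data`. * Column `Lake_name` is not found. i Input `springlag` is `lag_spring(change, SeasonYear, n = 1)`.
Can someone help me modify the code I was given earlier so that I still get "springlag", but with a line in dplyr that does this only within each individual lake?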
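The error message follows from how the helper is built: it receives only two vectors and assembles a fresh data frame from them, so no Lake_name column exists inside the function. A minimal base-R sketch of that first step (the values are the first three rows of the data above; `inner` is just an illustrative name):

```r
# Two of the columns from the question's data (first three rows only).
change     <- c(0.007877245, 0.058515310, 0.013850687)
SeasonYear <- c("winter2020", "spring2020", "summer2020")

# This mirrors the first line of lag_spring(): only x and season_year exist.
inner <- data.frame(x = change, season_year = SeasonYear)

names(inner)
"Lake_name" %in% names(inner)  # FALSE, so group_by(Lake_name) has nothing to group on
```

Grouping by lake therefore has to happen on the outer data frame, where Lake_name is actually a column.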
【Answer 1】: There is no need to change the function. You can get the result you want by calling group_by before the mutate that computes the lag:
library(tidyr)
library(dplyr)

lag_spring <- function(x, y, n = 1) {
  data.frame(x = x, season_year = y) %>%
    # split e.g. "winter2020" into season = "winter" and year = "2020"
    tidyr::extract(season_year, into = c("season", "year"), regex = "^(.+?)(\\d{4})$") %>%
    group_by(year) %>%
    # use NA when a year has no spring row instead of a zero-length vector
    mutate(springmean = if (any(season == "spring")) x[season == "spring"] else NA) %>%
    ungroup() %>%
    group_by(season) %>%
    mutate(lag = ifelse(!season %in% c("summer", "fall"), lag(springmean, n = n), lag(springmean, n = n - 1))) %>%
    ungroup() %>%
    pull(lag)
}

# group by lake first, so the helper only ever sees one lake's rows
dd %>%
  group_by(Lake_name) %>%
  mutate(lag = lag_spring(change, SeasonYear))
#> # A tibble: 20 × 4
#> # Groups: Lake_name [3]
#> SeasonYear change Lake_name lag
#> <chr> <dbl> <chr> <dbl>
#> 1 winter2020 0.00788 AlanHenry NA
#> 2 spring2020 0.0585 AlanHenry NA
#> 3 summer2020 0.0139 AlanHenry 0.0585
#> 4 fall2020 -0.0718 AlanHenry 0.0585
#> 5 winter2021 -0.0403 AlanHenry 0.0585
#> 6 spring2021 -0.0208 AlanHenry 0.0585
#> 7 summer2021 0.182 AlanHenry -0.0208
#> 8 winter2020 -0.0297 Amistad NA
#> 9 spring2020 -0.0633 Amistad NA
#> 10 summer2020 -0.0542 Amistad -0.0633
#> 11 fall2020 0.0161 Amistad -0.0633
#> 12 winter2021 0.0118 Amistad -0.0633
#> 13 spring2021 -0.0307 Amistad -0.0633
#> 14 summer2021 -0.0157 Amistad -0.0307
#> 15 winter2020 -0.0120 AmonGCarter NA
#> 16 spring2020 0.169 AmonGCarter NA
#> 17 summer2020 -0.0415 AmonGCarter 0.169
#> 18 fall2020 -0.0951 AmonGCarter 0.169
#> 19 winter2021 -0.0303 AmonGCarter 0.169
#> 20 spring2021 0.0335 AmonGCarter 0.169
Data
dd <- structure(list(SeasonYear = c(
"winter2020", "spring2020", "summer2020",
"fall2020", "winter2021", "spring2021", "summer2021", "winter2020",
"spring2020", "summer2020", "fall2020", "winter2021", "spring2021",
"summer2021", "winter2020", "spring2020", "summer2020", "fall2020",
"winter2021", "spring2021"
), change = c(
0.007877245, 0.05851531,
0.013850687, -0.071774781, -0.040268206, -0.020803715, 0.181610974,
-0.029708916, -0.063310371, -0.054231575, 0.016057252, 0.011785717,
-0.030677687, -0.01569172, -0.011974634, 0.168774234, -0.041486735,
-0.095134974, -0.030310177, 0.033528325
), Lake_name = c(
"AlanHenry",
"AlanHenry", "AlanHenry", "AlanHenry", "AlanHenry", "AlanHenry",
"AlanHenry", "Amistad", "Amistad", "Amistad", "Amistad", "Amistad",
"Amistad", "Amistad", "AmonGCarter", "AmonGCarter", "AmonGCarter",
"AmonGCarter", "AmonGCarter", "AmonGCarter"
)), class = "data.frame", row.names = c(
"1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15", "16", "17", "18", "19", "20"
))
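The question asked for three lag columns (n = 1, 2, 3). The following is a sketch of how the grouped call might be extended to cover them, assuming dplyr and tidyr are installed. The helper is repeated so the block runs on its own, and the data frame is rebuilt in a compact form (the `seasons` vector and `res` are illustrative names, not from the original post).

```r
library(dplyr)
library(tidyr)

# Same helper as in the answer, repeated here so the block is self-contained.
lag_spring <- function(x, y, n = 1) {
  data.frame(x = x, season_year = y) %>%
    tidyr::extract(season_year, into = c("season", "year"),
                   regex = "^(.+?)(\\d{4})$") %>%
    group_by(year) %>%
    mutate(springmean = if (any(season == "spring")) x[season == "spring"] else NA) %>%
    ungroup() %>%
    group_by(season) %>%
    mutate(lag = ifelse(!season %in% c("summer", "fall"),
                        lag(springmean, n = n),
                        lag(springmean, n = n - 1))) %>%
    ungroup() %>%
    pull(lag)
}

# Compact rebuild of the dd data frame shown above.
seasons <- c("winter2020", "spring2020", "summer2020", "fall2020",
             "winter2021", "spring2021", "summer2021")
dd <- data.frame(
  SeasonYear = c(seasons, seasons, seasons[1:6]),
  change = c(0.007877245, 0.05851531, 0.013850687, -0.071774781,
             -0.040268206, -0.020803715, 0.181610974,
             -0.029708916, -0.063310371, -0.054231575, 0.016057252,
             0.011785717, -0.030677687, -0.01569172,
             -0.011974634, 0.168774234, -0.041486735, -0.095134974,
             -0.030310177, 0.033528325),
  Lake_name = rep(c("AlanHenry", "Amistad", "AmonGCarter"), times = c(7, 7, 6))
)

# Grouping by lake means each call to lag_spring() sees one lake's rows only.
res <- dd %>%
  group_by(Lake_name) %>%
  mutate(springlag  = lag_spring(change, SeasonYear, n = 1),
         springlag2 = lag_spring(change, SeasonYear, n = 2),
         springlag3 = lag_spring(change, SeasonYear, n = 3)) %>%
  ungroup()

res
```

Because this sample covers only 2020 and 2021, springlag3 has no spring far enough back to draw on and comes out entirely NA; a longer record would be needed to see values there.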
【Comments】:
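This works on the dd data set, but when I try to apply it to my own data I get an error. Error: Problem with `mutate()` input `lag`. x Problem with `mutate()` input `springmean`. x Input `springmean` can't be recycled to size 2. i Input `springmean` is `x[season == "spring"]`. i Input `springmean` must be size 2 or 1, not 0. i The error occurred in group 1: year = "2009". i Input `lag` is `lag_spring(change, SeasonYear)`. i The error occurred in group 1: year = "2009".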
I tried building my data frame the same way as your dd <- structure(list(SeasonYear = c(raw.WL.season$SeasonYear), change = c(raw.WL.season$change), Lake_name = c(raw.WL.season$Lake_name)), class = "data.frame", row.names = c(1:nrow(raw.WL.season))), but I still get that error. I was wondering if you could help me figure out how to avoid it.
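Hi @DavidSmith. I just made an edit and changed the function slightly. One issue with my function was that it only works if there is a "spring" row to lag. If there is none, x[season == "spring"] returns nothing and causes the error you got. I am not sure this is really the problem, but you could give it a try.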
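The failure mode described in this comment can be reproduced in base R without dplyr. The sketch below uses made-up values for a year that has no spring observation and compares the unguarded subset with the guarded version from the edited answer:

```r
x      <- c(0.12, -0.05)        # made-up values for illustration
season <- c("summer", "fall")   # a year with no spring observation

# Original version: logical subsetting with no TRUE returns a zero-length vector.
unguarded <- x[season == "spring"]
length(unguarded)  # 0, which mutate() cannot recycle to the group size

# Edited version: fall back to a single NA when no spring row exists.
guarded <- if (any(season == "spring")) x[season == "spring"] else NA
guarded            # NA, length 1, so it recycles across the group
```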
Works perfectly now! Thank you!