How to create a variable that is the sum of consecutive rows within a given time frame and by id

Posted 2023-03-04

【Title】: How to create a variable that is the sum of consecutive rows within a given time frame and by id 【Posted】: 2021-09-27 03:43:42 【Question】:

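I am trying to compute, in R, a sum of consecutive values that fall within 365 days of each other, grouped by a unique identifier. For example, for date 1 of a specific ID, we would add the costs for dates 2, 3 and 4 of the same ID (if they fall within 365 days) to get the total cost for date 1. Then for date 2 we would add dates 3 and 4 to get its total cost, and so on. I have tried several rolling sums (R dplyr rolling sum) and similar dplyr solutions that sum consecutive values under some constraint (Calculate sum of a column if the difference between consecutive rows meets a condition), but I cannot get the code to take the number of days between rows into account. I have included an example dataset and a solution dataset to show what I am looking for.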

Starting dataset:

ID <- c(1,1,1,1,1,1,2,2,2,2,3)
admitdt <-c("2014-10-19","2014-10-24","2015-01-31","2016-01-20","2017-06-30","2017-07-17","2015-04-21","2015-04-22","2015-05-04","2015-07-25","2014-11-11")
cost<-c(2000,14077,5000,200,560,5000,888,5959,1819,7508,6406)
cost365<-c(21077,19077,5200,200,5560,5000,16174,15286,9327,7508,6406)
df2<-data.frame(ID,admitdt,cost,cost365)

  ID    admitdt  cost
1   1 2014-10-19  2000
2   1 2014-10-24 14077
3   1 2015-01-31  5000
4   1 2016-01-20   200
5   1 2017-06-30   560
6   1 2017-07-17  5000
7   2 2015-04-21   888
8   2 2015-04-22  5959
9   2 2015-05-04  1819
10  2 2015-07-25  7508
11  3 2014-11-11  6406

Desired result:

ID <- c(1,1,1,1,1,1,2,2,2,2,3)
admitdt <-c("2014-10-19","2014-10-24","2015-01-31","2016-01-20","2017-06-30","2017-07-17","2015-04-21","2015-04-22","2015-05-04","2015-07-25","2014-11-11")
cost<-c(2000,14077,5000,200,560,5000,888,5959,1819,7508,6406)
cost365<-c(21077,19077,5200,200,5560,5000,16174,15286,9327,7508,6406)
df2<-data.frame(ID,admitdt,cost,cost365)
  ID    admitdt  cost cost365
1   1 2014-10-19  2000   21077
2   1 2014-10-24 14077   19077
3   1 2015-01-31  5000    5200
4   1 2016-01-20   200     200
5   1 2017-06-30   560    5560
6   1 2017-07-17  5000    5000
7   2 2015-04-21   888   16174
8   2 2015-04-22  5959   15286
9   2 2015-05-04  1819    9327
10  2 2015-07-25  7508    7508
11  3 2014-11-11  6406    6406
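
To make the windowing rule concrete, here is a small base R check of the first expected value for ID 1. It is only an illustration of the rule described in the question (a row's own date through 365 days later, end points included); the variable names start and in_window are made up for this sketch.

```r
# Rows for ID 1 only, copied from the example data above
admitdt <- as.Date(c("2014-10-19", "2014-10-24", "2015-01-31",
                     "2016-01-20", "2017-06-30", "2017-07-17"))
cost <- c(2000, 14077, 5000, 200, 560, 5000)

# Window for the first admission: its own date through 365 days later
start <- admitdt[1]
in_window <- admitdt >= start & admitdt <= start + 365

admitdt[in_window]     # the three admissions dated 2014-10-19 to 2015-01-31
sum(cost[in_window])   # 2000 + 14077 + 5000 = 21077
```

The 2016-01-20 admission falls outside the window, which is why the first cost365 value is 21077 rather than 21277.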

【Comments】:

How many rows does the real data have?
@IanCampbell, there are roughly a hundred thousand rows.

【Answer 1】:

We could also use the following solution:

library(dplyr)
library(purrr)
library(lubridate)

df2 %>%
  mutate(rolls = map2(ymd(admitdt), ID, ~ df2 %>%
                        filter(ID == .y & ymd(admitdt) %within% interval(.x, .x + 365)) %>%
                        pull(cost) %>%
                        reduce(`+`)))

   ID    admitdt  cost cost365 rolls
1   1 2014-10-19  2000   21077 21077
2   1 2014-10-24 14077   19077 19077
3   1 2015-01-31  5000    5200  5200
4   1 2016-01-20   200     200   200
5   1 2017-06-30   560    5560  5560
6   1 2017-07-17  5000    5000  5000
7   2 2015-04-21   888   16174 16174
8   2 2015-04-22  5959   15286 15286
9   2 2015-05-04  1819    9327  9327
10  2 2015-07-25  7508    7508  7508
11  3 2014-11-11  6406    6406  6406

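Or in base R:

df2$rolls <- mapply(function(x, y) {
  # parse the dates so they can be compared with the window bounds
  df2 <- transform(df2, admitdt = as.Date(admitdt, format = "%Y-%m-%d"))
  # rows of the same ID dated from y through y + 365 days
  tmp <- subset(df2, ID == x & admitdt >= y & admitdt <= y + 365)
  sum(tmp$cost)
}, df2$ID, as.Date(df2$admitdt, format = "%Y-%m-%d"))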
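
The base R approach above filters the whole data frame once per row. As a hedged alternative sketch (not part of the original answer), the dates can be converted once and the window sum computed separately within each ID. The helper name window_sum and its days argument are hypothetical.

```r
df <- data.frame(
  ID = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3),
  admitdt = as.Date(c("2014-10-19", "2014-10-24", "2015-01-31", "2016-01-20",
                      "2017-06-30", "2017-07-17", "2015-04-21", "2015-04-22",
                      "2015-05-04", "2015-07-25", "2014-11-11")),
  cost = c(2000, 14077, 5000, 200, 560, 5000, 888, 5959, 1819, 7508, 6406)
)

# Hypothetical helper: for each date within one ID, sum the costs dated
# from that day through `days` days later (both ends included)
window_sum <- function(dates, cost, days = 365) {
  vapply(seq_along(dates), function(i) {
    sum(cost[dates >= dates[i] & dates <= dates[i] + days])
  }, numeric(1))
}

# Compute per ID, then unsplit() restores the original row order
per_id <- lapply(split(df, df$ID), function(g) window_sum(g$admitdt, g$cost))
df$cost365 <- unsplit(per_id, df$ID)
df
```

On the example data this gives the same cost365 column as the desired result in the question.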

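【Comments】:

【Answer 2】:

I have given two approaches, one with slider and one with runner. Of the two I prefer slider because of its clearer syntax. The strategy is nearly the same in both, though.

The date column acts as the index in both. slider gives more control because it has both .before and .after arguments (spelled before and after in slide_index_sum()); in this case you only need after = days(365), which works thanks to its integration with lubridate. In runner the window k always looks backward, so I used lag = -364 there. The rest should be clear. Feel free to ask if you need further clarification.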
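
The two parameterisations can be compared by writing out the window bounds for a single date. The sketch below uses only base R date arithmetic. It assumes that runner's window for an index value d runs from d - lag - k + 1 through d - lag; that formula is my reading of the runner documentation and should be checked against the package before relying on it.

```r
d <- as.Date("2014-10-19")

# slider: before defaults to 0 and after = 365 days, so the window is [d, d + 365]
slider_window <- c(from = d, to = d + 365)

# runner (assumed formula): [d - lag - k + 1, d - lag]
k <- 365
lag <- -364
runner_window <- c(from = d - lag - k + 1, to = d - lag)

slider_window
runner_window
```

Under that assumption both windows start on the row's own date, but the runner window ends one day earlier (d + 364 instead of d + 365). The example data has no admission exactly 365 days after another one for the same ID, so both give identical results here.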

With slider you can do:

library(tidyverse)

ID <- c(1,1,1,1,1,1,2,2,2,2,3)
admitdt <-c("2014-10-19","2014-10-24","2015-01-31","2016-01-20","2017-06-30","2017-07-17","2015-04-21","2015-04-22","2015-05-04","2015-07-25","2014-11-11")
cost<-c(2000,14077,5000,200,560,5000,888,5959,1819,7508,6406)
cost365<-c(21077,19077,5200,200,5560,5000,16174,15286,9327,7508,6406)
df<-data.frame(ID,admitdt,cost)

df
#>    ID    admitdt  cost
#> 1   1 2014-10-19  2000
#> 2   1 2014-10-24 14077
#> 3   1 2015-01-31  5000
#> 4   1 2016-01-20   200
#> 5   1 2017-06-30   560
#> 6   1 2017-07-17  5000
#> 7   2 2015-04-21   888
#> 8   2 2015-04-22  5959
#> 9   2 2015-05-04  1819
#> 10  2 2015-07-25  7508
#> 11  3 2014-11-11  6406

library(slider)
library(lubridate)

df %>% group_by(ID) %>%
  mutate(admitdt = as.Date(admitdt),
              cost365 = slider::slide_index_sum(x = cost,
                                                i = admitdt,
                                                after = days(365)))
#> # A tibble: 11 x 4
#> # Groups:   ID [3]
#>       ID admitdt     cost cost365
#>    <dbl> <date>     <dbl>   <dbl>
#>  1     1 2014-10-19  2000   21077
#>  2     1 2014-10-24 14077   19077
#>  3     1 2015-01-31  5000    5200
#>  4     1 2016-01-20   200     200
#>  5     1 2017-06-30   560    5560
#>  6     1 2017-07-17  5000    5000
#>  7     2 2015-04-21   888   16174
#>  8     2 2015-04-22  5959   15286
#>  9     2 2015-05-04  1819    9327
#> 10     2 2015-07-25  7508    7508
#> 11     3 2014-11-11  6406    6406

Or with runner:

library(dplyr, warn.conflicts = F)

ID <- c(1,1,1,1,1,1,2,2,2,2,3)
admitdt <-c("2014-10-19","2014-10-24","2015-01-31","2016-01-20","2017-06-30","2017-07-17","2015-04-21","2015-04-22","2015-05-04","2015-07-25","2014-11-11")
cost<-c(2000,14077,5000,200,560,5000,888,5959,1819,7508,6406)
cost365<-c(21077,19077,5200,200,5560,5000,16174,15286,9327,7508,6406)
df<-data.frame(ID,admitdt,cost)

library(runner)

df %>% group_by(ID) %>%
  mutate(admitdt = as.Date(admitdt),
         cost365 = runner::sum_run(x = cost,
                                   idx = admitdt,
                                   k = 365,
                                   lag = -364))
#> # A tibble: 11 x 4
#> # Groups:   ID [3]
#>       ID admitdt     cost cost365
#>    <dbl> <date>     <dbl>   <dbl>
#>  1     1 2014-10-19  2000   21077
#>  2     1 2014-10-24 14077   19077
#>  3     1 2015-01-31  5000    5200
#>  4     1 2016-01-20   200     200
#>  5     1 2017-06-30   560    5560
#>  6     1 2017-07-17  5000    5000
#>  7     2 2015-04-21   888   16174
#>  8     2 2015-04-22  5959   15286
#>  9     2 2015-05-04  1819    9327
#> 10     2 2015-07-25  7508    7508
#> 11     3 2014-11-11  6406    6406

Created on 2021-07-19 by the reprex package (v2.0.0)

【Comments】:

Thanks @AnilGoyal, this solution works too.

【Answer 3】:

Here is an approach with purrr::map:

library(dplyr); library(purrr)
df2 %>%
  mutate(admitdt = as.Date(admitdt)) %>%
  group_by(ID) %>%
  mutate(cost365 = map_dbl(admitdt,~sum(cost[(.x - admitdt) <= 0 &
                                             (.x - admitdt) >= -365])))
# A tibble: 11 x 4
# Groups:   ID [3]
      ID admitdt     cost cost365
   <dbl> <date>     <dbl>   <dbl>
 1     1 2014-10-19  2000   21077
 2     1 2014-10-24 14077   19077
 3     1 2015-01-31  5000    5200
 4     1 2016-01-20   200     200
 5     1 2017-06-30   560    5560
 6     1 2017-07-17  5000    5000
 7     2 2015-04-21   888   16174
 8     2 2015-04-22  5959   15286
 9     2 2015-05-04  1819    9327
10     2 2015-07-25  7508    7508
11     3 2014-11-11  6406    6406
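
The map_dbl() call compares every date with every other date in its group. For data on the order of the hundred thousand rows mentioned in the comments, a possible alternative (a sketch, not something proposed in the thread) is to sort each group once, take cumulative sums, and locate the window ends with findInterval(). The helper name window_sum_fast is hypothetical.

```r
df <- data.frame(
  ID = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3),
  admitdt = as.Date(c("2014-10-19", "2014-10-24", "2015-01-31", "2016-01-20",
                      "2017-06-30", "2017-07-17", "2015-04-21", "2015-04-22",
                      "2015-05-04", "2015-07-25", "2014-11-11")),
  cost = c(2000, 14077, 5000, 200, 560, 5000, 888, 5959, 1819, 7508, 6406)
)

# Hypothetical helper: forward-looking window sum via sorting + cumulative sums
window_sum_fast <- function(dates, cost, days = 365) {
  ord <- order(dates)
  d <- as.numeric(dates[ord])                 # sorted dates as day counts
  cs <- c(0, cumsum(cost[ord]))               # cs[j + 1] = total of first j costs
  hi <- findInterval(d + days, d)             # how many dates are <= d + days
  lo <- findInterval(d, d, left.open = TRUE)  # how many dates are strictly < d
  out <- numeric(length(d))
  out[ord] <- cs[hi + 1] - cs[lo + 1]         # costs at sorted positions lo+1..hi
  out
}

# Apply within each ID and put the results back in the original row order
per_id <- lapply(split(df, df$ID), function(g) window_sum_fast(g$admitdt, g$cost))
df$cost365 <- unsplit(per_id, df$ID)
df
```

Because the window ends are found by binary search on the sorted dates, the work per group grows roughly with n log n instead of n squared. Rows of the same ID that share a date are all counted in each other's windows, which matches the inclusive comparisons used in the answers.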

【Comments】:

In Ian Campbell we trust. Nice approach @Ian, +1
Thanks @IanCampbell for the answer, it works great.
