How to write a loop in which a dataset is split and the slope of the trendline for each split is given in R

Posted 2023-04-18


【Title】: how to write a loop in which a dataset is split and the slope of the trendline for each split is given in R 【Posted】: 2017-09-12 16:01:01 【Question】:

I have a problem that I could not solve after hours of research, so maybe one of you can help me with it:

My data frame looks like this:

stations_id phase_id refyear day  
140 10 1992 260  
140 10 1993 263   
140 10 1995 260  
140 10 1995 257   
140 12 1993 286  
140 12 1994 289  
140 12 1997 290  
150 10 1992 260  
150 10 1993 270  
150 10 1994 274  
165 15 1992 310

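The data frame has about 600,000 rows, so I have been desperately trying to write a for loop that outputs the slope of the regression line, with "refyear" as the independent variable and "day" as the dependent variable, for every combination of "stations_id" and "phase_id"; the split therefore depends on two variables. However, I really cannot find a solution and would be very grateful if someone could help me!

Best regards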


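【Answer 1】:

With dplyr and broom you can fit a model per group without a loop and get back a data frame of model coefficients. In the code below, the regression coefficients are in the estimate column. Note that the formula is refyear ~ day, that is, refyear regressed on day, so the slope is in the rows where term equals "day". The question asks for the opposite direction (day as the dependent variable); for that, use day ~ refyear and read the slope from the rows where term equals "refyear".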

library(tidyverse)
library(broom)

# Fit one linear model per (stations_id, phase_id) group and tidy its coefficients
models = dat %>% group_by(stations_id, phase_id) %>% 
  do(tidy(lm(refyear ~ day, data=.)))

models

  stations_id phase_id        term     estimate    std.error   statistic     p.value
        <int>    <int>       <chr>        <dbl>        <dbl>       <dbl>       <dbl>
1         140       10 (Intercept) 2080.4166667  94.44595383  22.0275891 0.002054594
2         140       10         day   -0.3333333   0.36324158  -0.9176629 0.455668946
3         140       12 (Intercept) 1750.6923077 153.66666453  11.3927917 0.055736327
4         140       12         day    0.8461538   0.53293871   1.5877132 0.357824750
5         150       10 (Intercept) 1956.9230769   8.92887743 219.1678734 0.002904693
6         150       10         day    0.1346154   0.03330867   4.0414519 0.154420958
7         165       15 (Intercept) 1992.0000000          NaN         NaN         NaN
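The output above comes from refyear ~ day. A minimal sketch of the direction the question actually asks for (day ~ refyear) follows. It assumes the data frame is called dat as in the answer, rebuilds the sample rows from the question so it runs on its own, and uses dplyr::group_modify in place of do(). Dropping groups with fewer than two distinct years is an extra assumption made here, since no slope can be estimated for them.

```r
library(dplyr)
library(broom)

# Sample rows copied from the question; the real data has about 600,000 rows
dat <- tibble(
  stations_id = c(140, 140, 140, 140, 140, 140, 140, 150, 150, 150, 165),
  phase_id    = c(10, 10, 10, 10, 12, 12, 12, 10, 10, 10, 15),
  refyear     = c(1992, 1993, 1995, 1995, 1993, 1994, 1997, 1992, 1993, 1994, 1992),
  day         = c(260, 263, 260, 257, 286, 289, 290, 260, 270, 274, 310)
)

slopes <- dat %>%
  group_by(stations_id, phase_id) %>%
  # Assumption: skip groups where a slope cannot be estimated
  filter(n_distinct(refyear) >= 2) %>%
  # Fit day ~ refyear in each group and tidy the coefficients
  group_modify(~ tidy(lm(day ~ refyear, data = .x))) %>%
  ungroup() %>%
  # Keep only the slope row of each model
  filter(term == "refyear") %>%
  select(stations_id, phase_id, slope = estimate)

slopes
```

The result has one row per station/phase combination, with the slope in days per year. Station 165 is absent because it has a single observation.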


【Answer 2】:

Here is a tidyverse/purrr solution, which I think is cleaner than a for-loop version.

library(tidyverse)
library(purrr)
d <- read_csv("stations_id, phase_id, refyear, day  
140, 10, 1992, 260  
140, 10, 1993, 263   
140, 10, 1995, 260  
140, 10, 1995, 257   
140, 12, 1993, 286  
140, 12, 1994, 289  
140, 12, 1997, 290  
150, 10, 1992, 260  
150, 10, 1993, 270  
150, 10, 1994, 274  
165, 15, 1992, 310")

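# Nest the data: one row per (stations_id, phase_id) with a list column `data`
nested <- d %>% 
  group_by(stations_id, phase_id) %>% 
  nest()  

# Fit day ~ refyear within each group and store the models in a list column
nested <- nested %>% 
  mutate(mod = map(data, ~lm(day ~ refyear, data = .)))

# Coefficients (intercept and slope) of each group's model
map(nested$mod, coef)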

[[1]]
 (Intercept)      refyear 
2032.2222222   -0.8888889 

[[2]]
  (Intercept)       refyear 
-1399.4615385     0.8461538 

[[3]]
(Intercept)     refyear 
     -13683           7 

[[4]]
(Intercept)     refyear 
        310          NA
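The list printed above does not show which station and phase each model belongs to. As a rough sketch, the slope can instead be stored as a regular column next to the grouping variables with purrr::map_dbl. The object name slopes and the column name slope are illustrative choices, not part of the original answer, and the data is rebuilt with tibble() here so the block runs on its own.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Same sample rows as in the question
d <- tibble(
  stations_id = c(140, 140, 140, 140, 140, 140, 140, 150, 150, 150, 165),
  phase_id    = c(10, 10, 10, 10, 12, 12, 12, 10, 10, 10, 15),
  refyear     = c(1992, 1993, 1995, 1995, 1993, 1994, 1997, 1992, 1993, 1994, 1992),
  day         = c(260, 263, 260, 257, 286, 289, 290, 260, 270, 274, 310)
)

slopes <- d %>%
  group_by(stations_id, phase_id) %>%
  nest() %>%
  mutate(
    # One model per group, kept in a list column
    mod   = map(data, ~ lm(day ~ refyear, data = .x)),
    # Pull the refyear coefficient out of each model as a plain number
    slope = map_dbl(mod, ~ coef(.x)[["refyear"]])
  ) %>%
  ungroup() %>%
  select(stations_id, phase_id, slope)

slopes
```

Groups with a single observation, such as station 165, keep their row but get NA as the slope, matching the NA coefficient in the output above.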


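【Answer 3】:

You can use the tidyverse for this.

First group by your variables, then tidyr::nest the data by that grouping. You now have a list column that holds, for each combination of the grouping variables, all the data of the non-grouping variables.

Next, use purrr::map inside dplyr::mutate to iterate over the new list column and fit your model to each individual data frame in it. This gives you an additional list column containing the models. You can then iterate over those in turn to pull the coefficient you need out of each model.

Finally, you can select just the slopes, which gives one row per combination of the grouping variables with the slope of its model. Or you can unnest the data and add the slope as a new column that is repeated for all rows sharing the same grouping values.

For a more detailed guide to this kind of workflow, see the chapter on many models in R for Data Science.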

library(tidyverse)

nested <- mtcars %>% 
  select(cyl, mpg, wt) %>% 
  group_by(cyl) %>% 
  nest()

#> # A tibble: 3 x 2
#>     cyl              data
#>   <dbl>            <list>
#> 1     6  <tibble [7 x 2]>
#> 2     4 <tibble [11 x 2]>
#> 3     8 <tibble [14 x 2]>

models <- nested %>% 
  mutate(
    model = map(data, ~lm(mpg ~ wt, data = .x)),
    slope = map_dbl(model, c("coefficients", "wt"))
  )

#> # A tibble: 3 x 4
#>     cyl              data    model     slope
#>   <dbl>            <list>   <list>     <dbl>
#> 1     6  <tibble [7 x 2]> <S3: lm> -2.780106
#> 2     4 <tibble [11 x 2]> <S3: lm> -5.647025
#> 3     8 <tibble [14 x 2]> <S3: lm> -2.192438

models %>% select(cyl, slope)

#> # A tibble: 3 x 2
#>     cyl     slope
#>   <dbl>     <dbl>
#> 1     6 -2.780106
#> 2     4 -5.647025
#> 3     8 -2.192438

# Note: tidyr >= 1.0.0 warns unless the column is named, e.g. unnest(cols = c(data))
models %>% select(-model) %>% unnest()

#> # A tibble: 32 x 4
#>      cyl     slope   mpg    wt
#>    <dbl>     <dbl> <dbl> <dbl>
#>  1     6 -2.780106  21.0 2.620
#>  2     6 -2.780106  21.0 2.875
#>  3     6 -2.780106  21.4 3.215
#>  4     6 -2.780106  18.1 3.460
#>  5     6 -2.780106  19.2 3.440
#>  6     6 -2.780106  17.8 3.440
#>  7     6 -2.780106  19.7 2.770
#>  8     4 -5.647025  22.8 2.320
#>  9     4 -5.647025  24.4 3.190
#> 10     4 -5.647025  22.8 3.150
#> # ... with 22 more rows
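The same split, fit, extract workflow can also be written in base R, which may be useful if the tidyverse is not available. This is only a sketch applied to the question's two grouping variables; the names dat, groups, slopes and result are illustrative and do not come from the answer.

```r
# Sample rows copied from the question
dat <- data.frame(
  stations_id = c(140, 140, 140, 140, 140, 140, 140, 150, 150, 150, 165),
  phase_id    = c(10, 10, 10, 10, 12, 12, 12, 10, 10, 10, 15),
  refyear     = c(1992, 1993, 1995, 1995, 1993, 1994, 1997, 1992, 1993, 1994, 1992),
  day         = c(260, 263, 260, 257, 286, 289, 290, 260, 270, 274, 310)
)

# Split on both grouping variables; drop = TRUE omits combinations with no rows
groups <- split(dat, list(dat$stations_id, dat$phase_id), drop = TRUE)

# Fit day ~ refyear in each piece and keep only the slope
slopes <- vapply(
  groups,
  function(g) coef(lm(day ~ refyear, data = g))[["refyear"]],
  numeric(1)
)

# Reattach the grouping values, taken from the first row of each piece
keys   <- do.call(rbind, lapply(groups, function(g) g[1, c("stations_id", "phase_id")]))
result <- data.frame(keys, slope = slopes, row.names = NULL)

result
```

Here split() plays the role of group_by() plus nest(), and vapply() the role of map_dbl(). A group with only one observation gets NA as its slope.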

