查找 p 值和 z 统计量以及 OLS 线性回归

Posted 2023-02-18

技术标签:

【中文标题】查找 p 值和 z 统计量以及 OLS 线性回归【英文标题】：Finding p-value and z statistics along with the OLS Linear regression 【发布时间】：2021-11-26 11:33:32 【问题描述】：

我可以从线性回归中找到系数和截距，但找不到合适的方法来获取相应变量趋势的 p 值和 z 值。此外，无法找到将输出结果保存为 excel 格式的方法。数据为here。时间有 24 个变量。我没有得到 z 统计量和 p 值，此外，第一种方法的估计也是不正确的。我哪里错了？

library("trend")

# read ozone data (I converted to a text file first)
otm <- read.table("D:/data.txt",header=T)


#  make a data frame version
otm_df <- data.frame(otm)
markers <- sample(0:1, replace = T, size = 11)

# calculate OLS slope for all columns
# the -1 at end removes the intercepts
ols <- sapply(otm_df, function(x) coef(lm(markers ~ x))[-1])

我试过这个方法。我没有得到 z 统计数据，也无法将其保存为 excel 格式。

library(reshape2)
DF <- reshape2::melt(otm, id.var = "Year")
library(broom); library(tidyverse)
ols <- DF %>% nest(data = -variable) %>% 
  mutate(model = map(data, ~lm(value ~ Year, data = .)), 
         tidied = map(model, tidy)) %>% 
  unnest(tidied)

#to save the results in excel format (not working here for me)
capture.output(summary(ols), file = "ols.csv" )
write.csv(ols, file.path('E:/',filename = "ols2.csv"), row.names = TRUE)

# A tibble: 48 x 8
   variable data              model  term         estimate std.error statistic p.value
   <fct>    <list>            <list> <chr>           <dbl>     <dbl>     <dbl>   <dbl>
 1 BanTES   <tibble [11 x 2]> <lm>   (Intercept) -236.       488.       -0.483   0.641
 2 BanTES   <tibble [11 x 2]> <lm>   Year           0.139      0.242     0.572   0.582
 3 SriTES   <tibble [11 x 2]> <lm>   (Intercept)  220.       351.        0.627   0.546
 4 SriTES   <tibble [11 x 2]> <lm>   Year          -0.0935     0.174    -0.536   0.605
 5 AfgTES   <tibble [11 x 2]> <lm>   (Intercept)  364.       444.        0.820   0.434
 6 AfgTES   <tibble [11 x 2]> <lm>   Year          -0.161      0.221    -0.730   0.484
 7 BhuTES   <tibble [11 x 2]> <lm>   (Intercept)  373.       831.        0.449   0.664
 8 BhuTES   <tibble [11 x 2]> <lm>   Year          -0.170      0.413    -0.412   0.690
 9 IndTES   <tibble [11 x 2]> <lm>   (Intercept) -342.       213.       -1.60    0.143
10 IndTES   <tibble [11 x 2]> <lm>   Year           0.190      0.106     1.80    0.106

summary(ols)

    variable  data.Length  data.Class  data.Mode model.Length  model.Class  model.Mode     term          
 BanTES : 2   2       tbl_df  list               12    lm    list                      Length:48         
 SriTES : 2   2       tbl_df  list               12    lm    list                      Class :character  
 AfgTES : 2   2       tbl_df  list               12    lm    list                      Mode  :character  
 BhuTES : 2   2       tbl_df  list               12    lm    list                                        
 IndTES : 2   2       tbl_df  list               12    lm    list                                        
 NepTES : 2   2       tbl_df  list               12    lm    list                                        
 (Other):36   2       tbl_df  list               12    lm    list

任何帮助都会很有用。提前谢谢你！

【问题讨论】：

help("summary.lm") 没有关于 p 值和 z 统计的内容有（当然它是 t 统计量而不是 z 统计量，因为它应该是）。我用 excel 检查了估算值。 excel和R软件的OLS斜率值不匹配。为什么会这样？请provide a minimal reproducible example。 【参考方案1】：

线性回归不会像@Roland 正确评论的那样为您提供 Z 统计量，而是线性回归为您提供 t 统计量。您可以使用以下代码将coeff、t统计量和p值保存为excel格式

library(tidyverse)
library(broom)
library(openxlsx)

# read ozone data (I converted to a text file first)
otm <- read.table("data.txt", header=T)
head(otm, 2)

otm %>% 
  pivot_longer(-c("Year")) %>%
  group_by(name) %>% 
  do(fitlm = tidy(lm(value ~ Year, data = ., na.action=na.omit))) %>% 
  unnest(fitlm) %>% 
  write.xlsx("Linear_trend_1.xlsx")

在“Linear_trend_1.xlsx”文件中，每个位置的年份行为您提供斜率，统计量为 t 统计量。

第一种方法的值与第二种方法不匹配，因为第一种方法使用markers 作为因变量，而otm_df 中的所有变量都用作自变量。在第二种方法中，Year被用作自变量，而所有位置值都是因变量。另一件事，您正在使用sample 函数创建markers，这将为您每次运行提供不同的输出。所以你必须使用set.seed 让它在每次运行时都保持不变

set.seed(123)
otm %>% 
  mutate(markers = sample(0:1, replace = T, size = 11)) %>% 
  pivot_longer(-c("Year", "markers")) %>%
  group_by(name) %>% 
  do(fitlm = tidy(lm(markers ~ value, data = ., na.action=na.omit))) %>% 
  unnest(fitlm) %>% 
  write.xlsx("Linear_trend_2.xlsx")

# A tibble: 48 x 6
   name   term        estimate std.error statistic p.value
   <chr>  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
 1 AfgERA (Intercept)  -8.58      7.79      -1.10    0.299
 2 AfgERA value         0.577     0.498      1.16    0.276
 3 AfgTES (Intercept)  -1.62      2.99      -0.542   0.601
 4 AfgTES value         0.0521    0.0750     0.695   0.505
 5 AfgTOM (Intercept) -25.3      19.5       -1.30    0.225
 6 AfgTOM value         1.94      1.46       1.33    0.218
 7 BanERA (Intercept)  -2.28      5.03      -0.453   0.662
 8 BanERA value         0.201     0.370      0.543   0.600
 9 BanTES (Intercept)  -3.42      2.80      -1.22    0.253
10 BanTES value         0.0892    0.0644     1.39    0.199
# ... with 38 more rows

如果我像下面这样修改你的第一种方法，你可以看到上面代码的输出和你的代码基本相同

#  make a data frame version
otm_df <- data.frame(otm)
set.seed(123) #To have same output from sample function for every run
markers <- sample(0:1, replace = T, size = 11)

# calculate OLS slope for all columns
# the -1 at end removes the intercepts
ols <- sapply(otm_df, function(x) coef(lm(markers ~ x))[-1])

Year.x     BanTES.x     SriTES.x     AfgTES.x     BhuTES.x     IndTES.x 
 0.054545455  0.089176876  0.092629314  0.052132553  0.001602003  0.107446434 
    NepTES.x     PakTES.x      SATES.x     BanERA.x     SriERA.x     AfgERA.x 
 0.065125607 -0.115108438  0.315928753  0.200712285 -0.519996739  0.577317012 
    BhuERA.x     IndERA.x     NepERA.x     PakERA.x      SAERA.x     BanTOM.x 
-0.140990921  0.110578204 -0.030546686  0.265909319  0.176797510  1.897234338 
    SriTOM.x     AfgTOM.x     BhuTOM.x     IndTOm.x     NepTOM.x     PakTOM.x 
 5.380610477  1.935170281  1.761172711  2.248531891  2.107380452  2.011356580 
     SATOM.x 
 2.214848959

这是dput格式的数据

dput(otm)
structure(list(Year = 2008:2018, BanTES = c(44.06247376, 43.81239107, 
40.68010622, 46.97760506, 37.49591135, 43.81239107, 43.81239107, 
43.81239107, 43.81239107, 45.27803189, 44.06247376), SriTES = c(32.07268265, 
35.01918477, 29.91018035, 34.1291577, 28.5258431, 32.07268265, 
32.07268265, 32.07268265, 32.07268265, 30.96753552, 32.07268265
), AfgTES = c(39.19328867, 42.06325898, 42.31015918, 40.54543762, 
34.28696385, 40.54543762, 40.54543762, 40.54543762, 40.54543762, 
37.38974643, 39.19328867), BhuTES = c(34.08241824, 36.95440954, 
30.41561338, 30.37004394, 19.8861367, 30.41561338, 30.41561338, 
30.41561338, 30.41561338, 32.09933763, 32.09933763), IndTES = c(40.05913352, 
41.54741392, 38.88957844, 42.47544504, 43.24350644, 41.54741392, 
41.54741392, 41.54741392, 41.54741392, 42.65820983, 42.47544504
), NepTES = c(38.12979871, 37.62785275, 34.40488247, 37.7995467, 
39.64286364, 37.7995467, 37.7995467, 37.7995467, 37.7995467, 
30.63632105, 37.7995467), PakTES = c(41.38388734, 41.99865359, 
42.16236093, 42.51838941, 43.4444952, 42.16236093, 42.16236093, 
42.16236093, 42.16236093, 44.96627251, 42.51838941), SATES = c(40.03077, 
41.52302, 39.6327, 41.9098, 41.11191, 41.11191, 41.11191, 41.11191, 
41.11191, 41.57009, 41.52302), BanERA = c(13.76686693, 13.904453, 
13.40584856, 13.45199721, 13.47657436, 12.8992102, 13.3586098, 
14.23223365, 13.4228729, 13.21487616, 14.50830571), SriERA = c(11.81852768, 
11.79187354, 11.51484349, 11.50552588, 11.489789, 11.23384852, 
10.61182708, 11.33951759, 11.6357584, 11.74685028, 12.14987906
), AfgERA = c(15.44115983, 15.425995, 15.6161623, 15.47751927, 
15.81748069, 15.47498417, 15.41748855, 16.06462541, 15.61143062, 
15.32810621, 16.39162424), BhuERA = c(14.34493453, 14.28085419, 
14.24728543, 14.03202106, 14.04152053, 13.42977221, 13.22665229, 
14.344052, 13.58792484, 13.28851619, 14.28029524), IndERA = c(14.08262362, 
14.11485037, 13.80713493, 13.86114379, 13.92607879, 13.37996473, 
13.45767152, 14.49365275, 13.88142768, 13.73986257, 14.77032404
), NepERA = c(14.93883379, 14.896056, 14.78607828, 14.50880606, 
14.69793299, 13.96309811, 14.18825383, 15.32530354, 14.38700954, 
13.98545482, 14.9828434), PakERA = c(14.89773191, 14.87337075, 
14.89837223, 14.76236826, 15.13918051, 14.61385609, 14.54589641, 
15.40150813, 14.8588883, 14.62185208, 15.6575491), SAERA = c(14.38877, 
14.40468, 14.22069, 14.20561, 14.35855, 13.8399, 13.88027, 14.83054, 
14.24554, 14.06201, 15.09615), BanTOM = c(9.317937851, 9.308046341, 
9.327401161, 9.319338799, 9.311285019, 9.300317764, 9.292790413, 
9.540946007, 9.04840374, 9.300317764, 9.317937851), SriTOM = c(5.437336445, 
5.436554909, 5.435492039, 5.440690994, 5.436693192, 5.440601349, 
5.427892685, 5.54946661, 5.427827358, 5.440601349, 5.437336445
), AfgTOM = c(13.31581736, 13.30339324, 13.30090284, 13.29781604, 
13.33800817, 13.31919873, 13.30073023, 13.62503445, 13.16488469, 
13.30073023, 13.31581736), BhuTOM = c(11.69911337, 11.67375898, 
11.71142626, 11.69099903, 11.68556881, 11.68714046, 11.65387106, 
11.97064924, 11.44872904, 11.67375898, 11.69099903), IndTOm = c(9.709311898, 
9.704142364, 9.703938368, 9.72520479, 9.709638531, 9.708799697, 
9.690851817, 9.952517961, 9.499369441, 9.704142364, 9.709638531
), NepTOM = c(12.45835066, 12.43677187, 12.48850822, 12.49002218, 
12.46283817, 12.50376368, 12.44072294, 12.78685617, 12.27891684, 
12.44072294, 12.46283817), PakTOM = c(12.37911913, 12.38028261, 
12.37067625, 12.38315158, 12.38352468, 12.36856567, 12.37349086, 
12.67422019, 12.18962786, 12.37349086, 12.38315158), SATOM = c(10.63543, 
10.62967, 10.62981, 10.64489, 10.63941, 10.63525, 10.6195, 10.89613, 
10.44028, 10.62967, 10.63941)), class = "data.frame", row.names = c(NA, 
-11L))

【讨论】：

谢谢大家我只是在检查结果，似乎第一种方法给出了 OLS 的正确结果，而不是第二种方法。第一种方法的 BanTES 的斜率值为 0.138595178（正确），而第二种方法给出的值为 0.089176876，这是不正确的。其实第一种方法是对的。但是为了与您的问题保持相似性，我展示了第二种方法。您可以根据需要获取输出。【参考方案2】：

您可以使用glance 代替tidy 作为p 值：

ols <- DF %>% 
  nest(data = -variable) %>% 
  mutate(model  = map(data, ~lm(value ~ Year, data = .)), 
         tidied = map(model, glance)) %>% 
  unnest(tidied)

> ols
# A tibble: 24 x 15
   variable data              model  r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC deviance df.residual  nobs
   <fct>    <list>            <list>     <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
 1 BanTES   <tibble [11 x 2]> <lm>     0.0350        -0.0722 2.54     0.327    0.582     1 -24.8   55.5  56.7    58.2            9    11
 2 SriTES   <tibble [11 x 2]> <lm>     0.0309        -0.0767 1.83     0.287    0.605     1 -21.2   48.3  49.5    30.1            9    11
 3 AfgTES   <tibble [11 x 2]> <lm>     0.0559        -0.0490 2.32     0.533    0.484     1 -23.7   53.5  54.7    48.2            9    11
 4 BhuTES   <tibble [11 x 2]> <lm>     0.0185        -0.0905 4.33     0.170    0.690     1 -30.6   67.3  68.4   169.             9    11
 5 IndTES   <tibble [11 x 2]> <lm>     0.264          0.183  1.11     3.23     0.106     1 -15.7   37.3  38.5    11.1            9    11
 6 NepTES   <tibble [11 x 2]> <lm>     0.0689        -0.0345 2.49     0.666    0.435     1 -24.5   55.0  56.2    55.6            9    11
 7 PakTES   <tibble [11 x 2]> <lm>     0.243          0.159  0.872    2.89     0.123     1 -13.0   32.0  33.2     6.84           9    11
 8 SATES    <tibble [11 x 2]> <lm>     0.221          0.135  0.625    2.56     0.144     1  -9.34  24.7  25.9     3.52           9    11
 9 BanERA   <tibble [11 x 2]> <lm>     0.0252        -0.0831 0.482    0.233    0.641     1  -6.49  19.0  20.2     2.09           9    11
10 SriERA   <tibble [11 x 2]> <lm>     0.00230       -0.109  0.416    0.0208   0.889     1  -4.87  15.7  16.9     1.56           9    11

【讨论】：

谢谢，但是斜率值在哪里？它是什么统计值？是 Z 统计量还是 t 统计量？如果它不是统计数据，那么为什么您估计的值与我得出的值不匹配？如何使用系数、z 统计量和 p 值将结果保存为 excel 格式？ @Bappa Das 和 @Roland 回答了 Z-stat 和 t-stat 问题。对于斜率和截距，您可以返回 tidy 而不是 glance。为了保存结果，我通常更喜欢write.csv2 函数。【参考方案3】：

关于将数据保存到 Excel，如果您的数据在 tibble 或 data.frame 中，则可以使用该函数：

readr::write_csv 将数据保存为 CSV 文件，可以在 Excel 中轻松打开； writexl::write_xlsx 将数据保存为原生 Excel 文件。

这两个函数都会将数据中的所有行和列保存到文件中。

【讨论】：

以上是关于查找 p 值和 z 统计量以及 OLS 线性回归的主要内容，如果未能解决你的问题，请参考以下文章

Mann -whitney 检验时有统计量U值和Z值，发表文章时用啥值比较好

R语言普通最小二乘(OLS)回归说明以及构建普通最小二乘(OLS)回归需要满足的四个假设（Normality（正态性）Independence（独立性）Linearity（线性度）方差齐性）

线性回归--假设检验（F统计量P-value）

spss 多元线性回归分析帮忙分析一下下图，F、P、t、p和r方各代表啥？？谢谢~