查找 p 值和 z 统计量以及 OLS 线性回归
Posted
技术标签:
【中文标题】查找 p 值和 z 统计量以及 OLS 线性回归【英文标题】:Finding p-value and z statistics along with the OLS Linear regression 【发布时间】:2021-11-26 11:33:32 【问题描述】:我可以从线性回归中找到系数和截距,但找不到合适的方法来获取相应变量趋势的 p 值和 z 值。此外,无法找到将输出结果保存为 excel 格式的方法。数据为here。时间有 24 个变量。我没有得到 z 统计量和 p 值,此外,第一种方法的估计也是不正确的。我哪里错了?
library("trend")
# read ozone data (I converted to a text file first)
otm <- read.table("D:/data.txt",header=T)
# make a data frame version
otm_df <- data.frame(otm)
markers <- sample(0:1, replace = T, size = 11)
# calculate OLS slope for all columns
# the -1 at end removes the intercepts
ols <- sapply(otm_df, function(x) coef(lm(markers ~ x))[-1])
我试过这个方法。我没有得到 z 统计数据,也无法将其保存为 excel 格式。
library(reshape2)
DF <- reshape2::melt(otm, id.var = "Year")
library(broom); library(tidyverse)
ols <- DF %>% nest(data = -variable) %>%
mutate(model = map(data, ~lm(value ~ Year, data = .)),
tidied = map(model, tidy)) %>%
unnest(tidied)
#to save the results in excel format (not working here for me)
capture.output(summary(ols), file = "ols.csv" )
write.csv(ols, file.path('E:/',filename = "ols2.csv"), row.names = TRUE)
# A tibble: 48 x 8
variable data model term estimate std.error statistic p.value
<fct> <list> <list> <chr> <dbl> <dbl> <dbl> <dbl>
1 BanTES <tibble [11 x 2]> <lm> (Intercept) -236. 488. -0.483 0.641
2 BanTES <tibble [11 x 2]> <lm> Year 0.139 0.242 0.572 0.582
3 SriTES <tibble [11 x 2]> <lm> (Intercept) 220. 351. 0.627 0.546
4 SriTES <tibble [11 x 2]> <lm> Year -0.0935 0.174 -0.536 0.605
5 AfgTES <tibble [11 x 2]> <lm> (Intercept) 364. 444. 0.820 0.434
6 AfgTES <tibble [11 x 2]> <lm> Year -0.161 0.221 -0.730 0.484
7 BhuTES <tibble [11 x 2]> <lm> (Intercept) 373. 831. 0.449 0.664
8 BhuTES <tibble [11 x 2]> <lm> Year -0.170 0.413 -0.412 0.690
9 IndTES <tibble [11 x 2]> <lm> (Intercept) -342. 213. -1.60 0.143
10 IndTES <tibble [11 x 2]> <lm> Year 0.190 0.106 1.80 0.106
summary(ols)
variable data.Length data.Class data.Mode model.Length model.Class model.Mode term
BanTES : 2 2 tbl_df list 12 lm list Length:48
SriTES : 2 2 tbl_df list 12 lm list Class :character
AfgTES : 2 2 tbl_df list 12 lm list Mode :character
BhuTES : 2 2 tbl_df list 12 lm list
IndTES : 2 2 tbl_df list 12 lm list
NepTES : 2 2 tbl_df list 12 lm list
(Other):36 2 tbl_df list 12 lm list
任何帮助都会很有用。提前谢谢你!
【问题讨论】:
help("summary.lm")
没有关于 p 值和 z 统计的内容
有(当然它是 t 统计量而不是 z 统计量,因为它应该是)。
我用 excel 检查了估算值。 excel和R软件的OLS斜率值不匹配。为什么会这样?
请provide a minimal reproducible example。
【参考方案1】:
线性回归不会像@Roland 正确评论的那样为您提供 Z 统计量,而是线性回归为您提供 t 统计量。您可以使用以下代码将coeff、t统计量和p值保存为excel格式
library(tidyverse)
library(broom)
library(openxlsx)
# read ozone data (I converted to a text file first)
otm <- read.table("data.txt", header=T)
head(otm, 2)
otm %>%
pivot_longer(-c("Year")) %>%
group_by(name) %>%
do(fitlm = tidy(lm(value ~ Year, data = ., na.action=na.omit))) %>%
unnest(fitlm) %>%
write.xlsx("Linear_trend_1.xlsx")
在“Linear_trend_1.xlsx”文件中,每个位置的年份行为您提供斜率,统计量为 t 统计量。
第一种方法的值与第二种方法不匹配,因为第一种方法使用markers
作为因变量,而otm_df
中的所有变量都用作自变量。在第二种方法中,Year
被用作自变量,而所有位置值都是因变量。
另一件事,您正在使用sample
函数创建markers
,这将为您每次运行提供不同的输出。所以你必须使用set.seed
让它在每次运行时都保持不变
set.seed(123)
otm %>%
mutate(markers = sample(0:1, replace = T, size = 11)) %>%
pivot_longer(-c("Year", "markers")) %>%
group_by(name) %>%
do(fitlm = tidy(lm(markers ~ value, data = ., na.action=na.omit))) %>%
unnest(fitlm) %>%
write.xlsx("Linear_trend_2.xlsx")
# A tibble: 48 x 6
name term estimate std.error statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 AfgERA (Intercept) -8.58 7.79 -1.10 0.299
2 AfgERA value 0.577 0.498 1.16 0.276
3 AfgTES (Intercept) -1.62 2.99 -0.542 0.601
4 AfgTES value 0.0521 0.0750 0.695 0.505
5 AfgTOM (Intercept) -25.3 19.5 -1.30 0.225
6 AfgTOM value 1.94 1.46 1.33 0.218
7 BanERA (Intercept) -2.28 5.03 -0.453 0.662
8 BanERA value 0.201 0.370 0.543 0.600
9 BanTES (Intercept) -3.42 2.80 -1.22 0.253
10 BanTES value 0.0892 0.0644 1.39 0.199
# ... with 38 more rows
如果我像下面这样修改你的第一种方法,你可以看到上面代码的输出和你的代码基本相同
# make a data frame version
otm_df <- data.frame(otm)
set.seed(123) #To have same output from sample function for every run
markers <- sample(0:1, replace = T, size = 11)
# calculate OLS slope for all columns
# the -1 at end removes the intercepts
ols <- sapply(otm_df, function(x) coef(lm(markers ~ x))[-1])
Year.x BanTES.x SriTES.x AfgTES.x BhuTES.x IndTES.x
0.054545455 0.089176876 0.092629314 0.052132553 0.001602003 0.107446434
NepTES.x PakTES.x SATES.x BanERA.x SriERA.x AfgERA.x
0.065125607 -0.115108438 0.315928753 0.200712285 -0.519996739 0.577317012
BhuERA.x IndERA.x NepERA.x PakERA.x SAERA.x BanTOM.x
-0.140990921 0.110578204 -0.030546686 0.265909319 0.176797510 1.897234338
SriTOM.x AfgTOM.x BhuTOM.x IndTOm.x NepTOM.x PakTOM.x
5.380610477 1.935170281 1.761172711 2.248531891 2.107380452 2.011356580
SATOM.x
2.214848959
这是dput
格式的数据
dput(otm)
structure(list(Year = 2008:2018, BanTES = c(44.06247376, 43.81239107,
40.68010622, 46.97760506, 37.49591135, 43.81239107, 43.81239107,
43.81239107, 43.81239107, 45.27803189, 44.06247376), SriTES = c(32.07268265,
35.01918477, 29.91018035, 34.1291577, 28.5258431, 32.07268265,
32.07268265, 32.07268265, 32.07268265, 30.96753552, 32.07268265
), AfgTES = c(39.19328867, 42.06325898, 42.31015918, 40.54543762,
34.28696385, 40.54543762, 40.54543762, 40.54543762, 40.54543762,
37.38974643, 39.19328867), BhuTES = c(34.08241824, 36.95440954,
30.41561338, 30.37004394, 19.8861367, 30.41561338, 30.41561338,
30.41561338, 30.41561338, 32.09933763, 32.09933763), IndTES = c(40.05913352,
41.54741392, 38.88957844, 42.47544504, 43.24350644, 41.54741392,
41.54741392, 41.54741392, 41.54741392, 42.65820983, 42.47544504
), NepTES = c(38.12979871, 37.62785275, 34.40488247, 37.7995467,
39.64286364, 37.7995467, 37.7995467, 37.7995467, 37.7995467,
30.63632105, 37.7995467), PakTES = c(41.38388734, 41.99865359,
42.16236093, 42.51838941, 43.4444952, 42.16236093, 42.16236093,
42.16236093, 42.16236093, 44.96627251, 42.51838941), SATES = c(40.03077,
41.52302, 39.6327, 41.9098, 41.11191, 41.11191, 41.11191, 41.11191,
41.11191, 41.57009, 41.52302), BanERA = c(13.76686693, 13.904453,
13.40584856, 13.45199721, 13.47657436, 12.8992102, 13.3586098,
14.23223365, 13.4228729, 13.21487616, 14.50830571), SriERA = c(11.81852768,
11.79187354, 11.51484349, 11.50552588, 11.489789, 11.23384852,
10.61182708, 11.33951759, 11.6357584, 11.74685028, 12.14987906
), AfgERA = c(15.44115983, 15.425995, 15.6161623, 15.47751927,
15.81748069, 15.47498417, 15.41748855, 16.06462541, 15.61143062,
15.32810621, 16.39162424), BhuERA = c(14.34493453, 14.28085419,
14.24728543, 14.03202106, 14.04152053, 13.42977221, 13.22665229,
14.344052, 13.58792484, 13.28851619, 14.28029524), IndERA = c(14.08262362,
14.11485037, 13.80713493, 13.86114379, 13.92607879, 13.37996473,
13.45767152, 14.49365275, 13.88142768, 13.73986257, 14.77032404
), NepERA = c(14.93883379, 14.896056, 14.78607828, 14.50880606,
14.69793299, 13.96309811, 14.18825383, 15.32530354, 14.38700954,
13.98545482, 14.9828434), PakERA = c(14.89773191, 14.87337075,
14.89837223, 14.76236826, 15.13918051, 14.61385609, 14.54589641,
15.40150813, 14.8588883, 14.62185208, 15.6575491), SAERA = c(14.38877,
14.40468, 14.22069, 14.20561, 14.35855, 13.8399, 13.88027, 14.83054,
14.24554, 14.06201, 15.09615), BanTOM = c(9.317937851, 9.308046341,
9.327401161, 9.319338799, 9.311285019, 9.300317764, 9.292790413,
9.540946007, 9.04840374, 9.300317764, 9.317937851), SriTOM = c(5.437336445,
5.436554909, 5.435492039, 5.440690994, 5.436693192, 5.440601349,
5.427892685, 5.54946661, 5.427827358, 5.440601349, 5.437336445
), AfgTOM = c(13.31581736, 13.30339324, 13.30090284, 13.29781604,
13.33800817, 13.31919873, 13.30073023, 13.62503445, 13.16488469,
13.30073023, 13.31581736), BhuTOM = c(11.69911337, 11.67375898,
11.71142626, 11.69099903, 11.68556881, 11.68714046, 11.65387106,
11.97064924, 11.44872904, 11.67375898, 11.69099903), IndTOm = c(9.709311898,
9.704142364, 9.703938368, 9.72520479, 9.709638531, 9.708799697,
9.690851817, 9.952517961, 9.499369441, 9.704142364, 9.709638531
), NepTOM = c(12.45835066, 12.43677187, 12.48850822, 12.49002218,
12.46283817, 12.50376368, 12.44072294, 12.78685617, 12.27891684,
12.44072294, 12.46283817), PakTOM = c(12.37911913, 12.38028261,
12.37067625, 12.38315158, 12.38352468, 12.36856567, 12.37349086,
12.67422019, 12.18962786, 12.37349086, 12.38315158), SATOM = c(10.63543,
10.62967, 10.62981, 10.64489, 10.63941, 10.63525, 10.6195, 10.89613,
10.44028, 10.62967, 10.63941)), class = "data.frame", row.names = c(NA,
-11L))
【讨论】:
谢谢大家 我只是在检查结果,似乎第一种方法给出了 OLS 的正确结果,而不是第二种方法。第一种方法的 BanTES 的斜率值为 0.138595178(正确),而第二种方法给出的值为 0.089176876,这是不正确的。 其实第一种方法是对的。但是为了与您的问题保持相似性,我展示了第二种方法。您可以根据需要获取输出。【参考方案2】:您可以使用glance
代替tidy
作为p 值:
ols <- DF %>%
nest(data = -variable) %>%
mutate(model = map(data, ~lm(value ~ Year, data = .)),
tidied = map(model, glance)) %>%
unnest(tidied)
> ols
# A tibble: 24 x 15
variable data model r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
<fct> <list> <list> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 BanTES <tibble [11 x 2]> <lm> 0.0350 -0.0722 2.54 0.327 0.582 1 -24.8 55.5 56.7 58.2 9 11
2 SriTES <tibble [11 x 2]> <lm> 0.0309 -0.0767 1.83 0.287 0.605 1 -21.2 48.3 49.5 30.1 9 11
3 AfgTES <tibble [11 x 2]> <lm> 0.0559 -0.0490 2.32 0.533 0.484 1 -23.7 53.5 54.7 48.2 9 11
4 BhuTES <tibble [11 x 2]> <lm> 0.0185 -0.0905 4.33 0.170 0.690 1 -30.6 67.3 68.4 169. 9 11
5 IndTES <tibble [11 x 2]> <lm> 0.264 0.183 1.11 3.23 0.106 1 -15.7 37.3 38.5 11.1 9 11
6 NepTES <tibble [11 x 2]> <lm> 0.0689 -0.0345 2.49 0.666 0.435 1 -24.5 55.0 56.2 55.6 9 11
7 PakTES <tibble [11 x 2]> <lm> 0.243 0.159 0.872 2.89 0.123 1 -13.0 32.0 33.2 6.84 9 11
8 SATES <tibble [11 x 2]> <lm> 0.221 0.135 0.625 2.56 0.144 1 -9.34 24.7 25.9 3.52 9 11
9 BanERA <tibble [11 x 2]> <lm> 0.0252 -0.0831 0.482 0.233 0.641 1 -6.49 19.0 20.2 2.09 9 11
10 SriERA <tibble [11 x 2]> <lm> 0.00230 -0.109 0.416 0.0208 0.889 1 -4.87 15.7 16.9 1.56 9 11
【讨论】:
谢谢,但是斜率值在哪里?它是什么统计值?是 Z 统计量还是 t 统计量?如果它不是统计数据,那么为什么您估计的值与我得出的值不匹配?如何使用系数、z 统计量和 p 值将结果保存为 excel 格式? @Bappa Das 和 @Roland 回答了 Z-stat 和 t-stat 问题。对于斜率和截距,您可以返回tidy
而不是 glance
。为了保存结果,我通常更喜欢write.csv2
函数。【参考方案3】:
关于将数据保存到 Excel,如果您的数据在 tibble 或 data.frame 中,则可以使用该函数:
readr::write_csv
将数据保存为 CSV 文件,可以在 Excel 中轻松打开;
writexl::write_xlsx
将数据保存为原生 Excel 文件。
这两个函数都会将数据中的所有行和列保存到文件中。
【讨论】:
以上是关于查找 p 值和 z 统计量以及 OLS 线性回归的主要内容,如果未能解决你的问题,请参考以下文章
Mann -whitney 检验时有统计量U值和Z值,发表文章时用啥值比较好
R语言普通最小二乘(OLS)回归说明以及构建普通最小二乘(OLS)回归需要满足的四个假设(Normality(正态性)Independence(独立性)Linearity(线性度)方差齐性)