Variable Selection: Implementing lasso, SCAD, and MCP in R
Posted by 统计学小王子
0. Introduction
Since the lasso was proposed in 1996, many researchers have developed efficient algorithms for fitting it under different models, such as the multiple linear model, the Cox regression model, and generalized linear models. Many ways of selecting the penalty parameter have also been proposed, e.g. CV, AIC, and BIC, along with many penalty types: lasso, adaptive lasso, elastic net, SCAD, and MCP.
This post mainly introduces three packages: glmnet, ncvreg, and msaenet. The main functions and methods of each package are summarized in the following table:
Function | Models (`family`) | Penalty-parameter selection | Penalty type | Package | Elastic net? |
---|---|---|---|---|---|
glmnet | "gaussian", "binomial", "poisson", "multinomial", "cox", "mgaussian" | — | lasso | glmnet | Yes |
cv.glmnet | "gaussian", "binomial", "poisson", "multinomial", "cox", "mgaussian" | CV | lasso | glmnet | Yes |
aenet | "gaussian", "binomial", "poisson", "cox" | "cv", "ebic", "bic", "aic" | adaptive lasso | msaenet | Yes |
asnet | "gaussian", "binomial", "poisson", "cox" | "cv", "ebic", "bic", "aic" | SCAD | msaenet | Yes |
amnet | "gaussian", "binomial", "poisson", "cox" | "cv", "ebic", "bic", "aic" | MCP | msaenet | Yes |
ncvreg | "gaussian", "binomial", "poisson" | — | "MCP", "SCAD", "lasso" | ncvreg | No |
cv.ncvreg | "gaussian", "binomial", "poisson" | CV | "MCP", "SCAD", "lasso" | ncvreg | No |
There are many packages for lasso-type variable selection; the table above lists only the three I use most often. If you know of others, feel free to recommend them in the comments or by private message so we can all broaden our knowledge.
The table gives the main arguments and supported models for each package and function. Readers familiar with R can easily install the packages and fit the model they want, but this may be harder for beginners. To help most readers get started quickly, the sections below generate suitable simulated data and, taking Gaussian errors and the lasso (or adaptive lasso) as an example with CV as the method for selecting the penalty parameter, give a worked example for each package.
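All three packages are on CRAN; a minimal setup sketch before running the examples that follow:

```r
# Install the three packages from CRAN (only needed once)
install.packages(c("glmnet", "ncvreg", "msaenet"))

# The examples below call functions with the pkg::fun syntax,
# so library() calls are optional; loading a package anyway makes
# its plot() and coef() methods easier to discover interactively.
library(glmnet)
```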
1. glmnet
1.1 Generating the data
All data used in this post is generated in this section; the generating code is:
> # R version 3.6.3
> Beta <- cbind(c(1,1,1,1,0,0)) # the true beta
> n <- 500 # sample size
> set.seed(1234) # set a random seed so the results are reproducible
> x1 <- rnorm(n)
> x2 <- rnorm(n)
> x3 <- rnorm(n)
> x4 <- rnorm(n)
> x5 <- rnorm(n)
> x6 <- rnorm(n)
> x <- cbind(x1, x2, x3, x4, x5, x6)
> y <- x %*% Beta + rnorm(n, 0, 0.5)
>
> # a look at the data
> head(x, 10)
x1 x2 x3 x4 x5 x6
[1,] -1.2070657 0.98477997 -1.2053334 1.7940745 -0.9738186 -1.3662376
[2,] 0.2774292 -1.22473788 0.3014667 -1.3645489 -0.0996312 0.5392093
[3,] 1.0844412 0.70972622 -1.5391452 -0.7074400 -0.1107350 -1.3219320
[4,] -2.3456977 -0.10921999 0.6353707 -0.5562843 1.1921946 -0.2812887
[5,] 0.4291247 1.78260790 0.7029518 -0.3100811 -1.6558859 -2.1049469
[6,] 0.5060559 -0.24344468 -1.9058829 -0.3761793 -1.0456433 -1.6176047
[7,] -0.5747400 -1.52710702 0.9389214 0.4585260 -1.7402391 -0.7237319
[8,] -0.5466319 0.49183437 -0.2244921 -1.2611491 0.5131208 0.3067410
[9,] -0.5644520 0.35450366 -0.6738168 -0.5274652 -0.4459568 0.2255962
[10,] -0.8900378 -0.01762635 0.4457874 -0.5568142 -1.8391938 0.9357160
> head(y, 10)
[,1]
[1,] 0.1739191
[2,] -1.7533630
[3,] -0.2984157
[4,] -1.4562544
[5,] 3.4013056
[6,] -2.1985494
[7,] -1.2608381
[8,] -1.2924795
[9,] -2.0985074
[10,] -0.7405859
1.2 glmnet::cv.glmnet
> (fit <- glmnet::cv.glmnet(x, y))
Call: glmnet::cv.glmnet(x = x, y = y)
Measure: Mean-Squared Error
Lambda Index Measure SE Nonzero
min 0.01019 52 0.2512 0.009655 5
1se 0.05439 34 0.2605 0.010606 4
> summary(fit)
Length Class Mode
lambda 57 -none- numeric
cvm 57 -none- numeric
cvsd 57 -none- numeric
cvup 57 -none- numeric
cvlo 57 -none- numeric
nzero 57 -none- numeric
call 3 -none- call
name 1 -none- character
glmnet.fit 12 elnet list
lambda.min 1 -none- numeric
lambda.1se 1 -none- numeric
index 2 -none- numeric
> str(fit)
List of 12
$ lambda : num [1:57] 1.172 1.068 0.973 0.886 0.808 ...
$ cvm : num [1:57] 4.46 4.23 3.79 3.23 2.73 ...
$ cvsd : num [1:57] 0.369 0.376 0.364 0.323 0.272 ...
$ cvup : num [1:57] 4.83 4.6 4.16 3.56 3 ...
$ cvlo : num [1:57] 4.09 3.85 3.43 2.91 2.45 ...
$ nzero : Named int [1:57] 0 2 4 4 4 4 4 4 4 4 ...
..- attr(*, "names")= chr [1:57] "s0" "s1" "s2" "s3" ...
$ call : language glmnet::cv.glmnet(x = x, y = y)
$ name : Named chr "Mean-Squared Error"
..- attr(*, "names")= chr "mse"
$ glmnet.fit:List of 12
..$ a0 : Named num [1:57] -0.000854 -0.000831 0.004037 0.005898 0.007593 ...
.. ..- attr(*, "names")= chr [1:57] "s0" "s1" "s2" "s3" ...
..$ beta :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
.. .. ..@ i : int [1:236] 0 1 0 1 2 3 0 1 2 3 ...
.. .. ..@ p : int [1:58] 0 0 2 6 10 14 18 22 26 30 ...
.. .. ..@ Dim : int [1:2] 6 57
.. .. ..@ Dimnames:List of 2
.. .. .. ..$ : chr [1:6] "x1" "x2" "x3" "x4" ...
.. .. .. ..$ : chr [1:57] "s0" "s1" "s2" "s3" ...
.. .. ..@ x : num [1:236] 0.10041 0.00378 0.18545 0.09365 0.0017 ...
.. .. ..@ factors : list()
..$ df : int [1:57] 0 2 4 4 4 4 4 4 4 4 ...
..$ dim : int [1:2] 6 57
..$ lambda : num [1:57] 1.172 1.068 0.973 0.886 0.808 ...
..$ dev.ratio: num [1:57] 0 0.0537 0.1566 0.2905 0.4016 ...
..$ nulldev : num 2235
..$ npasses : int 206
..$ jerr : int 0
..$ offset : logi FALSE
..$ call : language glmnet(x = x, y = y)
..$ nobs : int 500
..- attr(*, "class")= chr [1:2] "elnet" "glmnet"
$ lambda.min: num 0.0102
$ lambda.1se: num 0.0544
$ index : int [1:2, 1] 52 34
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:2] "min" "1se"
.. ..$ : chr "Lambda"
- attr(*, "class")= chr "cv.glmnet"
> coef(fit)
7 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 0.02380925
x1 0.94151352
x2 0.98011080
x3 0.94789578
x4 0.93311113
x5 .
x6 .
>
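The `cv.glmnet` object stores the whole lambda path, so `coef` and `predict` accept an `s` argument to choose the penalty level; by default they use `lambda.1se`. A short sketch continuing from the `fit` and `x` objects above:

```r
# Coefficients at the lambda that minimizes CV error
coef(fit, s = "lambda.min")

# Coefficients at the default, more parsimonious lambda.1se
coef(fit, s = "lambda.1se")

# Fitted values for the first five observations at lambda.min
predict(fit, newx = x[1:5, ], s = "lambda.min")

# The cross-validation error curve can also be inspected visually
plot(fit)
```

Note that `lambda.1se` (the largest lambda within one standard error of the minimum) selected 4 nonzero coefficients above, matching the four truly nonzero entries of `Beta`, while `lambda.min` kept a fifth spurious variable.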
2. msaenet
> fit <- msaenet::aenet(x, y)
> summary(fit)
Length Class Mode
beta 6 dgCMatrix S4
model 12 elnet list
beta.first 6 dgCMatrix S4
model.first 12 elnet list
best.alpha.enet 1 -none- numeric
best.alpha.aenet 1 -none- numeric
best.lambda.enet 1 -none- numeric
best.lambda.aenet 1 -none- numeric
step.criterion 2 -none- numeric
adpen 6 -none- numeric
seed 1 -none- numeric
call 3 -none- call
> str(fit)
List of 12
$ beta :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
.. ..@ i : int [1:4] 0 1 2 3
.. ..@ p : int [1:2] 0 4
.. ..@ Dim : int [1:2] 6 1
.. ..@ Dimnames:List of 2
.. .. ..$ : chr [1:6] "x1" "x2" "x3" "x4" ...
.. .. ..$ : chr "s0"
.. ..@ x : num [1:4] 0.98 1.026 0.997 0.978
.. ..@ factors : list()
$ model :List of 12
..$ a0 : Named num 0.0248
.. ..- attr(*, "names")= chr "s0"
..$ beta :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
.. .. ..@ i : int [1:4] 0 1 2 3
.. .. ..@ p : int [1:2] 0 4
.. .. ..@ Dim : int [1:2] 6 1
.. .. ..@ Dimnames:List of 2
.. .. .. ..$ : chr [1:6] "x1" "x2" "x3" "x4" ...
.. .. .. ..$ : chr "s0"
.. .. ..@ x : num [1:4] 0.98 1.026 0.997 0.978
.. .. ..@ factors : list()
..$ df : int 4
..$ dim : int [1:2] 6 1
..$ lambda : num 1.11e+13
..$ dev.ratio: num 0.945
..$ nulldev : num 2235
..$ npasses : int 5
..$ jerr : int 0
..$ offset : logi FALSE
..$ call : language glmnet(x = x, y = y, family = family, alpha = best.alpha.aenet, lambda = best.lambda.aenet, penalty.factor =| __truncated__
..$ nobs : int 500
..- attr(*, "class")= chr [1:2] "elnet" "glmnet"
$ beta.first :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
.. ..@ i : int [1:5] 0 1 2 3 5
.. ..@ p : int [1:2] 0 5
.. ..@ Dim : int [1:2] 6 1
.. ..@ Dimnames:List of 2
.. .. ..$ : chr [1:6] "x1" "x2" "x3" "x4" ...
.. .. ..$ : chr "s0"
.. ..@ x : num [1:5] 0.98 1.027 0.999 0.978 -0.017
.. ..@ factors : list()
$ model.first :List of 12
..$ a0 : Named num 0.025
.. ..- attr(*, "names")= chr "s0"
..$ beta :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
.. .. ..@ i : int [1:5] 0 1 2 3 5
.. .. ..@ p : int [1:2] 0 5
.. .. ..@ Dim : int [1:2] 6 1
.. .. ..@ Dimnames:List of 2
.. .. .. ..$ : chr [1:6] "x1" "x2" "x3" "x4" ...
.. .. .. ..$ : chr "s0"
.. .. ..@ x : num [1:5] 0.98 1.027 0.999 0.978 -0.017
.. .. ..@ factors : list()
..$ df : int 5
..$ dim : int [1:2] 6 1
..$ lambda : num 0.00686
..$ dev.ratio: num 0.945
..$ nulldev : num 2235
..$ npasses : int 5
..$ jerr : int 0
..$ offset : logi FALSE
..$ call : language glmnet(x = x, y = y, family = family, alpha = best.alpha.enet, lambda = best.lambda.enet, penalty.factor = p| __truncated__ ...
..$ nobs : int 500
..- attr(*, "class")= chr [1:2] "elnet" "glmnet"
$ best.alpha.enet : num 0.85
$ best.alpha.aenet : num 0.05
$ best.lambda.enet : num 0.00686
$ best.lambda.aenet: num 1.11e+13
$ step.criterion : num [1:2] 0.501 0.502
$ adpen : num [1:6, 1] 1.02 9.73e-01 1.00 1.02 4.50e+15 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:6] "x1" "x2" "x3" "x4" ...
.. ..$ : chr "s0"
$ seed : num 1001
$ call : language msaenet::aenet(x = x, y = y)
- attr(*, "class")= chr [1:2] "msaenet" "msaenet.aenet"
> coef(fit)
[1] 0.9799217 1.0258625 0.9968319 0.9781661 0.0000000 0.0000000
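The SCAD and MCP counterparts in msaenet are `asnet` and `amnet` from the table, and they follow the same interface as `aenet`. A minimal sketch, assuming the same `x` and `y` as above (with the default `tune = "cv"`):

```r
# Adaptive SCAD-net: same call pattern as aenet
fit_scad <- msaenet::asnet(x, y)
coef(fit_scad)

# Adaptive MCP-net
fit_mcp <- msaenet::amnet(x, y)
coef(fit_mcp)
```

With this well-separated simulated data, both penalties should likewise recover the four truly nonzero coefficients.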
3. ncvreg
> fit <- ncvreg::cv.ncvreg(x, y)
> summary(fit)
MCP-penalized linear regression with n=500, p=6
At minimum cross-validation error (lambda=0.1781):
-------------------------------------------------
Nonzero coefficients: 4
Cross-validation error (deviance): 0.25
R-squared: 0.94
Signal-to-noise ratio: 16.69
Scale estimate (sigma): 0.503
> str(fit)
List of 9
$ cve : num [1:100] 4.45 4.25 3.91 3.36 2.76 ...
$ cvse : num [1:100] 0.278 0.267 0.246 0.211 0.174 ...
$ fold : num [1:500] 2 2 5 ...
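The `str` output above is truncated in the original post. For completeness, a sketch of how the fitted `cv.ncvreg` object is typically used; these are standard ncvreg methods, shown without their output:

```r
# Coefficients at the lambda minimizing CV error (MCP is the default penalty)
coef(fit)

# The selected lambda is stored in the object
fit$lambda.min

# SCAD or lasso penalties can be requested explicitly via the penalty argument
fit_scad <- ncvreg::cv.ncvreg(x, y, penalty = "SCAD")
coef(fit_scad)

# Cross-validation error curve
plot(fit)
```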