Variable Selection: Implementing lasso, SCAD, and MCP in R
Posted by 统计学小王子
0. Introduction
Since the lasso was proposed in 1996, many researchers have developed efficient algorithms for fitting it under different models, such as the multiple linear model, the Cox regression model, and generalized linear models. Many ways of selecting the penalty parameter have also been proposed, e.g. CV, AIC, and BIC, along with many penalty types: lasso, adaptive lasso, elastic net, SCAD, and MCP.
This post mainly introduces three packages: glmnet, ncvreg, and msaenet. The main functions and methods of each package are summarized in the following table:
Function | Models (`family`) | Penalty-parameter selection | Penalty type | Package | Elastic net? |
---|---|---|---|---|---|
glmnet | "gaussian", "binomial", "poisson", "multinomial", "cox", "mgaussian" | — | lasso | glmnet | Yes |
cv.glmnet | "gaussian", "binomial", "poisson", "multinomial", "cox", "mgaussian" | CV | lasso | glmnet | Yes |
aenet | "gaussian", "binomial", "poisson", "cox" | "cv", "ebic", "bic", "aic" | adaptive lasso | msaenet | Yes |
asnet | "gaussian", "binomial", "poisson", "cox" | "cv", "ebic", "bic", "aic" | SCAD | msaenet | Yes |
amnet | "gaussian", "binomial", "poisson", "cox" | "cv", "ebic", "bic", "aic" | MCP | msaenet | Yes |
ncvreg | "gaussian", "binomial", "poisson" | — | "MCP", "SCAD", "lasso" | ncvreg | No |
cv.ncvreg | "gaussian", "binomial", "poisson" | CV | "MCP", "SCAD", "lasso" | ncvreg | No |
There are many packages for lasso-type variable selection; the table above lists only the three I use most often. If you know of others, feel free to recommend them in the comments or by private message so we can all broaden our knowledge.
The table gives the main arguments and supported models for each package and function. Readers familiar with R can easily install the packages and fit the model they want, but this may be harder for beginners. To help most readers get started quickly, the sections below generate suitable simulated data and, taking Gaussian errors and the lasso (or adaptive lasso) as an example with CV as the method for selecting the penalty parameter, give a worked example for each package.
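All three packages are on CRAN; a minimal setup sketch before running the examples that follow:

```r
# Install the three packages from CRAN (only needed once)
install.packages(c("glmnet", "ncvreg", "msaenet"))

# The examples below call functions with the pkg::fun syntax,
# so library() calls are optional; loading a package anyway makes
# its plot() and coef() methods easier to discover interactively.
library(glmnet)
```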
1. glmnet
1.1 Generating the data
All data used in this post is generated in this section; the generating code is:
> # R version 3.6.3
> Beta <- cbind(c(1,1,1,1,0,0)) # the true beta
> n <- 500 # sample size
> set.seed(1234) # set a random seed so the results are reproducible
> x1 <- rnorm(n)
> x2 <- rnorm(n)
> x3 <- rnorm(n)
> x4 <- rnorm(n)
> x5 <- rnorm(n)
> x6 <- rnorm(n)
> x <- cbind(x1, x2, x3, x4, x5, x6)
> y <- x %*% Beta + rnorm(n, 0, 0.5)
>
> # a look at the data
> head(x, 10)
x1 x2 x3 x4 x5 x6
[1,] -1.2070657 0.98477997 -1.2053334 1.7940745 -0.9738186 -1.3662376
[2,] 0.2774292 -1.22473788 0.3014667 -1.3645489 -0.0996312 0.5392093
[3,] 1.0844412 0.70972622 -1.5391452 -0.7074400 -0.1107350 -1.3219320
[4,] -2.3456977 -0.10921999 0.6353707 -0.5562843 1.1921946 -0.2812887
[5,] 0.4291247 1.78260790 0.7029518 -0.3100811 -1.6558859 -2.1049469
[6,] 0.5060559 -0.24344468 -1.9058829 -0.3761793 -1.0456433 -1.6176047
[7,] -0.5747400 -1.52710702 0.9389214 0.4585260 -1.7402391 -0.7237319
[8,] -0.5466319 0.49183437 -0.2244921 -1.2611491 0.5131208 0.3067410
[9,] -0.5644520 0.35450366 -0.6738168 -0.5274652 -0.4459568 0.2255962
[10,] -0.8900378 -0.01762635 0.4457874 -0.5568142 -1.8391938 0.9357160
> head(y, 10)
[,1]
[1,] 0.1739191
[2,] -1.7533630
[3,] -0.2984157
[4,] -1.4562544
[5,] 3.4013056
[6,] -2.1985494
[7,] -1.2608381
[8,] -1.2924795
[9,] -2.0985074
[10,] -0.7405859
1.2 glmnet::cv.glmnet
> (fit <- glmnet::cv.glmnet(x, y))
Call: glmnet::cv.glmnet(x = x, y = y)
Measure: Mean-Squared Error
Lambda Index Measure SE Nonzero
min 0.01019 52 0.2512 0.009655 5
1se 0.05439 34 0.2605 0.010606 4
> summary(fit)
Length Class Mode
lambda 57 -none- numeric
cvm 57 -none- numeric
cvsd 57 -none- numeric
cvup 57 -none- numeric
cvlo 57 -none- numeric
nzero 57 -none- numeric
call 3 -none- call
name 1 -none- character
glmnet.fit 12 elnet list
lambda.min 1 -none- numeric
lambda.1se 1 -none- numeric
index 2 -none- numeric
> str(fit)
List of 12
$ lambda : num [1:57] 1.172 1.068 0.973 0.886 0.808 ...
$ cvm : num [1:57] 4.46 4.23 3.79 3.23 2.73 ...
$ cvsd : num [1:57] 0.369 0.376 0.364 0.323 0.272 ...
$ cvup : num [1:57] 4.83 4.6 4.16 3.56 3 ...
$ cvlo : num [1:57] 4.09 3.85 3.43 2.91 2.45 ...
$ nzero : Named int [1:57] 0 2 4 4 4 4 4 4 4 4 ...
..- attr(*, "names")= chr [1:57] "s0" "s1" "s2" "s3" ...
$ call : language glmnet::cv.glmnet(x = x, y = y)
$ name : Named chr "Mean-Squared Error"
..- attr(*, "names")= chr "mse"
$ glmnet.fit:List of 12
..$ a0 : Named num [1:57] -0.000854 -0.000831 0.004037 0.005898 0.007593 ...
.. ..- attr(*, "names")= chr [1:57] "s0" "s1" "s2" "s3" ...
..$ beta :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
.. .. ..@ i : int [1:236] 0 1 0 1 2 3 0 1 2 3 ...
.. .. ..@ p : int [1:58] 0 0 2 6 10 14 18 22 26 30 ...
.. .. ..@ Dim : int [1:2] 6 57
.. .. ..@ Dimnames:List of 2
.. .. .. ..$ : chr [1:6] "x1" "x2" "x3" "x4" ...
.. .. .. ..$ : chr [1:57] "s0" "s1" "s2" "s3" ...
.. .. ..@ x : num [1:236] 0.10041 0.00378 0.18545 0.09365 0.0017 ...
.. .. ..@ factors : list()
..$ df : int [1:57] 0 2 4 4 4 4 4 4 4 4 ...
..$ dim : int [1:2] 6 57
..$ lambda : num [1:57] 1.172 1.068 0.973 0.886 0.808 ...
..$ dev.ratio: num [1:57] 0 0.0537 0.1566 0.2905 0.4016 ...
..$ nulldev : num 2235
..$ npasses : int 206
..$ jerr : int 0
..$ offset : logi FALSE
..$ call : language glmnet(x = x, y = y)
..$ nobs : int 500
..- attr(*, "class")= chr [1:2] "elnet" "glmnet"
$ lambda.min: num 0.0102
$ lambda.1se: num 0.0544
$ index : int [1:2, 1] 52 34
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:2] "min" "1se"
.. ..$ : chr "Lambda"
- attr(*, "class")= chr "cv.glmnet"
> coef(fit)
7 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 0.02380925
x1 0.94151352
x2 0.98011080
x3 0.94789578
x4 0.93311113
x5 .
x6 .
>
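The `cv.glmnet` object stores the whole lambda path, so `coef` and `predict` accept an `s` argument to choose the penalty level; by default they use `lambda.1se`. A short sketch continuing from the `fit` and `x` objects above:

```r
# Coefficients at the lambda that minimizes CV error
coef(fit, s = "lambda.min")

# Coefficients at the default, more parsimonious lambda.1se
coef(fit, s = "lambda.1se")

# Fitted values for the first five observations at lambda.min
predict(fit, newx = x[1:5, ], s = "lambda.min")

# The cross-validation error curve can also be inspected visually
plot(fit)
```

Note that `lambda.1se` (the largest lambda within one standard error of the minimum) selected 4 nonzero coefficients above, matching the four truly nonzero entries of `Beta`, while `lambda.min` kept a fifth spurious variable.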
2. msaenet
> fit <- msaenet::aenet(x, y)
> summary(fit)
Length Class Mode
beta 6 dgCMatrix S4
model 12 elnet list
beta.first 6 dgCMatrix S4
model.first 12 elnet list
best.alpha.enet 1 -none- numeric
best.alpha.aenet 1 -none- numeric
best.lambda.enet 1 -none- numeric
best.lambda.aenet 1 -none- numeric
step.criterion 2 -none- numeric
adpen 6 -none- numeric
seed 1 -none- numeric
call 3 -none- call
> str(fit)
List of 12
$ beta :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
.. ..@ i : int [1:4] 0 1 2 3
.. ..@ p : int [1:2] 0 4
.. ..@ Dim : int [1:2] 6 1
.. ..@ Dimnames:List of 2
.. .. ..$ : chr [1:6] "x1" "x2" "x3" "x4" ...
.. .. ..$ : chr "s0"
.. ..@ x : num [1:4] 0.98 1.026 0.997 0.978
.. ..@ factors : list()
$ model :List of 12
..$ a0 : Named num 0.0248
.. ..- attr(*, "names")= chr "s0"
..$ beta :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
.. .. ..@ i : int [1:4] 0 1 2 3
.. .. ..@ p : int [1:2] 0 4
.. .. ..@ Dim : int [1:2] 6 1
.. .. ..@ Dimnames:List of 2
.. .. .. ..$ : chr [1:6] "x1" "x2" "x3" "x4" ...
.. .. .. ..$ : chr "s0"
.. .. ..@ x : num [1:4] 0.98 1.026 0.997 0.978
.. .. ..@ factors : list()
..$ df : int 4
..$ dim : int [1:2] 6 1
..$ lambda : num 1.11e+13
..$ dev.ratio: num 0.945
..$ nulldev : num 2235
..$ npasses : int 5
..$ jerr : int 0
..$ offset : logi FALSE
..$ call : language glmnet(x = x, y = y, family = family, alpha = best.alpha.aenet, lambda = best.lambda.aenet, penalty.factor =| __truncated__
..$ nobs : int 500
..- attr(*, "class")= chr [1:2] "elnet" "glmnet"
$ beta.first :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
.. ..@ i : int [1:5] 0 1 2 3 5
.. ..@ p : int [1:2] 0 5
.. ..@ Dim : int [1:2] 6 1
.. ..@ Dimnames:List of 2
.. .. ..$ : chr [1:6] "x1" "x2" "x3" "x4" ...
.. .. ..$ : chr "s0"
.. ..@ x : num [1:5] 0.98 1.027 0.999 0.978 -0.017
.. ..@ factors : list()
$ model.first :List of 12
..$ a0 : Named num 0.025
.. ..- attr(*, "names")= chr "s0"
..$ beta :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
.. .. ..@ i : int [1:5] 0 1 2 3 5
.. .. ..@ p : int [1:2] 0 5
.. .. ..@ Dim : int [1:2] 6 1
.. .. ..@ Dimnames:List of 2
.. .. .. ..$ : chr [1:6] "x1" "x2" "x3" "x4" ...
.. .. .. ..$ : chr "s0"
.. .. ..@ x : num [1:5] 0.98 1.027 0.999 0.978 -0.017
.. .. ..@ factors : list()
..$ df : int 5
..$ dim : int [1:2] 6 1
..$ lambda : num 0.00686
..$ dev.ratio: num 0.945
..$ nulldev : num 2235
..$ npasses : int 5
..$ jerr : int 0
..$ offset : logi FALSE
..$ call : language glmnet(x = x, y = y, family = family, alpha = best.alpha.enet, lambda = best.lambda.enet, penalty.factor = p| __truncated__ ...
..$ nobs : int 500
..- attr(*, "class")= chr [1:2] "elnet" "glmnet"
$ best.alpha.enet : num 0.85
$ best.alpha.aenet : num 0.05
$ best.lambda.enet : num 0.00686
$ best.lambda.aenet: num 1.11e+13
$ step.criterion : num [1:2] 0.501 0.502
$ adpen : num [1:6, 1] 1.02 9.73e-01 1.00 1.02 4.50e+15 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:6] "x1" "x2" "x3" "x4" ...
.. ..$ : chr "s0"
$ seed : num 1001
$ call : language msaenet::aenet(x = x, y = y)
- attr(*, "class")= chr [1:2] "msaenet" "msaenet.aenet"
> coef(fit)
[1] 0.9799217 1.0258625 0.9968319 0.9781661 0.0000000 0.0000000
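The SCAD and MCP counterparts in msaenet are `asnet` and `amnet` from the table, and they follow the same interface as `aenet`. A minimal sketch, assuming the same `x` and `y` as above (with the default `tune = "cv"`):

```r
# Adaptive SCAD-net: same call pattern as aenet
fit_scad <- msaenet::asnet(x, y)
coef(fit_scad)

# Adaptive MCP-net
fit_mcp <- msaenet::amnet(x, y)
coef(fit_mcp)
```

With this well-separated simulated data, both penalties should likewise recover the four truly nonzero coefficients.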
3. ncvreg
> fit <- ncvreg::cv.ncvreg(x, y)
> summary(fit)
MCP-penalized linear regression with n=500, p=6
At minimum cross-validation error (lambda=0.1781):
-------------------------------------------------
Nonzero coefficients: 4
Cross-validation error (deviance): 0.25
R-squared: 0.94
Signal-to-noise ratio: 16.69
Scale estimate (sigma): 0.503
> str(fit)
List of 9
$ cve : num [1:100] 4.45 4.25 3.91 3.36 2.76 ...
$ cvse : num [1:100] 0.278 0.267 0.246 0.211 0.174 ...
$ fold : num [1:500] 2 2 5 ...
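The `str` output above is truncated in the original post. For completeness, a sketch of how the fitted `cv.ncvreg` object is typically used; these are standard ncvreg methods, shown without their output:

```r
# Coefficients at the lambda minimizing CV error (MCP is the default penalty)
coef(fit)

# The selected lambda is stored in the object
fit$lambda.min

# SCAD or lasso penalties can be requested explicitly via the penalty argument
fit_scad <- ncvreg::cv.ncvreg(x, y, penalty = "SCAD")
coef(fit_scad)

# Cross-validation error curve
plot(fit)
```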