Variable Selection: Implementing lasso, SCAD and MCP in R

Posted by 统计学小王子


0 Introduction

Since the lasso was proposed in 1996, many researchers have developed efficient algorithms for fitting it under different models, such as the multiple linear regression model, the Cox regression model, and generalized linear models. Many ways of choosing the penalty parameter have also been proposed, for example CV, AIC, and BIC, along with a variety of penalty types: lasso, adaptive lasso, elastic net, SCAD, and MCP.
This post focuses on three packages: glmnet, ncvreg, and msaenet. The table below summarizes the main functions of each package together with the models, tuning methods, and penalties they support:
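As a brief reminder of what these penalties look like (my own summary; notation and scaling differ slightly between packages, e.g. glmnet's elastic-net parameterization), all of the methods below minimize a penalized least-squares objective:

$$
\hat{\beta} = \arg\min_{\beta}\ \frac{1}{2n}\sum_{i=1}^{n}\left(y_i - x_i^{\top}\beta\right)^2 + \sum_{j=1}^{p} p_{\lambda}(\beta_j),
$$

where, for example,

$$
p_{\lambda}(\beta_j) =
\begin{cases}
\lambda\,|\beta_j| & \text{lasso},\\[2pt]
\lambda\, w_j\,|\beta_j| & \text{adaptive lasso (weights } w_j \text{ from an initial fit)},\\[2pt]
\lambda\left[\tfrac{1-\alpha}{2}\beta_j^{2} + \alpha\,|\beta_j|\right] & \text{elastic net},
\end{cases}
$$

while SCAD and MCP are nonconvex penalties that behave like the lasso near zero but apply less shrinkage to large coefficients, which reduces estimation bias.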

| Function | Model families | Penalty parameter selection | Penalty type | Package |
|----------|----------------|-----------------------------|--------------|---------|
| glmnet | "gaussian", "binomial", "poisson", "multinomial", "cox", "mgaussian" | - | lasso | glmnet |
| cv.glmnet | "gaussian", "binomial", "poisson", "multinomial", "cox", "mgaussian" | CV | lasso | glmnet |
| aenet | "gaussian", "binomial", "poisson", "cox" | "cv", "ebic", "bic", "aic" | Adaptive lasso | msaenet |
| asnet | "gaussian", "binomial", "poisson", "cox" | "cv", "ebic", "bic", "aic" | SCAD | msaenet |
| amnet | "gaussian", "binomial", "poisson", "cox" | "cv", "ebic", "bic", "aic" | MCP | msaenet |
| ncvreg | "gaussian", "binomial", "poisson" | - | "MCP", "SCAD", "lasso" | ncvreg |
| cv.ncvreg | "gaussian", "binomial", "poisson" | CV | "MCP", "SCAD", "lasso" | ncvreg |

There are many packages for lasso-type variable selection; the table above lists only the three I use most often. If you know of other good packages, feel free to recommend them in the comments or by direct message so we can all broaden our knowledge.
The table above lists the main arguments and supported models of each package and function; readers who are familiar with R can easily install the packages and fit the models they need, but this may be harder for beginners. To help everyone get started quickly (a one-line install command is sketched right after this paragraph), the following sections generate suitable simulated data and, using Gaussian errors and the lasso (or adaptive lasso) as an example with CV as the method for choosing the penalty parameter, walk through a worked example for each package.
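If the three packages are not installed yet, a minimal install command (standard CRAN installation; package versions will depend on your R setup) is:

# Install the three packages used below from CRAN
install.packages(c("glmnet", "msaenet", "ncvreg"))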

1. glmnet

1.1 Generating the data

All of the data used in this post are generated in this subsection; the generating code is:

> # R version 3.6.3
> Beta <- cbind(c(1,1,1,1,0,0)) # true beta
> n <- 500  # sample size
> set.seed(1234)  # set the random seed so the results in this post can be reproduced
> x1 <- rnorm(n)
> x2 <- rnorm(n)
> x3 <- rnorm(n)
> x4 <- rnorm(n)
> x5 <- rnorm(n)
> x6 <- rnorm(n)
> x <- cbind(x1, x2, x3, x4, x5, x6)
> y <- x %*% Beta + rnorm(n, 0, 0.5)
> 
> # show the data
> head(x, 10)
              x1          x2         x3         x4         x5         x6
 [1,] -1.2070657  0.98477997 -1.2053334  1.7940745 -0.9738186 -1.3662376
 [2,]  0.2774292 -1.22473788  0.3014667 -1.3645489 -0.0996312  0.5392093
 [3,]  1.0844412  0.70972622 -1.5391452 -0.7074400 -0.1107350 -1.3219320
 [4,] -2.3456977 -0.10921999  0.6353707 -0.5562843  1.1921946 -0.2812887
 [5,]  0.4291247  1.78260790  0.7029518 -0.3100811 -1.6558859 -2.1049469
 [6,]  0.5060559 -0.24344468 -1.9058829 -0.3761793 -1.0456433 -1.6176047
 [7,] -0.5747400 -1.52710702  0.9389214  0.4585260 -1.7402391 -0.7237319
 [8,] -0.5466319  0.49183437 -0.2244921 -1.2611491  0.5131208  0.3067410
 [9,] -0.5644520  0.35450366 -0.6738168 -0.5274652 -0.4459568  0.2255962
[10,] -0.8900378 -0.01762635  0.4457874 -0.5568142 -1.8391938  0.9357160
> head(y, 10)
            [,1]
 [1,]  0.1739191
 [2,] -1.7533630
 [3,] -0.2984157
 [4,] -1.4562544
 [5,]  3.4013056
 [6,] -2.1985494
 [7,] -1.2608381
 [8,] -1.2924795
 [9,] -2.0985074
[10,] -0.7405859
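As a quick sanity check (not part of the original workflow, just a sketch), an ordinary least-squares fit on the generated data should give coefficients close to the true Beta = (1, 1, 1, 1, 0, 0), with the last two near zero:

# OLS fit on the simulated data; lm() accepts the matrix x directly
ols <- lm(y ~ x)
round(coef(ols), 3)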

1.2 glmnet::cv.glmnet

> (fit <- glmnet::cv.glmnet(x, y))

Call:  glmnet::cv.glmnet(x = x, y = y) 

Measure: Mean-Squared Error 

     Lambda Index Measure       SE Nonzero
min 0.01019    52  0.2512 0.009655       5
1se 0.05439    34  0.2605 0.010606       4
> summary(fit)
           Length Class  Mode     
lambda     57     -none- numeric  
cvm        57     -none- numeric  
cvsd       57     -none- numeric  
cvup       57     -none- numeric  
cvlo       57     -none- numeric  
nzero      57     -none- numeric  
call        3     -none- call     
name        1     -none- character
glmnet.fit 12     elnet  list     
lambda.min  1     -none- numeric  
lambda.1se  1     -none- numeric  
index       2     -none- numeric  
> str(fit)
List of 12
 $ lambda    : num [1:57] 1.172 1.068 0.973 0.886 0.808 ...
 $ cvm       : num [1:57] 4.46 4.23 3.79 3.23 2.73 ...
 $ cvsd      : num [1:57] 0.369 0.376 0.364 0.323 0.272 ...
 $ cvup      : num [1:57] 4.83 4.6 4.16 3.56 3 ...
 $ cvlo      : num [1:57] 4.09 3.85 3.43 2.91 2.45 ...
 $ nzero     : Named int [1:57] 0 2 4 4 4 4 4 4 4 4 ...
  ..- attr(*, "names")= chr [1:57] "s0" "s1" "s2" "s3" ...
 $ call      : language glmnet::cv.glmnet(x = x, y = y)
 $ name      : Named chr "Mean-Squared Error"
  ..- attr(*, "names")= chr "mse"
 $ glmnet.fit:List of 12
  ..$ a0       : Named num [1:57] -0.000854 -0.000831 0.004037 0.005898 0.007593 ...
  .. ..- attr(*, "names")= chr [1:57] "s0" "s1" "s2" "s3" ...
  ..$ beta     :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  .. .. ..@ i       : int [1:236] 0 1 0 1 2 3 0 1 2 3 ...
  .. .. ..@ p       : int [1:58] 0 0 2 6 10 14 18 22 26 30 ...
  .. .. ..@ Dim     : int [1:2] 6 57
  .. .. ..@ Dimnames:List of 2
  .. .. .. ..$ : chr [1:6] "x1" "x2" "x3" "x4" ...
  .. .. .. ..$ : chr [1:57] "s0" "s1" "s2" "s3" ...
  .. .. ..@ x       : num [1:236] 0.10041 0.00378 0.18545 0.09365 0.0017 ...
  .. .. ..@ factors : list()
  ..$ df       : int [1:57] 0 2 4 4 4 4 4 4 4 4 ...
  ..$ dim      : int [1:2] 6 57
  ..$ lambda   : num [1:57] 1.172 1.068 0.973 0.886 0.808 ...
  ..$ dev.ratio: num [1:57] 0 0.0537 0.1566 0.2905 0.4016 ...
  ..$ nulldev  : num 2235
  ..$ npasses  : int 206
  ..$ jerr     : int 0
  ..$ offset   : logi FALSE
  ..$ call     : language glmnet(x = x, y = y)
  ..$ nobs     : int 500
  ..- attr(*, "class")= chr [1:2] "elnet" "glmnet"
 $ lambda.min: num 0.0102
 $ lambda.1se: num 0.0544
 $ index     : int [1:2, 1] 52 34
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:2] "min" "1se"
  .. ..$ : chr "Lambda"
 - attr(*, "class")= chr "cv.glmnet"
> coef(fit)
7 x 1 sparse Matrix of class "dgCMatrix"
                     1
(Intercept) 0.02380925
x1          0.94151352
x2          0.98011080
x3          0.94789578
x4          0.93311113
x5          .         
x6          .         
> 
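Some typical follow-up calls on the cv.glmnet object fitted above (standard glmnet API; note that coef(fit) above used the default s = "lambda.1se", which is why only x1 to x4 are nonzero):

# Coefficients at the lambda that minimizes the CV error (5 nonzero per the summary above)
coef(fit, s = "lambda.min")

# CV error versus log(lambda), with lambda.min and lambda.1se marked
plot(fit)

# In-sample predictions at lambda.1se
head(predict(fit, newx = x, s = "lambda.1se"))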

2. msaenet

> fit <- msaenet::aenet(x, y)
> summary(fit)
                  Length Class     Mode   
beta               6     dgCMatrix S4     
model             12     elnet     list   
beta.first         6     dgCMatrix S4     
model.first       12     elnet     list   
best.alpha.enet    1     -none-    numeric
best.alpha.aenet   1     -none-    numeric
best.lambda.enet   1     -none-    numeric
best.lambda.aenet  1     -none-    numeric
step.criterion     2     -none-    numeric
adpen              6     -none-    numeric
seed               1     -none-    numeric
call               3     -none-    call   
> str(fit)
List of 12
 $ beta             :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  .. ..@ i       : int [1:4] 0 1 2 3
  .. ..@ p       : int [1:2] 0 4
  .. ..@ Dim     : int [1:2] 6 1
  .. ..@ Dimnames:List of 2
  .. .. ..$ : chr [1:6] "x1" "x2" "x3" "x4" ...
  .. .. ..$ : chr "s0"
  .. ..@ x       : num [1:4] 0.98 1.026 0.997 0.978
  .. ..@ factors : list()
 $ model            :List of 12
  ..$ a0       : Named num 0.0248
  .. ..- attr(*, "names")= chr "s0"
  ..$ beta     :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  .. .. ..@ i       : int [1:4] 0 1 2 3
  .. .. ..@ p       : int [1:2] 0 4
  .. .. ..@ Dim     : int [1:2] 6 1
  .. .. ..@ Dimnames:List of 2
  .. .. .. ..$ : chr [1:6] "x1" "x2" "x3" "x4" ...
  .. .. .. ..$ : chr "s0"
  .. .. ..@ x       : num [1:4] 0.98 1.026 0.997 0.978
  .. .. ..@ factors : list()
  ..$ df       : int 4
  ..$ dim      : int [1:2] 6 1
  ..$ lambda   : num 1.11e+13
  ..$ dev.ratio: num 0.945
  ..$ nulldev  : num 2235
  ..$ npasses  : int 5
  ..$ jerr     : int 0
  ..$ offset   : logi FALSE
  ..$ call     : language glmnet(x = x, y = y, family = family, alpha = best.alpha.aenet, lambda = best.lambda.aenet,      penalty.factor =| __truncated__
  ..$ nobs     : int 500
  ..- attr(*, "class")= chr [1:2] "elnet" "glmnet"
 $ beta.first       :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  .. ..@ i       : int [1:5] 0 1 2 3 5
  .. ..@ p       : int [1:2] 0 5
  .. ..@ Dim     : int [1:2] 6 1
  .. ..@ Dimnames:List of 2
  .. .. ..$ : chr [1:6] "x1" "x2" "x3" "x4" ...
  .. .. ..$ : chr "s0"
  .. ..@ x       : num [1:5] 0.98 1.027 0.999 0.978 -0.017
  .. ..@ factors : list()
 $ model.first      :List of 12
  ..$ a0       : Named num 0.025
  .. ..- attr(*, "names")= chr "s0"
  ..$ beta     :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  .. .. ..@ i       : int [1:5] 0 1 2 3 5
  .. .. ..@ p       : int [1:2] 0 5
  .. .. ..@ Dim     : int [1:2] 6 1
  .. .. ..@ Dimnames:List of 2
  .. .. .. ..$ : chr [1:6] "x1" "x2" "x3" "x4" ...
  .. .. .. ..$ : chr "s0"
  .. .. ..@ x       : num [1:5] 0.98 1.027 0.999 0.978 -0.017
  .. .. ..@ factors : list()
  ..$ df       : int 5
  ..$ dim      : int [1:2] 6 1
  ..$ lambda   : num 0.00686
  ..$ dev.ratio: num 0.945
  ..$ nulldev  : num 2235
  ..$ npasses  : int 5
  ..$ jerr     : int 0
  ..$ offset   : logi FALSE
  ..$ call     : language glmnet(x = x, y = y, family = family, alpha = best.alpha.enet, lambda = best.lambda.enet,      penalty.factor = p| __truncated__ ...
  ..$ nobs     : int 500
  ..- attr(*, "class")= chr [1:2] "elnet" "glmnet"
 $ best.alpha.enet  : num 0.85
 $ best.alpha.aenet : num 0.05
 $ best.lambda.enet : num 0.00686
 $ best.lambda.aenet: num 1.11e+13
 $ step.criterion   : num [1:2] 0.501 0.502
 $ adpen            : num [1:6, 1] 1.02 9.73e-01 1.00 1.02 4.50e+15 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:6] "x1" "x2" "x3" "x4" ...
  .. ..$ : chr "s0"
 $ seed             : num 1001
 $ call             : language msaenet::aenet(x = x, y = y)
 - attr(*, "class")= chr [1:2] "msaenet" "msaenet.aenet"
> coef(fit)
[1] 0.9799217 1.0258625 0.9968319 0.9781661 0.0000000 0.0000000
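A few follow-up calls on the aenet fit (the helper and argument names below follow the msaenet documentation; defaults such as the number of CV folds may differ across package versions, so treat this as a sketch):

# Indices of the selected (nonzero) variables
msaenet::msaenet.nzv(fit)

# In-sample predictions from the adaptive elastic-net fit
head(predict(fit, newx = x))

# Refit using BIC to choose the penalty parameter instead of the default CV;
# "bic" is one of the tuning options listed in the table above
fit_bic <- msaenet::aenet(x, y, family = "gaussian", tune = "bic")
coef(fit_bic)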

3. ncvreg

> fit <- ncvreg::cv.ncvreg(x, y)
> summary(fit)
MCP-penalized linear regression with n=500, p=6
At minimum cross-validation error (lambda=0.1781):
-------------------------------------------------
  Nonzero coefficients: 4
  Cross-validation error (deviance): 0.25
  R-squared: 0.94
  Signal-to-noise ratio: 16.69
  Scale estimate (sigma): 0.503
> str(fit)
List of 9
 $ cve       : num [1:100] 4.45 4.25 3.91 3.36 2.76 ...
 $ cvse      : num [1:100] 0.278 0.267 0.246 0.211 0.174 ...
 $ fold      : num [1:500] 2 2 5 ...

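To round off the ncvreg example, here are typical follow-up calls on the cv.ncvreg fit (standard ncvreg API; cv.ncvreg fits MCP by default, and the penalty argument accepts the "SCAD" and "lasso" options listed in the table above):

# Coefficients at the lambda minimizing the cross-validation error
coef(fit)

# Cross-validation error curve along the lambda path
plot(fit)

# The same cross-validated fit with the SCAD penalty instead of the default MCP
fit_scad <- ncvreg::cv.ncvreg(x, y, penalty = "SCAD")
summary(fit_scad)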