How to do a GLM when "contrasts can be applied only to factors with 2 or more levels"?
Posted: 2018-10-22 03:53:54
Question: I want to run a regression in R using glm, but I get the contrasts error shown below. Is there a way to make this work?
mydf <- data.frame(Group=c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12),
WL=rep(c(1,0),12),
New.Runner=c("N","N","N","N","N","N","Y","N","N","N","N","N","N","Y","N","N","N","Y","N","N","N","N","N","Y"),
Last.Run=c(1,5,2,6,5,4,NA,3,7,2,4,9,8,NA,3,5,1,NA,6,10,7,9,2,NA))
mod <- glm(formula = WL~New.Runner+Last.Run, family = binomial, data = mydf)
#Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
# contrasts can be applied only to factors with 2 or more levels
Answer 1: Using the debug_contr_error and debug_contr_error2 functions defined in How to debug "contrasts can be applied only to factors with 2 or more levels" error?, we can easily locate the problem: only a single level is left in the variable New.Runner.
info <- debug_contr_error2(WL ~ New.Runner + Last.Run, mydf)
info[c(2, 3)]
#$nlevels
#New.Runner
# 1
#
#$levels
#$levels$New.Runner
#[1] "N"
## the data frame that is actually used by `glm`
dat <- info$mf
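The two helper functions come from the linked answer and are not part of base R. As a rough cross-check that needs no helpers, the sketch below rebuilds the model frame with base R only and counts the distinct values left in each column. It assumes the default NA handling (rows with any NA are omitted), which is passed explicitly here.

```r
# Same data as in the question.
mydf <- data.frame(
  Group = rep(1:12, each = 2),
  WL = rep(c(1, 0), 12),
  New.Runner = c("N","N","N","N","N","N","Y","N","N","N","N","N",
                 "N","Y","N","N","N","Y","N","N","N","N","N","Y"),
  Last.Run = c(1,5,2,6,5,4,NA,3,7,2,4,9,8,NA,3,5,1,NA,6,10,7,9,2,NA)
)

# Build the model frame the way glm would, dropping incomplete rows.
mf <- model.frame(WL ~ New.Runner + Last.Run, data = mydf, na.action = na.omit)

# Number of rows that survive, and distinct values per column.
nrow(mf)
sapply(mf, function(x) length(unique(x)))

# Every "Y" row has a missing Last.Run, so only "N" is left.
table(mydf$New.Runner, is.na(mydf$Last.Run))
```

The cross-tabulation shows why the level disappears: all four "Y" rows are exactly the rows where Last.Run is NA.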
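Contrasts cannot be applied to a single-level factor, because any kind of contrasts reduces the number of levels by 1. Since 1 - 1 = 0, this variable would be dropped from the model matrix.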
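To make the "levels minus one" point concrete, here is a small sketch using the base R contrast functions on a hypothetical 3-level factor (the level count 3 is only an illustration, not taken from the question's data).

```r
# With contrasts, a 3-level factor is coded by 2 columns.
ct <- contr.treatment(3)
dim(ct)

# Without contrasts, all 3 indicator columns are kept.
full <- contr.treatment(3, contrasts = FALSE)
dim(full)

# The same "levels minus one" pattern holds for other contrast types.
ncol(contr.sum(4))
ncol(contr.helmert(5))
```

By the same arithmetic, a 1-level factor would be coded by 0 columns, which is why the contrast functions refuse it.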
So, can we simply ask for no contrasts to be applied to a single-level factor? No. All contrast functions forbid this:
contr.helmert(n = 1, contrasts = FALSE)
#Error in contr.helmert(n = 1, contrasts = FALSE) :
# not enough degrees of freedom to define contrasts
contr.poly(n = 1, contrasts = FALSE)
#Error in contr.poly(n = 1, contrasts = FALSE) :
# contrasts not defined for 0 degrees of freedom
contr.sum(n = 1, contrasts = FALSE)
#Error in contr.sum(n = 1, contrasts = FALSE) :
# not enough degrees of freedom to define contrasts
contr.treatment(n = 1, contrasts = FALSE)
#Error in contr.treatment(n = 1, contrasts = FALSE) :
# not enough degrees of freedom to define contrasts
contr.SAS(n = 1, contrasts = FALSE)
#Error in contr.treatment(n, base = if (is.numeric(n) && length(n) == 1L) n else length(n), :
# not enough degrees of freedom to define contrasts
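In fact, if you think about it, you will conclude that without contrasts a single-level factor is just a dummy variable of all 1s, that is, the intercept. So you can definitely do the following:
dat$New.Runner <- 1 ## set it to 1, as if no contrasts were applied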
mod <- glm(formula = WL ~ New.Runner + Last.Run, family = binomial, data = dat)
#(Intercept) New.Runner Last.Run
# 1.4582 NA -0.2507
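Because of rank-deficiency, you get an NA coefficient for New.Runner. In fact, applying contrasts is a fundamental way to avoid rank-deficiency. It is just that when a factor has only one level, applying contrasts becomes a paradox.
Let's also look at the model matrix: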
model.matrix(mod)
# (Intercept) New.Runner Last.Run
#1 1 1 1
#2 1 1 5
#3 1 1 2
#4 1 1 6
#5 1 1 5
#6 1 1 4
#8 1 1 3
#9 1 1 7
#10 1 1 2
#11 1 1 4
#12 1 1 9
#13 1 1 8
#15 1 1 3
#16 1 1 5
#17 1 1 1
#19 1 1 6
#20 1 1 10
#21 1 1 7
#22 1 1 9
#23 1 1 2
The (Intercept) and New.Runner columns are identical, so only one of them can be estimated. If you want to estimate New.Runner, drop the intercept:
glm(formula = WL ~ 0 + New.Runner + Last.Run, family = binomial, data = dat)
#New.Runner Last.Run
# 1.4582 -0.2507
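The identical-column argument can also be checked numerically. The sketch below rebuilds the design matrix by hand from the 20 complete rows shown in the model matrix above and compares its rank with its number of columns; the names X and X0 are only illustrative.

```r
# Last.Run values of the 20 rows that glm actually uses.
Last.Run <- c(1, 5, 2, 6, 5, 4, 3, 7, 2, 4, 9, 8, 3, 5, 1, 6, 10, 7, 9, 2)

# Design matrix with both an intercept and the all-ones New.Runner column.
X <- cbind(Intercept = 1, New.Runner = 1, Last.Run = Last.Run)
c(rank = qr(X)$rank, columns = ncol(X))    # rank is smaller than ncol

# Dropping the intercept column restores full column rank.
X0 <- X[, -1]
c(rank = qr(X0)$rank, columns = ncol(X0))  # rank equals ncol
```

A rank below the number of columns is what makes glm report NA for one of the coefficients.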
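Make sure you thoroughly digest the rank-deficiency issue. If you have several single-level factors and replace all of them with 1, dropping the single intercept still leaves the model rank-deficient.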
dat$foo.factor <- 1
glm(formula = WL ~ 0 + New.Runner + foo.factor + Last.Run, family = binomial, data = dat)
#New.Runner foo.factor Last.Run
# 1.4582 NA -0.2507
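If the goal is simply a model that fits, an alternative not taken from the answer above is to leave single-level predictors out of the formula. The following is a minimal sketch under stated assumptions: drop_single_level is a hypothetical helper, not a base R function, and it only handles simple additive formulas whose terms are plain column names (no interactions or transformations).

```r
# Hypothetical helper: remove predictors with fewer than 2 distinct
# values among the complete cases, then rebuild the formula.
drop_single_level <- function(formula, data) {
  mf <- model.frame(formula, data = data, na.action = na.omit)
  response <- names(mf)[1]
  predictors <- names(mf)[-1]
  varies <- vapply(mf[predictors],
                   function(x) length(unique(x)) > 1, logical(1))
  keep <- predictors[varies]
  if (length(keep) == 0) keep <- "1"  # fall back to intercept-only
  reformulate(keep, response = response)
}

mydf <- data.frame(
  WL = rep(c(1, 0), 12),
  New.Runner = c("N","N","N","N","N","N","Y","N","N","N","N","N",
                 "N","Y","N","N","N","Y","N","N","N","N","N","Y"),
  Last.Run = c(1,5,2,6,5,4,NA,3,7,2,4,9,8,NA,3,5,1,NA,6,10,7,9,2,NA)
)

f <- drop_single_level(WL ~ New.Runner + Last.Run, mydf)
f
fit <- glm(f, family = binomial, data = mydf)
coef(fit)
```

Because New.Runner is constant on the rows that remain, this fit uses the same 20 observations and should carry the same information as the intercept-plus-Last.Run models above, with no NA coefficients.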