R计算具有奇点的lm模型的稳健标准误差(vcovHC)
Posted
技术标签:
【中文标题】R计算具有奇点的lm模型的稳健标准误差(vcovHC)【英文标题】:R calculate robust standard errors (vcovHC) for lm model with singularities 【发布时间】:2012-03-09 06:54:35 【问题描述】:在 R 中,当某些系数由于奇异性而被丢弃时,如何使用 vcovHC() 计算稳健的标准误差?标准 lm 函数似乎可以很好地计算所有实际估计的系数的正常标准误差,但 vcovHC() 会引发错误:“面包错误。%*% 肉。:不符合标准的参数”。
(我使用的实际数据有点复杂。实际上,它是一个使用两种不同固定效应的模型,我遇到了无法简单摆脱的局部奇点。至少我不知道如何. 对于我使用的两个固定效应,第一个因子有 150 个级别,第二个因子有 142 个级别,总共有 9 个奇点,这是因为数据是在十个块中收集的。)
这是我的输出:
Call:
lm(formula = one ~ two + three + Jan + Feb + Mar + Apr + May +
Jun + Jul + Aug + Sep + Oct + Nov + Dec, data = dat)
Residuals:
Min 1Q Median 3Q Max
-130.12 -60.95 0.08 61.05 137.35
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1169.74313 57.36807 20.390 <2e-16 ***
two -0.07963 0.06720 -1.185 0.237
three -0.04053 0.06686 -0.606 0.545
Jan 8.10336 22.05552 0.367 0.714
Feb 0.44025 22.11275 0.020 0.984
Mar 19.65066 22.02454 0.892 0.373
Apr -13.19779 22.02886 -0.599 0.550
May 15.39534 22.10445 0.696 0.487
Jun -12.50227 22.07013 -0.566 0.572
Jul -20.58648 22.06772 -0.933 0.352
Aug -0.72223 22.36923 -0.032 0.974
Sep 12.42204 22.09296 0.562 0.574
Oct 25.14836 22.04324 1.141 0.255
Nov 18.13337 22.08717 0.821 0.413
Dec NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 69.63 on 226 degrees of freedom
Multiple R-squared: 0.04878, Adjusted R-squared: -0.005939
F-statistic: 0.8914 on 13 and 226 DF, p-value: 0.5629
> model$se <- vcovHC(model)
Error in bread. %*% meat. : non-conformable arguments
这里是一个用于重现错误的最小代码。
library(sandwich)
set.seed(101)
dat<-data.frame(one=c(sample(1000:1239)),
two=c(sample(200:439)),
three=c(sample(600:839)),
Jan=c(rep(1,20),rep(0,220)),
Feb=c(rep(0,20),rep(1,20),rep(0,200)),
Mar=c(rep(0,40),rep(1,20),rep(0,180)),
Apr=c(rep(0,60),rep(1,20),rep(0,160)),
May=c(rep(0,80),rep(1,20),rep(0,140)),
Jun=c(rep(0,100),rep(1,20),rep(0,120)),
Jul=c(rep(0,120),rep(1,20),rep(0,100)),
Aug=c(rep(0,140),rep(1,20),rep(0,80)),
Sep=c(rep(0,160),rep(1,20),rep(0,60)),
Oct=c(rep(0,180),rep(1,20),rep(0,40)),
Nov=c(rep(0,200),rep(1,20),rep(0,20)),
Dec=c(rep(0,220),rep(1,20)))
model <- lm(one ~ two + three + Jan + Feb + Mar + Apr + May + Jun + Jul + Aug + Sep + Oct + Nov + Dec, data=dat)
summary(model)
model$se <- vcovHC(model)
【问题讨论】:
您的代码似乎有效。您可能希望从模型中明确删除其中一个月份(或截距),以避免奇点。 不幸的是我的意思是:我无法消除奇点。这只是我发布的一个简单示例数据集。在那个数据集中,我同意:你可以简单地从回归中删除 Dec,从而摆脱奇点,然后 vcovHC() 就可以了。在我的实际数据中,奇点源于具有多个级别(分别为 150 和 142)的两个固定效应,这有点复杂。我还没有找到摆脱该数据中的奇点的方法。 @Chris:你还有这个错误吗?在将Dec
更改为c(rep(0,240))
以诱导奇点后,对vcovHC(model)
的调用成功,而不会出现您注意到的错误。在三明治 2.2-9 的更新日志中:lm/mlm/glm models with aliased parameters were not handled correctly (leading to errors in sandwich/vcovHC etc.), fixed now.
也许修复了它?
【参考方案1】:
具有奇点的模型永远都不好,它们应该被修复。在你的例子中,你有 12 个月的 12 个系数,还有全局截距!所以实际上只有 12 个实际参数需要估计 13 个系数。您真正想要的是禁用全局拦截 - 所以您将拥有更像是特定月份的拦截:
> model <- lm(one ~ 0 + two + three + Jan + Feb + Mar + Apr + May + Jun + Jul + Aug + Sep + Oct + Nov + Dec, data=dat)
> summary(model)
Call:
lm(formula = one ~ 0 + two + three + Jan + Feb + Mar + Apr +
May + Jun + Jul + Aug + Sep + Oct + Nov + Dec, data = dat)
Residuals:
Min 1Q Median 3Q Max
-133.817 -55.636 3.329 56.768 126.772
Coefficients:
Estimate Std. Error t value Pr(>|t|)
two -0.09670 0.06621 -1.460 0.146
three 0.02446 0.06666 0.367 0.714
Jan 1130.05812 52.79625 21.404 <2e-16 ***
Feb 1121.32904 55.18864 20.318 <2e-16 ***
Mar 1143.50310 53.59603 21.336 <2e-16 ***
Apr 1143.95365 54.99724 20.800 <2e-16 ***
May 1136.36429 53.38218 21.287 <2e-16 ***
Jun 1129.86010 53.85865 20.978 <2e-16 ***
Jul 1105.10045 54.94940 20.111 <2e-16 ***
Aug 1147.47152 54.57201 21.027 <2e-16 ***
Sep 1139.42205 53.58611 21.263 <2e-16 ***
Oct 1117.75075 55.35703 20.192 <2e-16 ***
Nov 1129.20208 53.54934 21.087 <2e-16 ***
Dec 1149.55556 53.52499 21.477 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 69.81 on 226 degrees of freedom
Multiple R-squared: 0.9964, Adjusted R-squared: 0.9961
F-statistic: 4409 on 14 and 226 DF, p-value: < 2.2e-16
那么,这是一个正常的模型,所以你不应该对 vcovHC 有任何问题。
【讨论】:
【参考方案2】:您的目标似乎是固定效应估计,虽然这个问题是不久前提出的,但我遇到了同样的问题,这是我的解决方案:
可以通过在估计方程中包含 + factor()
来控制固定效应:
所以我先创建了一个额外的列:
# create an addtitional column in your data
dat$month <- "0"
#this column will contain the month, not a dummy for months
for (i in 1:length(dat$one))
if (dat[i,"Jan"]==1)
dat[i,"month"]<- "Jan"
if (dat[i,"Feb"]==1)
dat[i,"month"]<- "Feb"
if (dat[i,"Mar"]==1)
dat[i,"month"]<- "Mar"
if (dat[i,"Apr"]==1)
dat[i,"month"]<- "Apr"
if (dat[i,"May"]==1)
dat[i,"month"]<- "May"
if (dat[i,"Jun"]==1)
dat[i,"month"]<- "Jun"
if (dat[i,"Jul"]==1)
dat[i,"month"]<- "Jul"
if (dat[i,"Aug"]==1)
dat[i,"month"]<- "Aug"
if (dat[i,"Sep"]==1)
dat[i,"month"]<- "Sep"
if (dat[i,"Oct"]==1)
dat[i,"month"]<- "Oct"
if (dat[i,"Nov"]==1)
dat[i,"month"]<- "Nov"
if (dat[i,"Dec"]==1)
dat[i,"month"]<- "Dec"
i <- NULL
此列现在可以用作回归方程中的固定或恒定影响因子:
> #you can use the created column as fixed effect factor in your
+ regression
> model_A <- lm(one ~ two + three + factor(month), data=dat)
> summary(model_A)
Call:
lm(formula = one ~ two + three + factor(month), data = dat)
Residuals:
Min 1Q Median 3Q Max
-133.817 -55.636 3.329 56.768 126.772
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1143.95365 54.99724 20.800 <2e-16 ***
two -0.09670 0.06621 -1.460 0.1455
three 0.02446 0.06666 0.367 0.7141
factor(month)Aug 3.51788 22.09948 0.159 0.8737
factor(month)Dec 5.60192 22.41204 0.250 0.8029
factor(month)Feb -22.62460 22.10889 -1.023 0.3072
factor(month)Jan -13.89553 22.25117 -0.624 0.5329
factor(month)Jul -38.85320 22.13980 -1.755 0.0806 .
factor(month)Jun -14.09355 22.18707 -0.635 0.5259
factor(month)Mar -0.45055 22.13638 -0.020 0.9838
factor(month)May -7.58935 22.14137 -0.343 0.7321
factor(month)Nov -14.75156 22.27288 -0.662 0.5084
factor(month)Oct -26.20290 22.09416 -1.186 0.2369
factor(month)Sep -4.53159 22.26334 -0.204 0.8389
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 69.81 on 226 degrees of freedom
Multiple R-squared: 0.04381, Adjusted R-squared: -0.01119
F-statistic: 0.7966 on 13 and 226 DF, p-value: 0.6635
> #and also do the same without intercept if so needed
> model_B <- lm(one ~ 0 + two + three + factor(month), data=dat)
> summary(model_B)
Call:
lm(formula = one ~ 0 + two + three + factor(month), data = dat)
Residuals:
Min 1Q Median 3Q Max
-133.817 -55.636 3.329 56.768 126.772
Coefficients:
Estimate Std. Error t value Pr(>|t|)
two -0.09670 0.06621 -1.460 0.146
three 0.02446 0.06666 0.367 0.714
factor(month)Apr 1143.95365 54.99724 20.800 <2e-16 ***
factor(month)Aug 1147.47152 54.57201 21.027 <2e-16 ***
factor(month)Dec 1149.55556 53.52499 21.477 <2e-16 ***
factor(month)Feb 1121.32904 55.18864 20.318 <2e-16 ***
factor(month)Jan 1130.05812 52.79625 21.404 <2e-16 ***
factor(month)Jul 1105.10045 54.94940 20.111 <2e-16 ***
factor(month)Jun 1129.86010 53.85865 20.978 <2e-16 ***
factor(month)Mar 1143.50310 53.59603 21.336 <2e-16 ***
factor(month)May 1136.36429 53.38218 21.287 <2e-16 ***
factor(month)Nov 1129.20208 53.54934 21.087 <2e-16 ***
factor(month)Oct 1117.75075 55.35703 20.192 <2e-16 ***
factor(month)Sep 1139.42205 53.58611 21.263 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 69.81 on 226 degrees of freedom
Multiple R-squared: 0.9964, Adjusted R-squared: 0.9961
F-statistic: 4409 on 14 and 226 DF, p-value: < 2.2e-16
这使您可以对面板数据运行常规 OLS 回归。
【讨论】:
以上是关于R计算具有奇点的lm模型的稳健标准误差(vcovHC)的主要内容,如果未能解决你的问题,请参考以下文章
R语言回归模型残差标准误差计算实战(Residual Standard Error):计算残差标准误残差标准误解读
R语言使用lm构建线性回归模型并将目标变量对数化实战:模型训练集和测试集的残差总结信息(residiual summary)模型训练(测试)集自由度计算模型训练(测试)集残差标准误计算