R计算具有奇点的lm模型的稳健标准误差(vcovHC)

Posted

技术标签:

【中文标题】R计算具有奇点的lm模型的稳健标准误差(vcovHC)【英文标题】:R calculate robust standard errors (vcovHC) for lm model with singularities 【发布时间】:2012-03-09 06:54:35 【问题描述】:

在 R 中,当某些系数由于奇异性而被丢弃时,如何使用 vcovHC() 计算稳健的标准误差?标准 lm 函数似乎可以很好地计算所有实际估计的系数的正常标准误差,但 vcovHC() 会引发错误:“面包错误。%*% 肉。:不符合标准的参数”。

(我使用的实际数据有点复杂。实际上,它是一个使用两种不同固定效应的模型,我遇到了无法简单摆脱的局部奇点。至少我不知道如何. 对于我使用的两个固定效应,第一个因子有 150 个级别,第二个因子有 142 个级别,总共有 9 个奇点,这是因为数据是在十个块中收集的。)

这是我的输出:

Call:
lm(formula = one ~ two + three + Jan + Feb + Mar + Apr + May + 
Jun + Jul + Aug + Sep + Oct + Nov + Dec, data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-130.12  -60.95    0.08   61.05  137.35 

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1169.74313   57.36807  20.390   <2e-16 ***
two           -0.07963    0.06720  -1.185    0.237    
three         -0.04053    0.06686  -0.606    0.545    
Jan            8.10336   22.05552   0.367    0.714    
Feb            0.44025   22.11275   0.020    0.984    
Mar           19.65066   22.02454   0.892    0.373    
Apr          -13.19779   22.02886  -0.599    0.550    
May           15.39534   22.10445   0.696    0.487    
Jun          -12.50227   22.07013  -0.566    0.572    
Jul          -20.58648   22.06772  -0.933    0.352    
Aug           -0.72223   22.36923  -0.032    0.974    
Sep           12.42204   22.09296   0.562    0.574    
Oct           25.14836   22.04324   1.141    0.255    
Nov           18.13337   22.08717   0.821    0.413    
Dec                 NA         NA      NA       NA    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 69.63 on 226 degrees of freedom
Multiple R-squared: 0.04878,    Adjusted R-squared: -0.005939 
F-statistic: 0.8914 on 13 and 226 DF,  p-value: 0.5629 

> model$se <- vcovHC(model)
Error in bread. %*% meat. : non-conformable arguments

这里是一个用于重现错误的最小代码。

library(sandwich)
set.seed(101)
dat<-data.frame(one=c(sample(1000:1239)),
              two=c(sample(200:439)),
              three=c(sample(600:839)),
              Jan=c(rep(1,20),rep(0,220)),
              Feb=c(rep(0,20),rep(1,20),rep(0,200)),
              Mar=c(rep(0,40),rep(1,20),rep(0,180)),
              Apr=c(rep(0,60),rep(1,20),rep(0,160)),
              May=c(rep(0,80),rep(1,20),rep(0,140)),
              Jun=c(rep(0,100),rep(1,20),rep(0,120)),
              Jul=c(rep(0,120),rep(1,20),rep(0,100)),
              Aug=c(rep(0,140),rep(1,20),rep(0,80)),
              Sep=c(rep(0,160),rep(1,20),rep(0,60)),
              Oct=c(rep(0,180),rep(1,20),rep(0,40)),
              Nov=c(rep(0,200),rep(1,20),rep(0,20)),
              Dec=c(rep(0,220),rep(1,20))) 
model <- lm(one ~ two + three + Jan + Feb + Mar + Apr + May + Jun + Jul + Aug + Sep + Oct + Nov + Dec, data=dat)
summary(model)
model$se <- vcovHC(model)

【问题讨论】:

您的代码似乎有效。您可能希望从模型中明确删除其中一个月份(或截距),以避免奇点。 不幸的是我的意思是:我无法消除奇点。这只是我发布的一个简单示例数据集。在那个数据集中,我同意:你可以简单地从回归中删除 Dec,从而摆脱奇点,然后 vcovHC() 就可以了。在我的实际数据中,奇点源于具有多个级别(分别为 150 和 142)的两个固定效应,这有点复杂。我还没有找到摆脱该数据中的奇点的方法。 @Chris:你还有这个错误吗?在将Dec 更改为c(rep(0,240)) 以诱导奇点后,对vcovHC(model) 的调用成功,而不会出现您注意到的错误。在三明治 2.2-9 的更新日志中:lm/mlm/glm models with aliased parameters were not handled correctly (leading to errors in sandwich/vcovHC etc.), fixed now. 也许修复了它? 【参考方案1】:

具有奇点的模型永远都不好,它们应该被修复。在你的例子中,你有 12 个月的 12 个系数,还有全局截距!所以实际上只有 12 个实际参数需要估计 13 个系数。您真正想要的是禁用全局拦截 - 所以您将拥有更像是特定月份的拦截:

> model <- lm(one ~ 0 + two + three + Jan + Feb + Mar + Apr + May + Jun + Jul + Aug + Sep + Oct + Nov + Dec, data=dat)
> summary(model)

Call:
lm(formula = one ~ 0 + two + three + Jan + Feb + Mar + Apr + 
    May + Jun + Jul + Aug + Sep + Oct + Nov + Dec, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-133.817  -55.636    3.329   56.768  126.772 

Coefficients:
        Estimate Std. Error t value Pr(>|t|)    
two     -0.09670    0.06621  -1.460    0.146    
three    0.02446    0.06666   0.367    0.714    
Jan   1130.05812   52.79625  21.404   <2e-16 ***
Feb   1121.32904   55.18864  20.318   <2e-16 ***
Mar   1143.50310   53.59603  21.336   <2e-16 ***
Apr   1143.95365   54.99724  20.800   <2e-16 ***
May   1136.36429   53.38218  21.287   <2e-16 ***
Jun   1129.86010   53.85865  20.978   <2e-16 ***
Jul   1105.10045   54.94940  20.111   <2e-16 ***
Aug   1147.47152   54.57201  21.027   <2e-16 ***
Sep   1139.42205   53.58611  21.263   <2e-16 ***
Oct   1117.75075   55.35703  20.192   <2e-16 ***
Nov   1129.20208   53.54934  21.087   <2e-16 ***
Dec   1149.55556   53.52499  21.477   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 69.81 on 226 degrees of freedom
Multiple R-squared:  0.9964,    Adjusted R-squared:  0.9961 
F-statistic:  4409 on 14 and 226 DF,  p-value: < 2.2e-16

那么,这是一个正常的模型,所以你不应该对 vcovHC 有任何问题。

【讨论】:

【参考方案2】:

您的目标似乎是固定效应估计,虽然这个问题是不久前提出的,但我遇到了同样的问题,这是我的解决方案: 可以通过在估计方程中包含 + factor() 来控制固定效应:

所以我先创建了一个额外的列:

# create an addtitional column in your data 
dat$month <- "0"
#this column will contain the month, not a dummy for months
for	(i in 1:length(dat$one))
	if (dat[i,"Jan"]==1)
	dat[i,"month"]<- "Jan"
	if (dat[i,"Feb"]==1)
	dat[i,"month"]<- "Feb"
	if (dat[i,"Mar"]==1)
	dat[i,"month"]<- "Mar"
	if (dat[i,"Apr"]==1)
	dat[i,"month"]<- "Apr"
	if (dat[i,"May"]==1)
	dat[i,"month"]<- "May"
	if (dat[i,"Jun"]==1)
	dat[i,"month"]<- "Jun"
	if (dat[i,"Jul"]==1)
	dat[i,"month"]<- "Jul"
	if (dat[i,"Aug"]==1)
	dat[i,"month"]<- "Aug"
	if (dat[i,"Sep"]==1)
	dat[i,"month"]<- "Sep"
	if (dat[i,"Oct"]==1)
	dat[i,"month"]<- "Oct"
	if (dat[i,"Nov"]==1)
	dat[i,"month"]<- "Nov"
	if (dat[i,"Dec"]==1)
	dat[i,"month"]<- "Dec"

i <- NULL

此列现在可以用作回归方程中的固定或恒定影响因子:

> #you can use the created column as fixed effect factor in your 
+ regression 
> model_A <- lm(one ~ two + three + factor(month), data=dat)
> summary(model_A)

Call:
lm(formula = one ~ two + three + factor(month), data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-133.817  -55.636    3.329   56.768  126.772 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)      1143.95365   54.99724  20.800   <2e-16 ***
two                -0.09670    0.06621  -1.460   0.1455    
three               0.02446    0.06666   0.367   0.7141    
factor(month)Aug    3.51788   22.09948   0.159   0.8737    
factor(month)Dec    5.60192   22.41204   0.250   0.8029    
factor(month)Feb  -22.62460   22.10889  -1.023   0.3072    
factor(month)Jan  -13.89553   22.25117  -0.624   0.5329    
factor(month)Jul  -38.85320   22.13980  -1.755   0.0806 .  
factor(month)Jun  -14.09355   22.18707  -0.635   0.5259    
factor(month)Mar   -0.45055   22.13638  -0.020   0.9838    
factor(month)May   -7.58935   22.14137  -0.343   0.7321    
factor(month)Nov  -14.75156   22.27288  -0.662   0.5084    
factor(month)Oct  -26.20290   22.09416  -1.186   0.2369    
factor(month)Sep   -4.53159   22.26334  -0.204   0.8389    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 69.81 on 226 degrees of freedom
Multiple R-squared:  0.04381,   Adjusted R-squared:  -0.01119 
F-statistic: 0.7966 on 13 and 226 DF,  p-value: 0.6635

> #and also do the same without intercept if so needed
> model_B <- lm(one ~ 0 + two + three + factor(month), data=dat)
> summary(model_B)

Call:
lm(formula = one ~ 0 + two + three + factor(month), data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-133.817  -55.636    3.329   56.768  126.772 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
two                -0.09670    0.06621  -1.460    0.146    
three               0.02446    0.06666   0.367    0.714    
factor(month)Apr 1143.95365   54.99724  20.800   <2e-16 ***
factor(month)Aug 1147.47152   54.57201  21.027   <2e-16 ***
factor(month)Dec 1149.55556   53.52499  21.477   <2e-16 ***
factor(month)Feb 1121.32904   55.18864  20.318   <2e-16 ***
factor(month)Jan 1130.05812   52.79625  21.404   <2e-16 ***
factor(month)Jul 1105.10045   54.94940  20.111   <2e-16 ***
factor(month)Jun 1129.86010   53.85865  20.978   <2e-16 ***
factor(month)Mar 1143.50310   53.59603  21.336   <2e-16 ***
factor(month)May 1136.36429   53.38218  21.287   <2e-16 ***
factor(month)Nov 1129.20208   53.54934  21.087   <2e-16 ***
factor(month)Oct 1117.75075   55.35703  20.192   <2e-16 ***
factor(month)Sep 1139.42205   53.58611  21.263   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 69.81 on 226 degrees of freedom
Multiple R-squared:  0.9964,    Adjusted R-squared:  0.9961 
F-statistic:  4409 on 14 and 226 DF,  p-value: < 2.2e-16

这使您可以对面板数据运行常规 OLS 回归。

【讨论】:

以上是关于R计算具有奇点的lm模型的稳健标准误差(vcovHC)的主要内容,如果未能解决你的问题,请参考以下文章

Stata和R中Logit回归的不同稳健标准误差

空间误差模型中的稳健标准误差

Tobit模型,具有白色标准误差的回归

如何在R中实现对空间面板数据的LM检验

R语言回归模型残差标准误差计算实战(Residual Standard Error):计算残差标准误残差标准误解读

R语言使用lm构建线性回归模型并将目标变量对数化实战:模型训练集和测试集的残差总结信息(residiual summary)模型训练(测试)集自由度计算模型训练(测试)集残差标准误计算