Diff-in-diff estimation with resampling from a large dataset

Posted: 2019-05-25 02:43:31

【Question】:

I have a large dataset on which I perform a diff-in-diff estimation. Given the nature of the dataset, the denominator of my t-statistics is inflated and the coefficients come out (deceptively) statistically significant. I would like to progressively reduce the number of observations in the database and, at each step, resample a large number of times, re-estimating the interaction coefficient and its standard error on every draw.

I would then like to take all the averaged estimates and standard errors and plot them on a chart, to show at what point (if any) they are no longer statistically different from zero.

My code, run on a toy example, follows below.

My open issues are:

- I am not sure this is the most efficient way to tackle the problem.
- I cannot retrieve, and therefore cannot plot, the confidence intervals (see the confidence-band sketch after the toy-example code).
- Given that different groups are present, I am not sure the sampling is representative (a stratified-sampling sketch follows this list).
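
On the last point, one option (not part of the original question, only an illustrative sketch) is to stratify each draw by treatment status, so that every subsample keeps the same treated/control mix as the full data. The sketch assumes dplyr >= 1.0.0 for slice_sample() and uses the Panel101 toy data that is loaded in the code further down; index stands for a hypothetical target subsample size at one step.

library(foreign)
library(dplyr)

# Panel101 toy data, set up as in the code below
mydata <- read.dta("http://dss.princeton.edu/training/Panel101.dta")
mydata$treated <- ifelse(mydata$country %in% c("E", "F", "G"), 1, 0)

index <- 35  # hypothetical target subsample size for one step

# draw the same fraction from each treatment group, so the subsample
# preserves the treated/control proportions of the full dataset
mydata_temp <- mydata %>%
  group_by(treated) %>%
  slice_sample(prop = index / nrow(mydata)) %>%
  ungroup()

table(mydata_temp$treated)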

Toy example (credit: Torres-Reyna, 2015)

library(foreign)
library(dplyr)
library(ggplot2)

# Panel101 toy data (Torres-Reyna) and the diff-in-diff variables
mydata <- read.dta("http://dss.princeton.edu/training/Panel101.dta")
mydata$time <- ifelse(mydata$year >= 1994, 1, 0)
mydata$treated <- ifelse(mydata$country %in% c("E", "F", "G"), 1, 0)
mydata$did <- mydata$time * mydata$treated

df_0 <- NULL
for (i in 1:length(seq(5, nrow(mydata) - 1, 5))) {
  index <- seq(5, nrow(mydata), 5)[i]
  df_1 <- NULL
  for (j in 1:10) {
    # draw a random subsample of size index and re-estimate the model
    mydata_temp <- mydata[sample(nrow(mydata), index), ]

    didreg <- lm(y ~ treated + time + did, data = mydata_temp)
    out <- summary(didreg)
    # keep the did coefficient, its standard error and the subsample size
    new_line <- c(out$coefficients[, 1][4], out$coefficients[, 2][4], index)
    new_line <- data.frame(t(new_line))
    names(new_line) <- c("c", "s", "i")
    df_1 <- rbind(df_1, new_line)
  }
  df_0 <- rbind(df_0, df_1)
}

# average coefficient and standard error for each subsample size
df_0 <- df_0 %>%
  group_by(i) %>%
  summarise(c = mean(c, na.rm = TRUE),
            s = mean(s, na.rm = TRUE))

View(df_0)
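
On the second bullet above (plotting confidence intervals): the averaged coefficient and standard error in df_0 can be turned into an approximate 95% band with c ± 1.96 * s and drawn with ggplot2, which is already loaded above. This is only a sketch of one possible plot, not part of the original post; geom_ribbon() and geom_hline() are standard ggplot2 layers and qnorm() comes from base stats.

# approximate 95% confidence band from the averaged estimates
ci <- df_0 %>%
  mutate(lower = c - qnorm(0.975) * s,
         upper = c + qnorm(0.975) * s)

# mean did coefficient against subsample size, with its band;
# the dashed line at zero marks where the estimate is no longer
# distinguishable from zero
ggplot(ci, aes(x = i, y = c)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), fill = "grey75") +
  geom_line() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "subsample size", y = "did coefficient")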

【Comments】:

I got around the SE issue this way: out <- summary(didreg); new_line <- c(out$coefficients[,1][4], out$coefficients[,2][4]) ... still not sure about the rest.

【Answer 1】:

Consider the following refactoring with base R functions: within(), %in%, nested lapply(), setNames(), aggregate(), and do.call(). This approach avoids calling rbind() inside a loop and rewrites the code compactly, without the repeated $ column references.

library(foreign)

mydata <- read.dta("http://dss.princeton.edu/training/Panel101.dta")

mydata <- within(mydata, {
  time <- ifelse(year >= 1994, 1, 0)
  treated <- ifelse(country %in% c("E", "F", "G"), 1, 0)
  did <- time * treated
})

# OUTER LIST OF DATA FRAMES
df_0_list <- lapply(1:length(seq(5, nrow(mydata) - 1, 5)), function(i) {
  index <- seq(5, nrow(mydata), 5)[i]

  # INNER LIST OF DATA FRAMES
  df_1_list <- lapply(1:100, function(j) {
    mydata_temp <- mydata[sample(nrow(mydata), index), ]

    didreg <- lm(y ~ treated + time + did, data = mydata_temp)
    out <- summary(didreg)
    new_line <- c(out$coefficients[, 1][4], out$coefficients[, 2][4], index)
    new_line <- setNames(data.frame(t(new_line)), c("c", "s", "i"))
  })

  # APPEND ALL INNER DFS
  df <- do.call(rbind, df_1_list)
  return(df)
})

# APPEND ALL OUTER DFS
df_0 <- do.call(rbind, df_0_list)

# AGGREGATE WITH NEW COLUMNS
df_0 <- within(aggregate(cbind(c, s) ~ i, df_0, function(x) mean(x, na.rm = TRUE)), {
  upper <- c + s
  lower <- c - s
})

# RUN PLOT
within(df_0, {
  plot(i, c, ylim = c(min(c) - 5000000000, max(c) + 5000000000), type = "l",
       cex.lab = 0.75, cex.axis = 0.75, cex.main = 0.75, cex.sub = 0.75)
  polygon(c(i, rev(i)), c(lower, rev(upper)),
          col = "grey75", border = FALSE)
  lines(i, c, lwd = 2)
})

【Comments】:

【Answer 2】:

In the end I solved it like this. Is this the most efficient way to do it?

library(foreign)
library(dplyr)

mydata = read.dta("http://dss.princeton.edu/training/Panel101.dta")
mydata$time = ifelse(mydata$year >= 1994, 1, 0)
mydata$treated = ifelse(mydata$country == "E" |
                          mydata$country == "F" |
                          mydata$country == "G", 1, 0)
mydata$did = mydata$time * mydata$treated


df_0 <- NULL
for (i in 1:length(seq(5, nrow(mydata) - 1, 5))) {
  index <- seq(5, nrow(mydata), 5)[i]
  df_1 <- NULL
  for (j in 1:100) {

    mydata_temp <- mydata[sample(nrow(mydata), index), ]

    didreg = lm(y ~ treated + time + did, data = mydata_temp)
    out <- summary(didreg)
    new_line <- c(out$coefficients[, 1][4], out$coefficients[, 2][4], index)
    new_line <- data.frame(t(new_line))
    names(new_line) <- c("c", "s", "i")
    df_1 <- rbind(df_1, new_line)
  }
  df_0 <- rbind(df_0, df_1)
}


df_0 <- df_0 %>% group_by(i) %>% summarise(c = mean(c, na.rm = T),
                                           s = mean(s, na.rm = T))
df_0 <- df_0 %>% group_by(i) %>% mutate(upper = c + s, lower = c - s)

df <- df_0
plot(df$i, df$c, ylim = c(min(df_0$c) - 5000000000, max(df_0$c) + 5000000000), type = "l")

polygon(c(df$i, rev(df$i)), c(df$lower, rev(df$upper)), col = "grey75", border = FALSE)
lines(df$i, df$c, lwd = 2)

【Comments】:

- It is recommended not to use rbind (or cbind) inside a loop, because it triggers quadratic copying in memory. Build a list of data frames inside the loop and rbind it once outside (see the sketch below).
- I know, but the random sampling takes care of that; I get different results on every run.
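
To make the first comment concrete, here is a minimal sketch (not from the original thread) of the recommended pattern: collect each re-estimate in a list and bind once at the end, instead of growing df_1/df_0 with rbind on every iteration. It assumes mydata and index are defined as in the code above.

# grow-by-rbind, quadratic copying (what the answer above does):
#   df_1 <- NULL
#   for (j in 1:100) df_1 <- rbind(df_1, new_line)

# list-then-bind, a single copy at the end:
runs <- lapply(1:100, function(j) {
  mydata_temp <- mydata[sample(nrow(mydata), index), ]
  out <- summary(lm(y ~ treated + time + did, data = mydata_temp))
  data.frame(c = out$coefficients[4, 1],  # did coefficient
             s = out$coefficients[4, 2],  # its standard error
             i = index)
})
df_1 <- do.call(rbind, runs)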
