R中缺失和删失数据的多重插补

Posted

技术标签:

【中文标题】R中缺失和删失数据的多重插补【英文标题】:Multiple Imputation of missing and censored data in R 【发布时间】:2017-10-05 06:16:29 【问题描述】:

我有一个包含随机缺失 (MAR) 和删失数据的数据集。这些变量是相关的,我试图有条件地估算缺失的数据,以便我可以估计相关多元正态分布的分布参数。我想使用 Gibbs MCMC,但在实施该程序时遇到了困难。我的数据框有 5 个变量(表示为 x1:x5)、1099 个样本,其中包含 MAR、审查值和观察值的某种组合。这是我迄今为止尝试过的:

# packages
library(msm, tmvtnorm, MCMCpack)

# priors 
theta0<-c(rep(0, 5))
Sigma0<-S0<-diag(5)  
nu0<-4 

# initialize parameters
theta<-c(rep(0, 5))
Tau<-diag(5) 

# initialize output matrix
n_samples <- 1000
mu_MCMC <- matrix(0, nrow = n_samples, ncol = 5)
mu_MCMC[1,] <- theta
cov_MCMC <- matrix(0, nrow = n_samples, ncol = 25)
cov_MCMC[1,] <- c(diag(5))

# detection limits
det_lim <- matrix(c(-1.7, 0, 0, 0, 0), nrow = 1, ncol = 5)

# function to detect NaN (i.e., below detection data)
is.nan.data.frame <- function(x)
    do.call(cbind, lapply(x, is.nan))

for(i in 2:n_samples)
    imputedDF <- data.frame()
    for(r in 1:nrow(originalDF))
        # variables that are MAR or censored
        mis <- r[, is.na(r) & is.nan(r)]    
        # variables that are observed
        obs <- r[, !is.na(r)]

        # subset mu for missing, observed
        mu1 <- mu[, names(r) %in% names(mis)]
        mu2 <- mu[, names(r) %in% names(obs)]

        # calculate sigmas for MVN partitions of mis, obs
        sigma11 <- sigma[names(r) %in% names(mis), names(r) %in% names(mis)]
        sigma22 <- sigma[names(r) %in% names(obs), names(r) %in% names(obs)]
        sigma12 <- sigma[names(r) %in% names(obs), names(r) %in% names(mis)]
        sigma21 <- t(sigma12)

        # create matrix for detection limits based on missing values
        ## if NaN, use detection limit; if NA use Inf
        dl <- c(ifelse("x1" %in% names(is.nan(r)), det_lim[1, "x1"], Inf), 
                ifelse("x2" %in% names(is.nan(r)), det_lim[1, "x2"], Inf), 
                ifelse("x3" %in% names(is.nan(r)), det_lim[1, "x3"], Inf), 
                ifelse("x4" %in% names(is.nan(r)), det_lim[1, "x4"], Inf), 
                ifelse("x5" %in% names(is.nan(r)), det_lim[1, "x5"], Inf))

        # compute mu, sigma to use for conditional MVN
        ## if all values are missing
        if(length(names(obs) == 0) 
            mu_mis <- mu1
            sigma_mis <- sigma11
        ## otherwise
             else 
                mu_mis <- mu1 + sigma12 %*% solve(sigma22) * (obs - t(mu2))
                sigma_mis <- sigma11 - sigma12 %*% solve(sigma22) %*% sigma21
        

        # imputation
        ## if all data are observed, missing is empty
        if(length(obs) == 0) 
            mis_impute <- data.frame()
        ## only need to impute a single value
             else if(length(names(mis)) == 1)        
                  mis_impute <- rtnorm(1, mean = mu_mis, sd = sigma_mis, lower = -Inf, upper = dl)
        ## have more than one missing value         
                   else 
                      mis_impute <- rtmvnorm(1, mean = mu_mis, sigma = sigma_mis, lower = rep(-Inf, length = length(names(mis))), upper = dl)
        

       # merge observed values with simulated 
       ## if all values observed   
       if(length(names(mis)) == 0) 
           sim_result <- obs
            else 
                 sim_result <- cbind(mis_impute, obs) 
       

       imputedDF <- rbind(imputedDF, sim_result)
    

    # update theta
    v <- solve(solve(Sigma0) + nrow(sim_result)*Tau)
    m <- v %*% (solve(Sigma0) %*% theta0 + Tau %*% apply(sim_result,2,sum))
    mu <- as.data.frame(rmvnorm(1,m,v))
    mu_MCMC[i,] <- mu

    # update Sigma
    tmp <- t(sim_result) - mu
    Tau <- rwish(nu0 + nrow(sim_result), solve(S0 + t(tmp) %*% tmp)) 
    sigma <- matrix(c(solve(Tau)), nrow = 5, ncol = 5, byrow = TRUE)
    cov_MCMC[i,] <- c(solve(Tau))

我一直遇到错误,因为插补返回 NaN 和 NA 值,但我不知道出了什么问题,因为当我仅使用内部循环来插补数据时,它似乎可以工作。因此,问题似乎是参数更新,但我无法弄清楚!

【问题讨论】:

您能提供您的originalDF 数据框吗?例如粘贴dput(head(originalDF))的输出? 另外,这一行在左大括号之前缺少一个右括号(我自己无法编辑,否则我会):if(length(names(obs) == 0) 不幸的是,原来的帖子已经有 2 年的历史了——所以我怀疑我们能不能给你原始的 df。 用户似乎在过去两年没有在线。但是你作为 bounty.giver 可以提供一个接近你的要求的样本数据集@JamesThomasDurant?否则你也没有可重现的代码,我猜,.. 此代码无法使用,我怀疑其中任何一个都有效。内部循环以for(r in 1:nrow(originalDF)) 开始,将一个整数分配给r。然后尝试使用mis &lt;- r[, is.na(r) &amp; is.nan(r)] 创建缺失子集——回想一下r 包含一个整数,而不是一行。然后是一系列ifelses,条件为"x1" %in% names(is.nan(r)),其中包含三个问题:1)再次,r 是整数而不是行,2)names(is.nan(r)) 返回NULL,3)@ 987654334@ 确实有效,但它返回所有名称,因此条件始终为 TRUE。 【参考方案1】:

序言:

我的感觉是,这里的部分问题是我们没有一个好的示例数据集可以解决。

我的感觉是,我们可以通过创建示例数据集来构建解决方案讨论来解决这个问题。为此,一个有用的包是Wakefield 包,它允许创建模拟数据集。

例如,我们可能会创建一个包含 2000 人的数据集,其中缺少一些年龄、性别、就业状况、教育数据和婚姻状况信息。

插补

核心问题是我们能否从数据集中的其他数据中推断年龄或性别?

例如,如果我们不知道某人的年龄,我们可以根据他们的婚姻状况、就业类型和/或他们的教育水平来推断吗?在非常简单的层面上,我们可以简单地搜索年龄为 NA 的条目,然后查看婚姻状况。如果婚姻状况是“已婚”,那么我们推测我们的数据集是针对美国人的,并查看平均结婚年龄并替换为已婚人士的估计年龄。

我们可以对此进行扩展,并通过考虑更多变量来使我们的估计更加准确。例如,我们可能会同时查看婚姻状况、教育水平和就业状况,以进一步改进我们的年龄估计。如果一个人已婚,拥有博士学位并退休,我们会将年龄推高。如果一个人是单身,一个学生我们把年龄推低。除此之外,我们可以查看数据集中的年龄分布,以估算有关缺失值的数据。

生成示例数据集。

# packages

requiredPackages <- c("wakefield", "dplyr", "BaylorEdPsych", "mice", "MCMCpack")

ipak <- function(pkg) 
  new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
  if (length(new.pkg)) 
    install.packages(new.pkg, dependencies = TRUE)
  
  sapply(pkg, require, character.only = TRUE)


ipak(requiredPackages)

# generate some data for Males with a 8% missing value for
# age

set.seed(10)
f.df <- r_data_frame(n = 1000, age,
                    gender(x = c("M", "F"), prob = c(0, 1), name = "Gender"),
                    employment(x = c("Full Time", "Part Time", "Unemployed", "Retired", "Student"), prob = c(0.6, 0.1, 0.1, 0.1, 0.1), name = "Employment"),
                    education(x = c("No Schooling Completed", "Nursery School to 8th Grade", "9th Grade to 12th Grade, No Diploma",
                                    "Regular High School Diploma", "GED or Alternative Credential", "Some College, Less than 1 Year", "Some College, 1 or More Years, No Degree",
                                    "Associate's Degree", "Bachelor's Degree", "Master's Degree", "Professional School Degree", "Doctorate Degree"),
                                    prob = c(0.013, 0.05, 0.085, 0.246, 0.039, 0.064, 0.15, 0.075, 0.176, 0.072, 0.019, 0.012), name = "Education"),
                    marital(x = c("Married", "Divorced", "Widowed", "Separated", "Never Married"), prob = NULL, name = "Marital")) %>% r_na(cols = 1 - 3, prob = 0.05)

# str(f.df)
summary(f.df)
set.seed(20)

# generate some data for Females with a 5% missing value for
# age

m.df <- r_data_frame(n = 1000, age,
                     gender(x = c("M", "F"), prob = c(1, 0), name = "Gender"),
                     employment(x = c("Full Time", "Part Time", "Unemployed", "Retired", "Student"), prob = c(0.6, 0.1, 0.1, 0.1, 0.1), name = "Employment"),
                     education(x = c("No Schooling Completed", "Nursery School to 8th Grade", "9th Grade to 12th Grade, No Diploma",
                                      "Regular High School Diploma", "GED or Alternative Credential", "Some College, Less than 1 Year", "Some College, 1 or More Years, No Degree",
                                      "Associate's Degree", "Bachelor's Degree", "Master's Degree", "Professional School Degree", "Doctorate Degree"),
                                      prob = c(0.013,0.05, 0.085, 0.246, 0.039, 0.064, 0.15, 0.075, 0.176, 0.072,0.019, 0.012), name = "Education"),
                     marital(x = c("Married", "Divorced", "Widowed", "Separated", "Never Married"), prob = NULL, name = "Marital")) %>% r_na(cols = 1 - 3, prob = 0.03)

summary(m.df)

all.df = rbind.data.frame(m.df, f.df)

summary(all.df)

数据汇总

> summary(all.df)
      Age        Gender        Employment                                      Education            Marital   
 Min.   :18.00   M:1000   Full Time :1142   Regular High School Diploma             :459   Married      :394  
 1st Qu.:35.00   F:1000   Part Time : 207   Bachelor's Degree                       :356   Divorced     :378  
 Median :54.00            Unemployed: 193   Some College, 1 or More Years, No Degree:284   Widowed      :411  
 Mean   :53.76            Retired   : 182   9th Grade to 12th Grade, No Diploma     :156   Separated    :379  
 3rd Qu.:72.00            Student   : 196   Associate's Degree                      :145   Never Married:358  
 Max.   :89.00            NA's      :  80   (Other)                                 :520   NA's         : 80  
 NA's   :80                                 NA's                                    : 80                      
> 

数据是完全随机丢失还是不随机丢失?

# Test for MCAR - Missing at Completely at Random...
test_mcar <- LittleMCAR(all.df)
print(test_mcar$amount.missing)
print(test_mcar$p.value)

控制台输出

> # Test for MCAR - Missing at Completely at Random...
> test_mcar <- LittleMCAR(all.df)
this could take a while> print(test_mcar$amount.missing)
                  Age Gender Employment Education Marital
Number Missing  80.00      0      80.00     80.00   80.00
Percent Missing  0.04      0       0.04      0.04    0.04
> print(test_mcar$p.value)
[1] 0.02661428

数据插补

好的,我们先来看一下缺失值的分布。我们可以运行mice::md.pattern() 函数,以显示缺失值在数据框中其他列上的分布。 md.pattern() 函数输出对于建议哪些变量可能是用于填补缺失值的良好候选者很有用:

> md.pattern(all.df)
     Gender Age Employment Education Marital    
1696      1   1          1         1       1   0
73        1   1          1         1       0   1
73        1   1          1         0       1   1
2         1   1          1         0       0   2
71        1   1          0         1       1   1
3         1   1          0         1       0   2
2         1   1          0         0       1   2
71        1   0          1         1       1   1
2         1   0          1         1       0   2
3         1   0          1         0       1   2
4         1   0          0         1       1   2
          0  80         80        80      80 320

好的,我们现在可以开始估算缺失值了:

imp <- mice(all.df, m = 5, maxit = 50, seed = 1234, printFlag = FALSE)
m=5 参数指定您最终得到变量的五个可能的插补 maxit=50 参数指定算法在收敛到一个解之前最多有 50 次迭代,并且可以向上或向下调整到所需的精度

mice() 函数可能需要一段时间,具体取决于我们指定的迭代次数。在这种情况下,完成后,我们可以使用 head() 函数看到 Age 的一些估算值:

head(imp$imp$Age)
     1  2  3  4  5
7   28 49 37 70 89
33  55 54 52 88 24
56  89 83 68 71 61
84  43 43 24 30 31
96  28 64 89 41 50
120 47 34 36 22 77

要真正完成插补,我们必须运行complete() 函数并将结果分配给一个新的数据帧。此版本的complete() 函数将通过“long”参数收集指定数据帧中的所有插补:

all_imputed_df <- complete(imp, "long", include = TRUE)
table(all_imputed_df$.imp, is.na(all_imputed_df$Age))

控制台:

> all_imputed_df <- complete(imp, "long", include = TRUE)
> table(all_imputed_df$.imp, is.na(all_imputed_df$Age))

    FALSE TRUE
  0  1920   80
  1  2000    0
  2  2000    0
  3  2000    0
  4  2000    0
  5  2000    0

现在我们有一个包含 12000 个年龄值的数据集,涵盖 5 个年龄输入值。

让我们尝试使用插补 #3 进行回归。

首先,提取 impute #3

impute.3 <- subset(all_imputed_df,.imp=='3')  
summary(impute.3)

控制台:

> impute.3 <- subset(all_imputed_df, .imp == "3")
> summary(impute.3)
      .imp        .id              Age        Gender        Employment  
 Min.   :3   Min.   :   1.0   Min.   :18.00   M:1000   Full Time :1192  
 1st Qu.:3   1st Qu.: 500.8   1st Qu.:35.00   F:1000   Part Time : 211  
 Median :3   Median :1000.5   Median :54.00            Unemployed: 202  
 Mean   :3   Mean   :1000.5   Mean   :53.89            Retired   : 191  
 3rd Qu.:3   3rd Qu.:1500.2   3rd Qu.:72.00            Student   : 204  
 Max.   :3   Max.   :2000.0   Max.   :89.00                             

                                    Education            Marital   
 Regular High School Diploma             :478   Married      :416  
 Bachelor's Degree                       :376   Divorced     :390  
 Some College, 1 or More Years, No Degree:295   Widowed      :425  
 9th Grade to 12th Grade, No Diploma     :168   Separated    :393  
 Associate's Degree                      :150   Never Married:376  
 Master's Degree                         :141                      
 (Other)                                 :392 

现在我们可以运行线性回归了:

> lm(Age ~ Education + Gender + Employment + Marital, data = impute.3)

Call:
lm(formula = Age ~ Education + Gender + Employment + Marital, 
    data = impute.3)

Coefficients:
                                      (Intercept)               EducationNursery School to 8th Grade  
                                          51.6733                                             1.4100  
     Education9th Grade to 12th Grade, No Diploma               EducationRegular High School Diploma  
                                           1.3675                                             0.7611  
           EducationGED or Alternative Credential            EducationSome College, Less than 1 Year  
                                           1.0365                                            -2.6069  
EducationSome College, 1 or More Years, No Degree                        EducationAssociate's Degree  
                                           0.3563                                             0.9506  
                       EducationBachelor's Degree                           EducationMaster's Degree  
                                           1.2505                                            -1.6372  
              EducationProfessional School Degree                          EducationDoctorate Degree  
                                           1.1774                                             0.4936  
                                          GenderF                                EmploymentPart Time  
                                          -0.3190                                             1.1316  
                             EmploymentUnemployed                                  EmploymentRetired  
                                           3.1622                                            -0.6855  
                                EmploymentStudent                                    MaritalDivorced  
                                           3.0850                                             0.2934  
                                   MaritalWidowed                                   MaritalSeparated  
                                           2.3162                                             1.6833  
                             MaritalNever Married  
                                           1.6169  

MCMCRegress

library(MCMCpack) # b0 = prior mean, B0 = prior precision = 1/variance
fitBayes <- MCMCregress(Age ~ Education + Gender + Employment + Marital, data = impute.3, mcmc = 10000, seed = 1234, b0 = 0, B0 = 0.01, drop.unused.levels = TRUE)
summary(fitBayes)

控制台输出

> fitBayes <- MCMCregress(Age ~ Education + Gender + Employment + Marital, data = impute.3, mcmc = 10000, seed = 1234, b0 = 0, B0 = 0.01, drop.unused.levels = TRUE)
> summary(fitBayes)

Iterations = 1001:11000
Thinning interval = 1 
Number of chains = 1 
Sample size per chain = 10000 

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

                                                       Mean      SD Naive SE Time-series SE
(Intercept)                                        48.67377  2.5337 0.025337       0.025337
EducationNursery School to 8th Grade                3.77088  3.0514 0.030514       0.030514
Education9th Grade to 12th Grade, No Diploma        3.81009  2.7794 0.027794       0.027794
EducationRegular High School Diploma                3.24531  2.4933 0.024933       0.025412
EducationGED or Alternative Credential              3.38733  3.2155 0.032155       0.032155
EducationSome College, Less than 1 Year            -0.08419  2.9104 0.029104       0.029577
EducationSome College, 1 or More Years, No Degree   2.82889  2.6092 0.026092       0.026092
EducationAssociate's Degree                         3.32932  2.8410 0.028410       0.028410
EducationBachelor's Degree                          3.72272  2.5228 0.025228       0.025659
EducationMaster's Degree                            0.87738  2.8611 0.028611       0.028611
EducationProfessional School Degree                 3.27542  4.0199 0.040199       0.040199
EducationDoctorate Degree                           2.43794  4.5996 0.045996       0.045996
GenderF                                            -0.11321  0.9327 0.009327       0.009327
EmploymentPart Time                                 1.25556  1.5756 0.015756       0.016170
EmploymentUnemployed                                3.27395  1.6213 0.016213       0.015708
EmploymentRetired                                  -0.52614  1.6394 0.016394       0.016394
EmploymentStudent                                   3.17027  1.6058 0.016058       0.016889
MaritalDivorced                                     0.72379  1.4715 0.014715       0.014715
MaritalWidowed                                      2.73130  1.4394 0.014394       0.014706
MaritalSeparated                                    2.10423  1.4608 0.014608       0.014608
MaritalNever Married                                2.00781  1.4960 0.014960       0.014960
sigma2                                            448.01488 14.0715 0.140715       0.140715

2. Quantiles for each variable:

                                                       2.5%      25%      50%      75%   97.5%
(Intercept)                                        43.75477  46.9556  48.6619  50.3967  53.609
EducationNursery School to 8th Grade               -2.19290   1.7079   3.7701   5.8216   9.718
Education9th Grade to 12th Grade, No Diploma       -1.59323   1.9586   3.8326   5.6676   9.349
EducationRegular High School Diploma               -1.61001   1.5641   3.2474   4.9296   8.155
EducationGED or Alternative Credential             -2.88523   1.2095   3.4173   5.5405   9.691
EducationSome College, Less than 1 Year            -5.75364  -2.0617  -0.1009   1.8986   5.614
EducationSome College, 1 or More Years, No Degree  -2.28754   1.0853   2.8608   4.5718   7.895
EducationAssociate's Degree                        -2.27611   1.4311   3.3285   5.2330   8.978
EducationBachelor's Degree                         -1.21780   2.0258   3.7275   5.4203   8.655
EducationMaster's Degree                           -4.61270  -1.0872   0.8601   2.8484   6.456
EducationProfessional School Degree                -4.63027   0.5900   3.2767   5.9475  11.059
EducationDoctorate Degree                          -6.47767  -0.6371   2.4553   5.4188  11.705
GenderF                                            -1.95673  -0.7298  -0.1067   0.4903   1.727
EmploymentPart Time                                -1.82784   0.1849   1.2597   2.3160   4.354
EmploymentUnemployed                                0.09335   2.1988   3.2674   4.3557   6.433
EmploymentRetired                                  -3.80162  -1.6316  -0.5147   0.5953   2.706
EmploymentStudent                                   0.03387   2.0713   3.1502   4.2227   6.342
MaritalDivorced                                    -2.15073  -0.2732   0.7249   1.7266   3.602
MaritalWidowed                                     -0.13488   1.7817   2.7367   3.6961   5.567
MaritalSeparated                                   -0.76396   1.1177   2.1118   3.0700   5.001
MaritalNever Married                               -0.92230   0.9950   1.9976   3.0248   4.898
sigma2                                            420.98019 438.4621 447.7222 457.2730 476.481

希望上述观察指向正确的方向。

引文:

R 包:Mice - Multivariate Imputation by Chained Equations Reference Manual 作者:Stef van Buuren Flexible Imputation of Missing Data 作者:Stef van Buuren(在线书籍) Practical Predictive Analytics 作者:拉尔夫·温特斯 Simulation for Data Science with R 作者:Matthias Templ Bayesian Data Analysis, Third Edition,第 3 版作者:Andrew Gelman;约翰·B·卡林;哈尔·S·斯特恩;大卫·B·邓森;阿基维塔里;唐纳德 B. 鲁宾

【讨论】:

非常有帮助,谢谢!在数据也被审查的情况下使用插补怎么样?例如。我只是我的一些数据低于给定限制,或高于给定限制,或在两个值之间? 感谢您的帮助!使用鼠标或 MCMCreg 来输入审查数据(不丢失)怎么样? James - 不用担心,乐于助人。鼠标可用于从其他变量中获取隐藏/审查数据,这是一个非常强大的软件包。该领域的大部分工作源自 Fritz Scheurens 的工作,旨在解决当前人口调查 (CPS) 的 3 月收入补充资料中缺少收入数据的实际问题。 Stef van Buuren 在他的在线书籍“缺失数据的灵活插补”中涵盖了故意缺失数据的主题和历史 - 请参阅帖子中的链接。请参阅第 4 节多元缺失数据和第 6.3 节派生变量。 这里需要注意的一点是您需要验证您的数据关系。我没有你的数据集。这样想; X、Y 和 Z 之间是否存在关系?小鼠将允许您评估您的数据预测因子,从而评估您正确导出缺失数据的能力。请参阅第 6.3.2 节关于该主题的内容。我为什么要提?有时您会遇到无法通过设计准确导出缺失值的数据集。我的目标是尝试在示例的上下文中向您指出推荐阅读。 ¯_(ツ)_/¯ @JamesThomasDurant 我解决了你的问题吗?即,您应该能够将经过审查的数据归咎于测试,以证明您的数据集中的数据存在关系。

以上是关于R中缺失和删失数据的多重插补的主要内容,如果未能解决你的问题,请参考以下文章

多重插补为啥要汇总分析

拓端tecdat|R语言编程指导用线性模型进行臭氧预测: 加权泊松回归,普通最小二乘,加权负二项式模型,多重插补缺失值

Matlab:对缺失数据的多重插补

绘制多重插补结果

单元无回答的缺失数据处理方法

如何使用 R 中基于面板数据的客户 ID 的所有列的中值插补填充缺失值?