Multiple Imputation

Introduction

Missing data is a massive topic. I’m only going to scratch the surface here, focusing on the computational issues surrounding multiple imputation. For further information, I highly recommend Rod Little’s Statistical Analysis With Missing Data course in the Biostats department.

Types of missing data

Missing data can be classified into one of three categories:

MCAR

Data which are Missing Completely At Random have nothing systematic about which observations are missing. There is no relationship between missingness and either observed or unobserved covariates.

MAR

Missing At Random is weaker than MCAR. The missingness is still random, but due entirely to observed variables. For example, those of lower socioeconomic status may be less willing to provide salary information (but we observe their SES). The key is that the missingness is not due to the values which are unobserved. MCAR implies MAR, but not vice-versa.

MNAR

If the data are Missing Not At Random, then the missingness depends on the values of the missing data themselves. Censored data falls into this category. For example, individuals who are heavier may be less likely to report their weight. As another example, suppose a device measuring some response can only record values above .5; anything below that is missing.

Determining the type of missingness

Unfortunately, there is no statistical way to determine which category your data fall into. MCAR and MAR are assumptions that need to be made based upon your knowledge of the data and its collection mechanisms.

There is nothing in the data that can inform MNAR: if the missingness is due to the unobserved values themselves, the observed data cannot tell us so. MNAR should, however, be the default assumption.

To gain some insight, you can fit a logistic model predicting missingness based upon the covariates (a minimal sketch follows the list below). If you find some relationships, that implies that MAR is more realistic than MCAR. However, it does not guarantee that we don’t have MNAR instead! Similarly, if we find no relationships, it may be because

  • The missingness is MCAR.
  • The missingness is MNAR.
  • The missingness is really MAR, but you haven’t modeled it appropriately.
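
To get a sense of what such a model looks like, here is a minimal sketch using the nhanes data from the mice package (introduced below); age is its only fully observed variable, so it serves as the predictor.

library(mice)
data(nhanes)
# Model the missingness indicator of bmi as a function of age.
miss.mod <- glm(is.na(bmi) ~ age, data = nhanes, family = binomial)
summary(miss.mod)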

Methods to handle missingness

If the data is MCAR:

  • Complete case analysis is valid.
  • Multiple imputation or any other imputation method is valid.

If the data is MAR:

  • Some complete case analyses are valid under weaker assumptions than MCAR.
    • E.g. linear regression is unbiased if missingness is independent of the response, conditional on the predictors.
  • Multiple imputation is valid (it is biased, but the bias is negligible).

If the data is MNAR:

  • You must model the missingness explicitly, jointly modeling the response and the missingness mechanism.
  • In some specific cases (e.g. survival analysis), MNAR data (e.g. censored data) is handled appropriately.
  • Generally, we assume MAR whenever possible just to avoid this situation.

Multiple Imputation

The general procedure for Multiple Imputation is straightforward:

  1. Impute the missing values with values randomly drawn from some distributions to generate \(m\) complete data sets.
  2. Perform the same analysis on each of the \(m\) data sets.
  3. Pool the results in some fashion.

For the first step, there are variations. One of the simplest to understand involves a series of regression models predicting each variable with missingness using all other variables (those with missingness and those without). Based upon each model, we can obtain a predictive distribution for each missing value and sample from it.
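
As a minimal sketch of a single such regression-based draw (illustrative only; mice handles this internally, and for simplicity this ignores the uncertainty in the regression parameters), using the nhanes data loaded earlier:

# Fit a model for bmi using rows where both bmi and chl are observed.
cc <- complete.cases(nhanes[, c("bmi", "chl")])
fit <- lm(bmi ~ chl, data = nhanes[cc, ])
# Sample imputations for rows with missing bmi but observed chl from the
# predictive distribution implied by the model.
miss <- is.na(nhanes$bmi) & !is.na(nhanes$chl)
draws <- rnorm(sum(miss),
               mean = predict(fit, newdata = nhanes[miss, ]),
               sd = summary(fit)$sigma)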

The exact method used in step 3 depends on the analysis performed in step 2. Generally, the pooled point estimate is simply the average of the estimates from each of the \(m\) data sets; the standard error is where any complications may arise.
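
For reference, Rubin’s rules give the pooled variance of an estimate as \(T = \bar{U} + \left(1 + \frac{1}{m}\right)B\), where \(\bar{U}\) is the average of the within-imputation variances and \(B\) is the between-imputation variance of the \(m\) point estimates; the pooled standard error is \(\sqrt{T}\).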

mice in R

In R, we can use the mice package (Multivariate Imputation by Chained Equations) to perform multiple imputation and the subsequent analysis. The three steps above are performed via the mice, with, and pool functions, respectively.

mice

The mice function performs the imputation via chained equations. Without going too heavily into the details, chained equations are a variation of a Gibbs Sampler (an MCMC approach) that iterates between drawing estimates of the missing values and estimates of the parameters of each variable’s distribution (both conditional on the other variables). Chained equations converge quickly compared to most MCMC approaches due to the conditional independence of \(Y_t\) and \(Y_{t-1}\).
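
As a toy illustration of that alternating structure (a hand-rolled sketch, not mice’s internals; for simplicity it skips the draw of the regression parameters from their posterior that a proper implementation would include):

# Two correlated variables, each with some values set to missing.
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
x[sample(100, 10)] <- NA
y[sample(100, 10)] <- NA
# Initialize missing entries with the observed means.
x.fill <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)
y.fill <- ifelse(is.na(y), mean(y, na.rm = TRUE), y)
# Alternate: regress each variable on the other, then redraw its missing
# entries from the predictive distribution.
for (iter in 1:5) {
  fx <- lm(x.fill ~ y.fill)
  x.fill[is.na(x)] <- rnorm(sum(is.na(x)),
                            mean = predict(fx)[is.na(x)],
                            sd = summary(fx)$sigma)
  fy <- lm(y.fill ~ x.fill)
  y.fill[is.na(y)] <- rnorm(sum(is.na(y)),
                            mean = predict(fy)[is.na(y)],
                            sd = summary(fy)$sigma)
}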

The default number of imputations, controlled by the m argument, is 5. This comes from the original work back in the 1980s and 1990s, based upon a closed form estimate of the “efficiency” of the estimate: with 25% missing data and 5 imputations, efficiency is over 95%.
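
Concretely, the relative efficiency of using \(m\) imputations rather than infinitely many is \((1 + \gamma/m)^{-1}\), where \(\gamma\) is the fraction of missing information; with \(\gamma = .25\) and \(m = 5\), this is \(1/1.05 \approx .952\).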

However, some modern work (see for example here) makes the argument that we actually do gain more by performing more imputations, and the computational demands that were an issue when the original papers were published are obviously no longer a concern. These authors suggest using a larger number like m = 100.

In practice, the current standard remains (in all the fields I’m acquainted with) m = 5.

The other major argument to be concerned with, method, controls how each imputed value is computed. While all use a chained equation approach, the details differ greatly. Without diving into the details too much, it suffices to say there are quite a few, which are documented well in the original mice paper. For our purposes, we’ll stick with the default, pmm (predictive mean matching), which has the benefits that it is semi-parametric in a sense (non-linear relationships are preserved even if the imputation model is misspecified) and imputed values are restricted to observed values, so we cannot impute “impossible” values.

library(mice)
data(nhanes)
nhanes.imp <- mice(nhanes, m = 5, method = 'pmm')
##
##  iter imp variable
##   1   1  bmi  hyp  chl
##   1   2  bmi  hyp  chl
##   1   3  bmi  hyp  chl
##   1   4  bmi  hyp  chl
##   1   5  bmi  hyp  chl
##   2   1  bmi  hyp  chl
##   2   2  bmi  hyp  chl
##   2   3  bmi  hyp  chl
##   2   4  bmi  hyp  chl
##   2   5  bmi  hyp  chl
##   3   1  bmi  hyp  chl
##   3   2  bmi  hyp  chl
##   3   3  bmi  hyp  chl
##   3   4  bmi  hyp  chl
##   3   5  bmi  hyp  chl
##   4   1  bmi  hyp  chl
##   4   2  bmi  hyp  chl
##   4   3  bmi  hyp  chl
##   4   4  bmi  hyp  chl
##   4   5  bmi  hyp  chl
##   5   1  bmi  hyp  chl
##   5   2  bmi  hyp  chl
##   5   3  bmi  hyp  chl
##   5   4  bmi  hyp  chl
##   5   5  bmi  hyp  chl

We can examine the imputations using the following:

# Show imputation 3
head(complete(nhanes.imp, 3))
##   age  bmi hyp chl
## 1   1 33.2   2 229
## 2   2 22.7   1 187
## 3   1 28.7   1 187
## 4   3 25.5   2 184
## 5   1 20.4   1 113
## 6   3 22.5   1 184

# Show just the imputed values. Columns are imputations, rows are observations
nhanes.imp$imp
## $age
## NULL
##
## $bmi
##       1    2    3    4    5
## 1  27.4 28.7 33.2 29.6 27.4
## 3  29.6 30.1 28.7 29.6 25.5
## 4  30.1 22.7 25.5 20.4 20.4
## 6  25.5 21.7 22.5 22.5 27.4
## 10 27.4 30.1 26.3 27.4 28.7
## 11 30.1 30.1 27.5 22.5 30.1
## 12 28.7 22.7 22.5 28.7 27.4
## 16 29.6 33.2 35.3 35.3 35.3
## 21 22.0 28.7 22.0 27.2 33.2
##
## $hyp
##    1 2 3 4 5
## 1  1 1 2 1 2
## 4  1 2 2 2 2
## 6  2 2 1 1 2
## 10 2 1 2 1 2
## 11 1 1 1 1 1
## 12 1 2 1 1 1
## 16 1 1 1 1 2
## 21 1 1 2 1 1
##
## $chl
##      1   2   3   4   5
## 1  187 187 229 187 131
## 4  186 184 184 199 284
## 10 204 184 229 199 229
## 11 204 187 229 187 199
## 12 229 238 229 218 218
## 15 229 187 229 187 199
## 16 187 187 218 204 131
## 20 206 186 186 218 184
## 21 238 187 113 131 238
## 24 186 218 206 218 218

# Extract the "long" data set which stacks the imputations
nhanes.comp <- complete(nhanes.imp, "long", include = TRUE)
# Now, the .imp variable identifies which imputation each row belongs to.
# 0 (included because of `include = TRUE` above) is the un-imputed data.
table(nhanes.comp$.imp)
##
##  0  1  2  3  4  5
## 25 25 25 25 25 25

# Let's visualize the distribution of imputed and observed values.
# `cci` returns a logical vector indicating whether each observation of its
#  input is complete (TRUE = observed, FALSE = originally missing).
nhanes.comp$bmi.NA <- cci(nhanes$bmi)
# Note that the above uses the recycling properties of data.frames:
#  the `cci` call returns a vector of length 25, but because values are
#  recycled to the total number of rows in nhanes.comp, it replicates 6 times.
head(nhanes.comp[, c("bmi", "bmi.NA")])
##    bmi bmi.NA
## 1   NA  FALSE
## 2 22.7   TRUE
## 3   NA  FALSE
## 4   NA  FALSE
## 5 20.4   TRUE
## 6   NA  FALSE
library(ggplot2)
ggplot(nhanes.comp,
       aes(x = .imp, y = bmi, color = bmi.NA)) + 
  geom_jitter(show.legend = FALSE,
              width = .1)
## Warning: Removed 9 rows containing missing values (geom_point).

Because we used the default method (pmm), we’re guaranteed not to have any imputed values outside the range of the observed data. That isn’t the case for other methods, so this is a nice check.

with

with is a general purpose function in R that is roughly equivalent to calling attach, running code, then calling detach. However, when we pass a mids object (which is the output of a mice() call) to with, it dispatches to a special mice version (with.mids) which properly handles the imputations.

with can take any valid R command as its second argument, and applies it to each imputed data set. Some examples:

with(nhanes.imp, mean(bmi))
## call :
## with.mids(data = nhanes.imp, expr = mean(bmi))
##
## call1 :
## mice(data = nhanes, m = 5, method = "pmm")
##
## nmis :
## age bmi hyp chl
##   0   9   8  10
##
## analyses :
## [[1]]
## [1] 27.016
##
## [[2]]
## [1] 26.92
##
## [[3]]
## [1] 26.74
##
## [[4]]
## [1] 26.728
##
## [[5]]
## [1] 27.216
with(nhanes.imp, t.test(bmi ~ hyp))
## call :
## with.mids(data = nhanes.imp, expr = t.test(bmi ~ hyp))
##
## call1 :
## mice(data = nhanes, m = 5, method = "pmm")
##
## nmis :
## age bmi hyp chl
##   0   9   8  10
##
## analyses :
## [[1]]
##
##  Welch Two Sample t-test
##
## data:  bmi by hyp
## t = 0.29857, df = 22.998, p-value = 0.7679
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.944979  2.601119
## sample estimates:
## mean in group 1 mean in group 2
##        27.09474        26.76667
##
##
## [[2]]
##
##  Welch Two Sample t-test
##
## data:  bmi by hyp
## t = 1.8799, df = 18.356, p-value = 0.07609
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.314041  5.726739
## sample estimates:
## mean in group 1 mean in group 2
##        27.67778        24.97143
##
##
## [[3]]
##
##  Welch Two Sample t-test
##
## data:  bmi by hyp
## t = -0.087957, df = 20.034, p-value = 0.9308
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.543446  3.256681
## sample estimates:
## mean in group 1 mean in group 2
##        26.69412        26.83750
##
##
## [[4]]
##
##  Welch Two Sample t-test
##
## data:  bmi by hyp
## t = 0.7981, df = 8.7129, p-value = 0.446
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.560464  5.330464
## sample estimates:
## mean in group 1 mean in group 2
##          27.005          25.620
##
##
## [[5]]
##
##  Welch Two Sample t-test
##
## data:  bmi by hyp
## t = -0.19701, df = 19.212, p-value = 0.8459
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.944520  3.265354
## sample estimates:
## mean in group 1 mean in group 2
##        27.09375        27.43333
with(nhanes.imp, lm(bmi ~ age + chl))
## call :
## with.mids(data = nhanes.imp, expr = lm(bmi ~ age + chl))
##
## call1 :
## mice(data = nhanes, m = 5, method = "pmm")
##
## nmis :
## age bmi hyp chl
##   0   9   8  10
##
## analyses :
## [[1]]
##
## Call:
## lm(formula = bmi ~ age + chl)
##
## Coefficients:
## (Intercept)          age          chl
##    22.46957     -1.70340      0.03828
##
##
## [[2]]
##
## Call:
## lm(formula = bmi ~ age + chl)
##
## Coefficients:
## (Intercept)          age          chl
##    22.31127     -3.71571      0.05787
##
##
## [[3]]
##
## Call:
## lm(formula = bmi ~ age + chl)
##
## Coefficients:
## (Intercept)          age          chl
##    20.35752     -3.10569      0.06017
##
##
## [[4]]
##
## Call:
## lm(formula = bmi ~ age + chl)
##
## Coefficients:
## (Intercept)          age          chl
##    19.41133     -4.21356      0.07643
##
##
## [[5]]
##
## Call:
## lm(formula = bmi ~ age + chl)
##
## Coefficients:
## (Intercept)          age          chl
##    26.21057     -3.37081      0.03538

As an alternative, you could iterate over .imp inside nhanes.comp, performing the analysis on each subset (a sketch follows). Generally I wouldn’t recommend this over with, but it is an option if with cannot accommodate your analysis.
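
A minimal sketch of this manual approach, using the long-format nhanes.comp created above (recall that imputation 0 is the original, un-imputed data, so we skip it):

# Fit the same model separately on each of the 5 imputed data sets.
manual.fits <- lapply(1:5, function(i) {
  lm(bmi ~ age + chl, data = subset(nhanes.comp, .imp == i))
})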

pool

Finally, for any method which produces an object having both coef() and vcov() methods (run methods(coef) or methods(vcov) to see the lists), such as lm, aov or lme (from nlme), the pool function can combine the individual coefficient estimates into one “pooled” estimate, based upon Rubin’s rules for pooling. As mentioned before, the point estimates are the simple average, but the standard errors are more complicated.

Also useful is pool.r.squared for obtaining pooled \(R^2\) estimates of linear models.

mod1 <- with(nhanes.imp, lm(bmi ~ age + chl))
pool(mod1)
## Call: pool(object = mod1)
##
## Pooled coefficients:
## (Intercept)         age         chl
## 22.15205451 -3.22183316  0.05362695
##
## Fraction of information about the coefficients missing due to nonresponse:
## (Intercept)         age         chl
##   0.5294554   0.6847604   0.6128896
round(summary(pool(mod1)), 3)
##                est    se      t    df Pr(>|t|)  lo 95  hi 95 nmis   fmi
## (Intercept) 22.152 4.411  5.022 7.706    0.001 11.912 32.392   NA 0.529
## age         -3.222 1.360 -2.370 4.969    0.064 -6.723  0.280    0 0.685
## chl          0.054 0.026  2.051 6.126    0.085 -0.010  0.117   10 0.613
##             lambda
## (Intercept)  0.421
## age          0.579
## chl          0.504
pool.r.squared(mod1)
##           est      lo 95     hi 95       fmi
## R^2 0.4260309 0.03148215 0.7763445 0.5697666

Conclusion

I included this topic because I often run into analysts who are flummoxed when handling the multiply imputed data sets (nhanes.comp in this example) if the cookbook methods have issues. I hope that with this basic understanding of the computational methodology, you will be able to overcome such obstacles.

Finally, Stata’s multiple imputation methodology supports a wider range of pooled estimates, so if multiple imputation is a cornerstone of your analysis, you may want to check it out.

Josh Errickson