Multiple imputation is known to be advantageous compared to single imputation. However, in practice there are often non-statistical reasons why multiple imputation can not be used (e.g. the data recipient isn't educated enough to deal with multiple data sets; the boss of a company doesn't want it; ...).
Predictive mean matching (PMM) is a bayesian imputation method that is known to be one of the best imputation methods, when multiple imputation can be applied. Now, I am wondering whether PMM could also be used as single imputation method, when multiple imputation is not possible.
Consider the following example:
- You have a typical data set including a continous variable (e.g. income / age) with missing values.
- The data can be imputed only once (single imputation).
- You have to choose between typical single imputation methods (e.g. stochastic regression imputation / hot deck imputation) OR predictive mean matching.
Illustration in R:
# Create some example data
set.seed(95159) # Seed
N <- 20000 # Sample size
y <- rnorm(N) # Target variable
x1 <- rnorm(N) + 0.1 * y # Some auxiliary variables
x2 <- rnorm(N) - 0.05 * y
x3 <- rnorm(N) + 0.2 * x1 + 0.3 * x2
y[rbinom(N, 1, 0.1) == 1] <- NA # 10% missings in y
data <- data.frame(y, x1, x2, x3) # Create data set
# Impute data via stochastic regression imputation
# and predictive mean matching
library("mice") # Load mice package
imp_sri <- mice(data, m = 1, method = "norm.nob")
imp_pmm <- mice(data, m = 1, method = "pmm")
data_sri <- complete(imp_sri)
data_pmm <- complete(imp_pmm)
# Compare results
plot(density(data$y, na.rm = TRUE), xlab = "y",
main = "Comparison Observed & Imputed)
points(density(data_sri$y[is.na(data$y)]), typ = "l", col = 2)
points(density(data_pmm$y[is.na(data$y)]), typ = "l", col = 3)
legend("topleft",
c("Observed", "Stochastic Regression", "Predictive Mean Matching"),
col = c("black", "red", "green"), lty = 1)
The simplified example doesn't reveal a substantial difference between stochastic regression imputation and single predictive mean matching. However, in more complex data scenarios (e.g. heteroscedastic data), predictive mean matching is usually advantageous.
Question: If multiple imputation can not be used, is there any reason, why single predictive mean matching should not be used instead of typical single imputation methods such as stochastic regression imputation?