13

I would like to use imputation for replacing missing values in my data set under certain constraints.

For example, I'd like the imputed variable x1 to be greater than or equal to the sum of my two other variables, say x2 and x3. I also want x3 to be imputed as either 0 or a value >= 14, and x2 as either 0 or a value >= 16.

I tried defining these constraints in SPSS's multiple imputation, but SPSS only lets me define minimum and maximum values. Is there any way to define further constraints in SPSS, or do you know of any R package that would let me define such constraints for the imputation of missing values?
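
In other words, each row should satisfy a check like this (my restatement of the rules above):

   # TRUE if a single row meets all three constraints
   row_ok <- function(x1, x2, x3) {
     x1 >= x2 + x3 &&             # x1 at least the sum of x2 and x3
       (x2 == 0 || x2 >= 16) &&   # x2 either 0 or at least 16
       (x3 == 0 || x3 >= 14)      # x3 either 0 or at least 14
   }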

My data is as follows:

   x1 = c(21, 50, 31, 15, 36, 82, 14, 14, 19, 18, 16, 36, 583, NA, NA, NA, 50, 52, 26, 24)
   x2 = c(0, NA, 18, 0, 19, 0, NA, 0, 0, 0, 0, 0, 0, NA, NA, NA, 22, NA, 0, 0)
   x3 = c(0, 0, 0, 0, 0, 54, 0, 0, 0, 0, 0, 0, 0, NA, NA, NA, NA, 0, 0, 0)
   dat = data.frame(x1 = x1, x2 = x2, x3 = x3)
   > dat
       x1 x2 x3
   1   21  0  0
   2   50 NA  0
   3   31 18  0
   4   15  0  0
   5   36 19  0
   6   82  0 54
   7   14 NA  0
   8   14  0  0
   9   19  0  0
   10  18  0  0
   11  16  0  0
   12  36  0  0
   13 583  0  0
   14  NA NA NA
   15  NA NA NA
   16  NA NA NA
   17  50 22 NA
   18  52 NA  0
   19  26  0  0
   20  24  0  0
t0x1n
rose
  • I changed `0 or 16 or >= 16` to `0 or >= 16` since `>=16` includes the value `16`. Hope that did not mess your meaning up. Same for `0 or 14 or >= 14` – Alexis Nov 22 '14 at 20:50

5 Answers

16

One solution is to write your own custom imputation functions for the mice package. The package is prepared for this, and the setup is surprisingly pain-free.

First we set up the data as suggested:

dat=data.frame(x1=c(21, 50, 31, 15, 36, 82, 14, 14, 19, 18, 16, 36, 583, NA,NA,NA, 50, 52, 26, 24), 
               x2=c(0, NA, 18,0, 19, 0, NA, 0, 0, 0, 0, 0, 0,NA,NA, NA, 22, NA, 0, 0), 
               x3=c(0, 0, 0, 0, 0, 54, 0 ,0, 0, 0, 0, 0, 0, NA, NA, NA, NA, 0, 0, 0))

Next we load the mice package and see which methods it chooses by default:

library(mice)
# Do a dry run that only sets up defaults, without imputing
imp_base <- mice(dat, maxit = 0)

# Find the methods that mice chooses
imp_base$method
# Returns: "pmm" "pmm" "pmm"

# Look at the predictor matrix
imp_base$predictorMatrix
# Returns:
#   x1 x2 x3
#x1  0  1  1
#x2  1  0  1
#x3  1  1  0

The pmm stands for predictive mean matching, probably the most popular imputation algorithm for imputing continuous variables. It calculates a predicted value for each case using a regression model, picks the 5 observed cases whose predictions are closest to the prediction for the missing case (smallest absolute difference between predictions), and draws the final value at random from these. The chosen cases are called the donor pool.
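
To make this concrete, here is a rough sketch of the matching step (an illustration of the idea, not mice's actual internals):

pmm_sketch <- function(y, ry, x, donors = 5) {
  # Regress the observed y on the predictors and predict for all rows
  df   <- data.frame(y = y, x)
  fit  <- lm(y ~ ., data = df[ry, ])
  pred <- predict(fit, newdata = df)
  # For each missing y, pick the `donors` observed cases with the
  # closest predictions and draw one of their y values at random
  vapply(pred[!ry], function(p) {
    pool <- order(abs(pred[ry] - p))[seq_len(donors)]
    sample(y[ry][pool], 1)
  }, numeric(1))
}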

From the predictor matrix we can see that each method is passed the variables that matter for the restrictions. Note that the row is the target variable and the columns are the predictors. If x1 did not have a 1 in the x3 column we would have to add it to the matrix: imp_base$predictorMatrix["x1","x3"] <- 1

Now to the fun part: generating the imputation methods. I've chosen a rather crude approach here where I discard all imputed values if they don't meet the criteria. This may result in long loop times; it would potentially be more efficient to keep the valid imputations and only redo the remaining ones, but that requires a little more tweaking.

# Generate our custom methods
mice.impute.pmm_x1 <- 
  function (y, ry, x, donors = 5, type = 1, ridge = 1e-05, version = "", 
            ...) 
  {
    # A crude, global bound for imputed x1 values: the maximum
    # observed x2 plus the maximum observed x3
    max_sum <- sum(max(x[, "x2"], na.rm = TRUE),
                   max(x[, "x3"], na.rm = TRUE))
    repeat{
      # Pass the arguments through instead of hard-coding them
      vals <- mice.impute.pmm(y, ry, x, donors = donors, type = type,
                              ridge = ridge, version = version, ...)
      if (all(vals < max_sum)){
        break
      }
    }
    return(vals)
  }

mice.impute.pmm_x2 <- 
  function (y, ry, x, donors = 5, type = 1, ridge = 1e-05, version = "", 
            ...) 
  {
    repeat{
      vals <- mice.impute.pmm(y, ry, x, donors = donors, type = type,
                              ridge = ridge, version = version, ...)
      # x2 must be either 0 or at least 16
      if (all(vals == 0 | vals >= 16)){
        break
      }
    }
    return(vals)
  }

mice.impute.pmm_x3 <- 
  function (y, ry, x, donors = 5, type = 1, ridge = 1e-05, version = "", 
            ...) 
  {
    repeat{
      vals <- mice.impute.pmm(y, ry, x, donors = donors, type = type,
                              ridge = ridge, version = version, ...)
      # x3 must be either 0 or at least 14
      if (all(vals == 0 | vals >= 14)){
        break
      }
    }
    return(vals)
  }

Once we are done defining the methods we simply change the previous ones. If you only want to change a single variable you can use imp_base$method["x2"] <- "pmm_x2", but for this example we will change all of them (the naming is not necessary):

imp_base$method <- c(x1 = "pmm_x1", x2 = "pmm_x2", x3 = "pmm_x3")

# The predictor matrix is not really necessary for this example
# but I use it just to illustrate in case you would like to 
# modify it
imp_ds <- 
  mice(dat, 
       method = imp_base$method, 
       predictorMatrix = imp_base$predictorMatrix)

Now let's have a look at the third imputed dataset:

> complete(imp_ds, action = 3)
    x1 x2 x3
1   21  0  0
2   50 19  0
3   31 18  0
4   15  0  0
5   36 19  0
6   82  0 54
7   14  0  0
8   14  0  0
9   19  0  0
10  18  0  0
11  16  0  0
12  36  0  0
13 583  0  0
14  50 22  0
15  52 19  0
16  14  0  0
17  50 22  0
18  52  0  0
19  26  0  0
20  24  0  0

Ok, that does the job. I like this solution as you can piggyback on top of mainstream functions and just add the restrictions that you find meaningful.

Update

In order to enforce the stricter constraints that @t0x1n mentioned in the comments, we may want to add the following abilities to the wrapper function:

  1. Save valid values during the loops so that data from previous, partially successful runs is not discarded
  2. An escape mechanism in order to avoid infinite loops
  3. Inflate the donor pool after trying x times without finding a suitable match (this primarily applies to pmm)

This results in a slightly more complicated wrapper function:

mice.impute.pmm_x1_adv <-   function (y, ry, 
                                      x, donors = 5, 
                                      type = 1, ridge = 1e-05, 
                                      version = "", ...) {
  # The mice:::remove.lindep may remove the parts required for
  # the test - in those cases we should escape the test
  if (!all(c("x2", "x3") %in% colnames(x))){
    warning("Could not enforce pmm_x1 due to missing column(s):",
            c("x2", "x3")[!c("x2", "x3") %in% colnames(x)])
    return(mice.impute.pmm(y, ry, x, donors = 5, type = 1, ridge = 1e-05,
                           version = "", ...))
  }

  # The row-wise bound (x2 + x3) for the rows that are to be imputed
  max_vals <- rowSums(x[!ry, c("x2", "x3")])

  # We will keep saving the valid values in the valid_vals
  valid_vals <- rep(NA, length.out = sum(!ry))
  # We need a counter in order to avoid an eternal loop
  # and for inflating the donor pool if no match is found
  cntr <- 0
  repeat{
    # We should be prepared to increase the donor pool, otherwise
    # the criteria may become impossible to satisfy
    donor_inflation <- floor(cntr/10)
    vals <- mice.impute.pmm(y, ry, x, 
                            donors = min(5 + donor_inflation, sum(ry)), 
                            type = 1, ridge = 1e-05,
                            version = "", ...)

    # Our criteria check
    correct <- vals < max_vals
    if (all(!is.na(valid_vals) |
              correct)){
      valid_vals[correct] <-
        vals[correct]
      break
    }else if (any(is.na(valid_vals) &
                    correct)){
      # Save the new valid values
      valid_vals[correct] <-
        vals[correct]
    }

    # An emergency exit to avoid endless loop
    cntr <- cntr + 1
    if (cntr > 200){
      warning("Could not completely enforce constraints for ",
              sum(is.na(valid_vals)),
              " out of ",
              length(valid_vals),
              " missing elements")
      if (all(is.na(valid_vals))){
        valid_vals <- vals
      }else{
        valid_vals[is.na(valid_vals)] <- 
          vals[is.na(valid_vals)]
      }
      break
    }
  }
  return(valid_vals)
}

Note that this does not perform that well, most likely because the suggested data set fails the constraints for all complete cases. I need to increase the loop limit to 400-500 before it even starts to behave. I assume this is unintentional; your imputation should mimic how the actual data are generated.

Optimization

The argument ry indicates the non-missing values, and we could possibly speed up the loop by removing the elements for which we have already found eligible imputations, but as I'm unfamiliar with the inner functions I have refrained from this.

I think the most important thing when you have strong constraints that take time to fulfill is to parallelize your imputations (see my answer on CrossValidated). Most computers today have 4-8 cores, and by default R uses only one of them. The running time can be (almost) cut in half by doubling the number of cores.
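
For example, a minimal sketch using the parallel package (the worker count and seeds are illustrative); ibind() combines the resulting mids objects into one:

library(parallel)
cl <- makeCluster(4)  # one worker per core
clusterEvalQ(cl, library(mice))
# The custom methods and data must exist on every worker
clusterExport(cl, c("dat", "mice.impute.pmm_x1",
                    "mice.impute.pmm_x2", "mice.impute.pmm_x3"))
imp_list <- parLapply(cl, 1:4, function(seed) {
  mice(dat, m = 1, seed = seed,
       method = c(x1 = "pmm_x1", x2 = "pmm_x2", x3 = "pmm_x3"))
})
stopCluster(cl)
imp_par <- Reduce(ibind, imp_list)  # one mids object with m = 4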

Missing parameters at imputation

Regarding the problem of x2 being missing at the time of imputation: mice never actually feeds missing values into the x data.frame. The mice procedure starts by filling in a random value for each missing cell, and the chained part of the imputation limits the impact of this initial value. If you look at the mice function you can find this prior to the imputation call (the mice:::sampler function):

...
if (method[j] != "") {
  for (i in 1:m) {
    if (nmis[j] < nrow(data)) {
      if (is.null(data.init)) {
        imp[[j]][, i] <- mice.impute.sample(y, 
                                            ry, ...)
      }
      else {
        imp[[j]][, i] <- data.init[!ry, j]
      }
    }
    else imp[[j]][, i] <- rnorm(nrow(data))
  }
}
...

The data.init argument can be supplied to the mice function, and mice.impute.sample is a basic sampling procedure.
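
A small sketch of supplying such starting values (the particular fill-ins are mine, chosen only to respect the constraints at the start):

init <- dat  # a complete copy of the data with hand-picked start values
init$x2[is.na(init$x2)] <- 0  # 0 is always admissible for x2
init$x3[is.na(init$x3)] <- 0  # 0 is always admissible for x3
init$x1[is.na(init$x1)] <- median(dat$x1, na.rm = TRUE)
imp_init <- mice(dat, data.init = init,
                 method = c(x1 = "pmm_x1", x2 = "pmm_x2", x3 = "pmm_x3"))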

Visiting sequence

If the visiting sequence is important you can specify the order in which the mice function runs the imputations. The default is 1:ncol(data), but you can set visitSequence to any order you like.
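
For instance, to impute x2 and x3 before x1 so that the sum restriction is checked against freshly imputed values (a sketch; the indices refer to the columns of dat):

imp_vs <- mice(dat,
               method = c(x1 = "pmm_x1", x2 = "pmm_x2", x3 = "pmm_x3"),
               visitSequence = c(2, 3, 1))  # visit x2, x3, then x1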

Max Gordon
  • +1 This is great stuff, exactly what I had in mind (see my comment to Frank's answer), and for sure the #1 candidate for the bounty as of now. A couple of things trouble me about `pmm_x1` though: (1) Taking the maximum sum of any possible combination of `x2` and `x3` from the entire data set is much weaker than the original constraint. The correct thing would be to test that *for each row*, `x1 < x2 + x3`. Of course the more rows you have, the smaller your chance of complying with such a constraint (as a single bad row ruins everything), and the longer the loop can potentially get. – t0x1n Nov 25 '14 at 22:38
  • (2) If both `x1` and `x2` are missing, you may impute a value for `x1` for which the constraints hold (let's say 50), but once `x2` gets imputed they are broken (let's say it's imputed to be 55). Is there a way to impute "horizontally" rather than vertically? That way we could impute a single row of `x1`, `x2`, and `x3` and simply re-impute it until that specific row falls under the constraints. That should be quick enough, and once that's done we can move to the next row. Of course if MI is "vertical" in its nature we're out of luck. In that case, maybe the approach Aleksandr mentioned? – t0x1n Nov 25 '14 at 22:46
  • Cool solution, +1! Might be especially handy, as I currently use `mice` package. Thanks for sharing. – Aleksandr Blekh Nov 25 '14 at 23:51
  • @t0x1n I've updated my answer with a more advanced wrapper function according to your comments. If you want to dive deeper, I recommend that you play around with `debug()` in order to see how `mice.impute.pmm` and its siblings work under the hood. – Max Gordon Nov 26 '14 at 11:02
  • I haven't understood Max's cool code well enough yet to know if the following problem exists with it: suppose that observations 1 and 2 need to have x imputed. If you generate 100 multiple imputations of x for each of these two observations, and imputation 2 is the first to satisfy the constraints for observation 1 while imputation 7 is the first to satisfy the constraints for observation 2, then filling in `NA`s for different observations with different multiple imputation choices will possibly not preserve the proper collinearities in the data. I'm not thinking this through well, though. – Frank Harrell Nov 27 '14 at 13:09
  • @FrankHarrell: I'm not sure I follow - my code discards invalid imputation attempts but it should not affect the iteration convergence. The code runs at each iteration, meaning that both observation 1 and 2 will have to pass the test before continuing to the next iteration - I think this is the critical part. The `break` could be an issue, but since the complete cases don't fit the restrictions, it is not surprising that the code fails to find matches. If the restrictions are true to the data generating procedure, I'm not sure how the collinearity issue would arise. – Max Gordon Nov 27 '14 at 13:56
  • Thanks Max. I just want to make sure that if you are doing $M$ multiple imputations you are getting the imputed values for all rows from the same column $1, 2, \cdots, M$. – Frank Harrell Nov 28 '14 at 12:28
  • Well done! That takes care of the max sum issue. However I'm still worried about cases where multiple variables require imputation in the same row. For example, suppose my constraint is `x1 – t0x1n Nov 28 '14 at 23:59
  • @t0x1n: I don't think it is a major issue but you should probably change the `visitSequence` so that the less complex dependencies are resolved at the end, e.g. in this example `visitSequence = c(2:3, 1)`. – Max Gordon Nov 29 '14 at 05:38
  • @MaxGordon I'm not sure that's going to help. Just like you need to maintain `x1<x2+x3` when you impute `x1`, you need to maintain `x2>x1-x3` when you impute `x2` and you need to maintain `x3>x1-x2` when you impute `x3`. Nevertheless, your answer is the most practical so far - bounty awarded :) BTW, why the community wiki? – t0x1n Nov 29 '14 at 10:48
  • @t0x1n Thxs. The wiki was a mistake, it was the "accept terms"-reflex click. Although since I've had a few revisions it's probably fair that I don't receive points for votes. Back to imp. problem: I've personally found that with complex restrictions the imputation don't work that well (I have some x-ray measurements with propagating restrictions). More fine-tuned functions or setting the `data.init` to get good starting value could be plausible solutions. I'm afraid that this would interfere with the random nature of the `mice` procedure and therefore refrained from this strategy in my paper. – Max Gordon Nov 29 '14 at 13:00
  • I see, thanks for the analysis. Any final imputation advice for my specific constraints: `x1 – t0x1n Nov 29 '14 at 13:19
  • @t0x1n: I guess - inspect your imputed values. If they seem unrealistic then you may choose my approach of only imputing those that are less central to the model. In my case I've chosen to exclude those without follow-up x-rays since they are at the heart of the study and the imputations don't provide clinically plausible values (a leg becoming longer after a fracture). I'm not entirely satisfied with this but it seems like a reasonable compromise. – Max Gordon Nov 29 '14 at 13:30
8

The closest thing I could find is Amelia's prior information inclusion. See chapter 4.7 in the vignette, specifically 4.7.2:

Observation-level priors

Researchers often have additional prior information about missing data values based on previous research, academic consensus, or personal experience. Amelia can incorporate this information to produce vastly improved imputations. The Amelia algorithm allows users to include informative Bayesian priors about individual missing data cells instead of the more general model parameters, many of which have little direct meaning.

The incorporation of priors follows basic Bayesian analysis where the imputation turns out to be a weighted average of the model-based imputation and the prior mean, where the weights are functions of the relative strength of the data and prior: when the model predicts very well, the imputation will down-weight the prior, and vice versa (Honaker and King, 2010).

The priors about individual observations should describe the analyst's belief about the distribution of the missing data cell. This can either take the form of a mean and a standard deviation or a confidence interval. For instance, we might know that 1986 tariff rates in Thailand were around 40%, but we have some uncertainty as to the exact value. Our prior belief about the distribution of the missing data cell, then, centers on 40 with a standard deviation that reflects the amount of uncertainty we have about our prior belief.

To input priors you must build a priors matrix with either four or five columns. Each row of the matrix represents a prior on either one observation or one variable. In any row, the entry in the first column is the row of the observation and the entry in the second column is the column of the observation. In the four-column priors matrix the third and fourth columns are the mean and standard deviation of the prior distribution of the missing value.

So while you won't be able to generally say something like x1<x2+x3, you could loop over your data set and add an observation-level prior for each relevant case. Constant bounds can also be applied (such as setting x1, x2, and x3 to be non-negative). For example:

priors <- matrix(NA, nrow = 0, ncol = 5)
for (i in seq_len(nrow(data)))  # loop over rows, not columns
{
    x1 <- data$x1[i]
    x2 <- data$x2[i]
    x3 <- data$x3[i]

    if (is.na(x1) && !is.na(x2) && !is.na(x3))
    {
        # Five-column prior: row, column, lower bound, upper bound, confidence
        priors <- rbind(priors, c(i, 1, 0, x2 + x3, 0.999999))
    }
}

amelia(data, m = 1,
       bounds = rbind(c(1, 0, Inf), c(2, 0, Inf), c(3, 0, Inf)),
       priors = priors)
t0x1n
6

Constraints are probably easier to implement in predictive mean matching multiple imputation. This assumes that there is a significant number of observations with non-missing constraining variables that meet the constraints. I'm thinking about implementing this in the aregImpute function of the R Hmisc package. You may want to check back in a month or so. It will be important to specify the maximum distance a donor observation can be from the target, because the constraints will push donors further from the ideal unconstrained donor.
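
For reference, a basic (still unconstrained) aregImpute call would look roughly like this sketch; the constraint options discussed above do not exist yet:

library(Hmisc)
# nk = 0 forces linear transformations, which is safer on a data set
# as small as the example in the question
a <- aregImpute(~ x1 + x2 + x3, data = dat, n.impute = 5, nk = 0)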

Frank Harrell
  • I would love to have this too. I have only need for the most basic inter-variable constraints, say `x – t0x1n Nov 18 '14 at 23:10
  • Forgive my ignorance if I'm way off, but I was under the impression that multiple imputation techniques involve drawing values from an appropriate distribution. Shouldn't it be a simple matter then to use rejection sampling? e.g. keep drawing until some specified constraint (such as `x1 – t0x1n Nov 24 '14 at 22:20
  • That is what I might do with the R `aregImpute` function with predictive mean matching. But what if none of the donor observations (near matches of predictions) satisfies the constraints for the target observation being imputed, even though they obviously had to meet the constraints upon the set of donor variables? – Frank Harrell Nov 26 '14 at 12:24
  • In such a case, perhaps take the predicted value directly? That is only rely on regression (without the PMM phase) for such a sample? – t0x1n Nov 26 '14 at 21:12
  • Regression imputation is slightly more likely to come up with imputed values that are inconsistent with the rest of the subject's record. So I don't think this is a reason to avoid PMM. – Frank Harrell Nov 26 '14 at 22:04
  • I meant on a per-row basis. So if for some row, all near matches of the prediction break the constraint, take the predicted value as-is for that specific row. Surely, a value that's slightly more likely to be inconsistent is preferable to a value that is inconsistent for certain (by account of breaking the constraints). Now, if the predicted value itself breaks the constraints (say for a predetermined amount of draws), that's a problem. I guess in that case one of the bounds will have to be taken (depending on which side the predicted value lies). – t0x1n Nov 27 '14 at 20:36
  • Using the predicted value (conditional mean imputation) will not work here. You would have to remove the outcome variable from the list of predictors during imputation (resulting in final regression coefficients that are biased low), and final standard errors will be biased low as well. – Frank Harrell Nov 28 '14 at 12:26
4

I believe that the Amelia (Amelia II) package currently has the most comprehensive support for specifying range constraints on data values. However, the problem is that Amelia assumes the data are multivariate normal.

If in your case the assumption of multivariate normality doesn't apply, you might want to check the mice package, which implements multiple imputation (MI) via chained equations. This package doesn't carry the assumption of multivariate normality. It also has a function that might be enough for specifying constraints, though I'm not sure to what degree. The function is called squeeze(). You can read about it in the documentation: http://cran.r-project.org/web/packages/mice/mice.pdf. An additional benefit of mice is its flexibility in allowing user-defined imputation functions and its wider selection of algorithms. Here's a tutorial on performing MI using mice: http://www.ats.ucla.edu/stat/r/faq/R_pmm_mi.htm.
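
For what it's worth, squeeze() is typically used through mice's post-processing facility, roughly like the sketch below (the bounds are made up for illustration):

ini  <- mice(dat, maxit = 0)
post <- ini$post
# After each draw for x1, truncate the imputed values into [14, 583]
post["x1"] <- "imp[[j]][, i] <- squeeze(imp[[j]][, i], c(14, 583))"
imp_sq <- mice(dat, post = post)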

As far as I understand, Dr. Harrell's Hmisc package, which uses the same chained equations (predictive mean matching) approach, probably supports non-normal data (with the exception of the normpmm method). Maybe he has already implemented the constraint-specification functionality per his answer above. I haven't used aregImpute(), so I can't say much more about it (I've used Amelia and mice, but I'm definitely not an expert in statistics, just trying to learn as much as I can).

Finally, you may find the following overview of approaches, methods and software for multiple imputation of data with missing values interesting; it is a little dated, but still nice: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1839993. I'm sure there are newer overview papers on MI, but that's all I'm aware of at the present time. I hope this is somewhat helpful.

Aleksandr Blekh
  • This nice comment makes me think that predictive mean matching, which replaces missings with actually observed values, may already incorporate some kinds of constraints if all observed data meet those constraints. I would appreciate someone thinking this through. I have not yet implemented any special constraints in `aregImpute`. – Frank Harrell Nov 22 '14 at 19:06
  • @FrankHarrell: Thank you for kind words! I'd love to think about it, but: 1) I believe that my current understanding of the MI mechanics is not enough to deal with this; 2) I have several days until I have to submit the first revision of my dissertation report and there is still quite a lot of work to be done, so you can imagine the situation and stress associated with it. – Aleksandr Blekh Nov 22 '14 at 19:26
  • You are right. I just realized that the donor observations provide values that are consistent with _their_ other variables but not with the other variables of the target observation. – Frank Harrell Nov 22 '14 at 22:09
  • Aside from the distribution assumptions made by Amelia, were you by any chance able to specify constraints in further detail than I've demonstrated in my answer? The problem with `squeeze` is that its bounds are constant, so you can't specify anything like `x1 – t0x1n Nov 24 '14 at 22:12
  • @t0x1n: Unfortunately, I haven't had a chance to specify constraints in `Amelia`, because I switched from it to `mice` as soon as my tests confirmed that my data is not multivariate normal. However, I recently ran across this very nice set of presentation slides on the topic (MI methods and software): http://www.statistik.lmu.de/~fkreuter/imputation_sose2011/downloads/Imputationsvorlesung_5_slides.pdf. If I understood correctly, it describes a potential solution to the constraints problem (see the PDF's page 50 - not slide number 50!). Hope this helps. – Aleksandr Blekh Nov 25 '14 at 17:29
  • @t0x1n: Actually, the solution is described on pages 50 and 51. – Aleksandr Blekh Nov 25 '14 at 17:38
0

If I understand your question correctly, it seems that you already know what values the missing variables should take, subject to some constraints. I am not very conversant with SPSS, but in R you can write a function to do that yourself (which shouldn't be too difficult, depending on your experience). I don't know of any package that works with such constraints.
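
As a rough illustration of that idea (a naive single draw, not proper multiple imputation), one could resample observed values until a constraint holds:

# Draw candidates from the observed part of a variable until the
# supplied predicate is satisfied, giving up after max_tries draws
impute_constrained <- function(obs, valid, max_tries = 1000) {
  pool <- obs[!is.na(obs)]
  for (i in seq_len(max_tries)) {
    cand <- sample(pool, 1)
    if (valid(cand)) return(cand)
  }
  NA  # no admissible draw found; leave the value missing
}

# e.g. for x2: either 0 or at least 16
impute_constrained(dat$x2, function(v) v == 0 || v >= 16)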