5

I am doing multiple imputation on a database of observations on hospital patients. There is one observation of many covariates per patient. There are 2 binary outcome variables:

  1. Alive/Dead after 30 days

  2. Died in hospital, or survived/discharged

Two separate analysis models (logistic regression), with identical covariates, are to be run, each with one of these outcomes as the response. There are 7 covariates in the analysis models.

Each of the covariates and outcomes has between 3% and 11% missing data.

In addition, there are 7 further covariates that are to be used in the imputation model as auxiliary predictors of the missing values in the covariates and outcomes.

My question concerns the imputation of the two outcome variables. They are to be used as predictors for missingness in the covariates, as per standard practice, but they are highly collinear with each other. Is this a concern for the imputation model? Is it valid/recommended to impute them both in the same model (to generate several complete datasets all containing both outcomes), or should separate imputations be performed for each of the outcomes (to generate two distinct sets of complete datasets, each distinct set having one of the outcomes)? Any other suggestions for how to proceed would be welcome.

Joe King
  • Do you want to do imputation of one of the outcomes (when one is missing) or are you only asking about the imputation of the covariates? – Guillaume Nov 07 '12 at 09:22
  • @Guillaume I am interested in both situations - imputing one when the other is missing, and also using both of them to impute the other covariates. That's where my concern about collinearity comes from. – Joe King Nov 08 '12 at 17:48

2 Answers

3

As per the other answer, in general there is no reason not to impute all of your variables in one go, generating a single set of imputed datasets. The strong correlation between your two outcomes should not be much of an issue: when they serve as predictors in the imputation models for your missing covariates, their coefficients will be estimated imprecisely, but this is not usually a problem for drawing from the resulting imputation distribution.
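A minimal sketch of the single joint imputation described above, using scikit-learn's IterativeImputer as a stand-in for a full MICE implementation (the question doesn't specify software, and all variable names and dimensions here are invented for illustration):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Toy stand-in for the hospital data: 7 analysis covariates,
# 7 auxiliary covariates, and 2 strongly correlated binary outcomes.
n = 500
X = rng.normal(size=(n, 14))
latent = X[:, 0] + rng.normal(size=n)
y30 = (latent > 0).astype(float)  # dead within 30 days
yhosp = ((latent + rng.normal(scale=0.3, size=n)) > 0).astype(float)  # died in hospital
data = np.column_stack([X, y30, yhosp])

# Punch ~5% holes at random (MCAR here, purely for illustration).
mask = rng.random(data.shape) < 0.05
data[mask] = np.nan

# One joint imputation model over everything, outcomes included.
# sample_posterior=True gives proper draws, so running the imputer
# m times with different seeds yields m completed datasets.
m = 5
completed = [
    IterativeImputer(sample_posterior=True, random_state=s, max_iter=10)
    .fit_transform(data)
    for s in range(m)
]
```

One caveat: IterativeImputer's default regressor treats the binary outcome columns as continuous; a full MICE implementation (such as R's mice) would let you specify a logistic imputation model for those columns instead.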

There is a long discussion in the comments between AdamO and Joe King, part of which concerns when complete case analysis / listwise deletion is unbiased. If data are MCAR, it is of course unbiased. If missingness is independent of the dependent variable in the model of interest, conditional on the covariates in that model, it is also unbiased (this point is made by Little and Rubin in their book). Depending on where the missingness occurs, this condition can sometimes correspond to an MAR mechanism and sometimes to an MNAR mechanism. It is not correct that complete case analysis is generally unbiased under MAR mechanisms.
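This condition can be checked with a small simulation (a sketch with made-up parameters, not the setting from the question, and using linear rather than logistic regression for simplicity): when missingness depends only on the covariate, the complete-case slope is essentially unbiased; when it depends on the outcome, it is not.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)  # true slope = 2


def cc_slope(keep):
    """OLS slope of y on x using complete cases only."""
    return np.polyfit(x[keep], y[keep], 1)[0]


# Missingness depends only on the covariate x:
# complete-case analysis stays (essentially) unbiased.
keep_x = rng.random(n) > 1 / (1 + np.exp(-x))  # P(missing) rises with x
slope_dep_x = cc_slope(keep_x)

# Missingness depends on the outcome y:
# complete-case analysis is biased (slope attenuated here).
keep_y = y < 0
slope_dep_y = cc_slope(keep_y)

print(round(slope_dep_x, 2), round(slope_dep_y, 2))  # roughly 2.0 and about 1.5
```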

For more on this, see for example this paper: http://www.ncbi.nlm.nih.gov/pubmed/20842622

I also wrote a post about complete case analysis validity some time ago on my blog: http://thestatsgeek.com/2013/07/06/when-is-complete-case-analysis-unbiased/

1

In general, multiple imputation works by using all available information in the model to simulate the missing values. I say "simulate" rather than "predict" because drawing imputations involves more than point prediction and rests on additional parametric assumptions.

I assume the outcomes are either jointly missing or jointly observed: there are no cases where outcome A is known when outcome B isn't, or vice versa.

Collinearity is not an issue. If you are treating these outcomes as independent (reporting odds ratios from two logistic regression models with separate outcomes), then you don't even need to worry about whether contradictory outcomes are simulated (e.g. a patient simulated as both discharged alive and dead within 30 days - I assume you wouldn't have observed that in the hospital).

Two overall thoughts about the analysis:

  1. Why aren't you reporting a Cox proportional hazards model? Treating death/discharge as a 1/0 event indicator, with time until the observed discharge or death as the time-to-event, is a very similar, common, and preferred analysis. The hazard ratios approximate relative risks, just like odds ratios do for rare outcomes, and patients who are discharged should be censored. This way you use all available information about when patients were at risk of dying, and it is a much more powerful analysis.

  2. I don't really agree with the need for missing data methods. Complete case analyses (where rows with missing data are dropped from the analysis) are unbiased and require little validation of assumptions, unlike multiple imputation, which carries the caveats of estimating and validating parametric models. 13% is fairly negligible when $n$ is 150 or more. As a rule of thumb, having 20 events (here, deaths) per variable in the adjusted model is sufficient for power considerations.
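Two quick back-of-envelope checks on the scale of the problem (the 7% figure and independence of missingness across variables are assumptions for illustration, not facts from the question):

```python
# Expected complete-case fraction if each of the 8 analysis variables
# (7 covariates + 1 outcome) is independently missing ~7% of the time
# (midpoint of the reported 3-11% range). Per-variable missingness
# compounds across columns, so "3-11% per variable" can still cost
# nearly half the rows under listwise deletion.
p_complete = 0.93 ** 8
print(round(p_complete, 2))  # about 0.56

# Events-per-variable rule of thumb from the answer: 20 events per
# covariate with 7 covariates means ~140 deaths are needed in the
# sample actually analysed.
events_needed = 20 * 7
print(events_needed)  # 140
```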

AdamO
  • Thank you! A few points in reply. 1: There are no rows where all data are missing. 2: Many rows have only 1 missing variable, so I think excluding the row leads to bias (they are not MCAR). 3: There are cases where a patient was discharged from the hospital, but died within 30 days, and also where a patient was alive at 30 days, but died in hospital (after 30 days, obviously). I think this precludes a Cox model ? 3: There are many cases where one outcome is recorded, but the other is missing. 4: My concern with collinearity is that these outcomes are used as predictors in the imputation model. – Joe King Nov 07 '12 at 22:49
  • 1: okay. 2: You're wrong, see Rubin's Statistical Analysis with Missing Data 2nd ed. CC is unbiased with MAR data, MI propagates bias in NIM. 3: Read some NEJM / JAMA articles. If you observe patients after discharge, you should do time-to-death analyses, using time-varying covariates for discharge. You can model logistic regression for 30 day discharge among survivors, but be careful of interpreting results. 3: shouldn't matter, then, use one to predict the other and vice versa, they are strong proxies 4: that's not a collinearity issue. You use outcomes as predictors in MI. – AdamO Nov 07 '12 at 23:30
  • Thanks again...2: That's not what I took home from the book...I thought it was clear on that. For example in my data, some variables are strongly predictive of missingness in others which violates MCAR but not MAR, and deleting them would certainly lead to bias. Also it would mean discarding around half of the entire dataset ! 3: I don't have time-to-death data. 4: The imputation model uses both these outcomes as predictors for imputation, but they are highly collinear - so there are 2 highly collinear predictors, used as covariates to impute (for example) a patient's age - that's my concern. – Joe King Nov 08 '12 at 08:19
  • You should reread it. I don't have it on hand, so I can't cite anything. Just do a simulation and convince yourself. You originally reported 13% of the data missing, now you report 50% are missing. This changes the scope of my original recommendation. Please revise your post to more accurately describe the problem. – AdamO Nov 08 '12 at 18:04
  • No, I am not saying there is 50% missing - I am saying there is around 50% of data that are complete (ie with no missing variables). There is between 3 and 11% of each variable missing - and that's what I said in my post. For the 50% of observations that are not complete, typically there is 1 or sometimes 2, and occasionally more, variables missing. – Joe King Nov 08 '12 at 18:43
  • Regarding the question of listwise deletion / complete case analysis being biased under MAR, I don't have the book available either, but I found lots of references on the net including [this one](http://www.academia.edu/581063/Four_Techniques_for_Dealing_with_Missing_Data) _"Furthermore, if the data are MAR and not MCAR, listwise deletion will produce biased estimates (Allison, 2002; Little & Rubin, 2002)"_ – Joe King Nov 08 '12 at 18:53
  • This reference is unpublished and hence unreviewed. Can you provide a reference from a scholarly publication? In response to the previous comment, I still believe you have failed to describe the nature of your missing data adequately. – AdamO Nov 08 '12 at 21:35
  • Sure, there are so many references ! I just chose the previous one, as that one specifically referred to Rubin and Little for the justification. [Here](http://os1.amc.nl/mediawiki/images/donders_-_jce_2006_missing_values.pdf) is a scholarly one: **"In most situations, simple techniques for handling missing data (such as complete case analysis, overall mean imputation, and the missing-indicator method) produce biased results"** – Joe King Nov 08 '12 at 22:47
  • And from the same paper **"Complete and available case analyses provide inefficient though valid results when missing data are MCAR, but biased results when missing data are MAR,"** – Joe King Nov 08 '12 at 22:55
  • Not ragging on these Dutch guys but they show no analytical results to convince that CC is all that bad. Their citations actually show that semiparametric methods (GEE, Cox models) are biased but MLE like logistic regression is *unbiased in MAR data*. Read the book! Ambler does a better job: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.125.2515&rep=rep1&type=pdf. In the analysis of binary data, you get bias with small sample sizes, so poorly calibrated logistic regression models do exacerbate bias *that's already present in the imputed models*. – AdamO Nov 09 '12 at 00:36
  • Unfortunately I don't live near a library that has this or similar books so until I can visit a University library I have to rely on online materials. I don't see how the article by Ambler et al that you linked to supports your view that CC is unbiased under MAR. They seem to clearly show that CC **is** biased, for example see Table 3/4/5/6. They also state, in the abstract **"complete case analysis may produce unreliable risk predictions and should be avoided"** and in the discussion **"ignoring missing data and performing a complete case analysis can lead to substantial bias"** – Joe King Nov 09 '12 at 17:36
  • There are three objectives in an analysis: estimation of risk model coefficients, reliability of risk predictions, and inference of associations. Those conclusions were based on 2. If you're not familiar with small sample bias, you should look this up. This is the crux of their problem and you haven't described your data. As I said, you should report your number of "events per variable" in the analysis. – AdamO Nov 09 '12 at 17:48
  • Also as I said, I suggest you post for us the results of a simulation study using your data and a controlled missing data mechanism to motivate the use of missing data methods. I feel this commentary is not being evaluated impartially, so I see no need to continue the discussion. – AdamO Nov 09 '12 at 17:50
  • Thanks again. I'm sorry you don't see the need to continue the discussion. I am certainly trying to understand your commentary and take it on board but I have only begun learning statistics this year so I am not very experienced. I didn't notice where you asked previously for me to report the number of events per variable. Can you explain what that means ? Do you mean the total number of observations and the number of missing entries for each variable ? I hope you will reconsider continuing the discussion, as I am finding it extremely helpful - though somewhat confusing ;) – Joe King Nov 09 '12 at 19:34