How to test that your missing data is completely at random (MCAR)

Question

I am doing a secondary analysis of data from a local trial (N=450, mean follow-up longer than 10 years). I am specifically looking at a secondary outcome (diagnosis of hypertension) 10 years after the start of the trial.

My problem is that the trial itself only lasted 2 years and, for my analysis (which uses a 10-year follow-up time), I have complete data on only approximately 50% of the sample (i.e. there are no clinical records for 50% of the original sample, for various reasons such as dropout, death, etc.). Given the high proportion of missing data (and without even testing the MAR assumption), I am not considering multiple imputation as an option. Therefore, the two possibilities that seem worth exploring are i) running a complete-case analysis anyway or ii) applying inverse probability weighting.

Because I suspect that the loss to follow-up has no systematic cause (the trial was about the use of a coenzyme, which is unlikely to be associated with the likelihood of dropping out), I would be inclined to analyse the data with a complete-case analysis. What I initially did was to test for differences in the distribution of study covariates between the group with data at follow-up and the group with missing data. I found no difference between the two groups. I came across this approach in old papers, but I am not sure it is robust enough (I remember reading these papers but cannot find them now for a closer look at the issue). Also, I am not sure whether the comparison should have been conducted the way I did it (group with complete data vs group with missing data) or should have been between the initial sample and the group with complete data, under the assumption that the group with complete data is still representative. Can you help me clarify this?

However, I suspect this approach is not sufficient. I read about Little's MCAR test (which doesn't seem to be used much anyway, as it is not a gold standard), but I also read that you can test this assumption by creating dummy variables for missing data and fitting probit/logistic regression models using each covariate, one at a time, as the independent variable. However, I am not sure I understood it correctly, as I could not find a practical example.

Any help would be appreciated. Also, references would be very welcome.

Thanks

What do you mean "you have complete data on 50% of the population" (you mean sample not population). Do you mean only 50% were actually diagnosed with hypertension or...? — AdamO, Feb 06 '18 at 22:04
Hi. Thank you for your answer. I edited the question accordingly. What I meant is that because it is a secondary analysis using a much longer follow-up than the one defined by design with the original trial, I don't have data on approximately 50% of the sample. This is because time from the beginning of the study is more than 10 years. — Vincent, Feb 07 '18 at 09:19

score 1 · Answer 1 · answered Feb 07 '18 at 12:14

The approach you outline towards the end of your question is often used in practice. You create an indicator variable for missingness (yes/no) and then fit a logistic regression to see whether anything predicts missingness. You then include any variables that do predict missingness in your final model, even if they are not otherwise of scientific or clinical interest. Doing comparisons the way you did will probably lead to the same conclusion, but a multiple logistic regression is preferable because it considers the covariates together rather than one by one.
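
Since a practical example was asked for, here is a minimal sketch of the missingness regression in Python. Everything in it is an assumption for illustration: the data are simulated, the covariate names (`age_z`, `sbp_z`, `female`) are hypothetical, and the helper functions `invert`, `fit_logistic` and `missingness_model` are written here with the standard library only so that the example runs anywhere. In a real analysis you would normally use your statistics package's logistic regression routine instead. Note also that this only checks whether missingness depends on the observed covariates; it cannot show that missingness is unrelated to the unobserved outcome itself.

```python
import math
import random
from statistics import NormalDist


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fit_logistic(X, y, n_iter=25):
    """Fit a logistic regression by Newton-Raphson.

    X: list of rows, each starting with 1.0 for the intercept.
    y: list of 0/1 outcomes.
    Returns (coefficients, standard errors).
    """
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(n_iter):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for row, yi in zip(X, y):
            eta = sum(b * x for b, x in zip(beta, row))
            p = 1.0 / (1.0 + math.exp(-eta))
            w = p * (1.0 - p)
            for a in range(k):
                grad[a] += (yi - p) * row[a]
                for b in range(k):
                    hess[a][b] += w * row[a] * row[b]
        cov = invert(hess)  # inverse information matrix at the current estimate
        step = [sum(cov[a][b] * grad[b] for b in range(k)) for a in range(k)]
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < 1e-10:
            break
    # cov comes from the iterate just before the final (negligible) step
    se = [math.sqrt(cov[a][a]) for a in range(k)]
    return beta, se


def missingness_model(covariates, missing):
    """Regress the missingness indicator on all covariates jointly.

    covariates: dict mapping a name to a list of numeric values, one per participant.
    missing: list of 0/1, where 1 means the follow-up outcome is missing.
    Returns {name: (odds_ratio, two_sided_wald_p_value)}.
    """
    names = list(covariates)
    X = [[1.0] + [covariates[n][i] for n in names] for i in range(len(missing))]
    beta, se = fit_logistic(X, missing)
    out = {}
    for name, b, s in zip(names, beta[1:], se[1:]):
        z = b / s
        out[name] = (math.exp(b), 2.0 * (1.0 - NormalDist().cdf(abs(z))))
    return out


# Simulated data, for illustration only (not the trial data).
random.seed(1)
n = 450
age_z = [random.gauss(0, 1) for _ in range(n)]   # standardised age (hypothetical)
sbp_z = [random.gauss(0, 1) for _ in range(n)]   # standardised baseline systolic BP (hypothetical)
female = [float(random.random() < 0.5) for _ in range(n)]
# Assumed mechanism: older participants are more likely to be lost to follow-up.
missing = [int(random.random() < 1 / (1 + math.exp(-0.5 * a))) for a in age_z]

results = missingness_model({"age_z": age_z, "sbp_z": sbp_z, "female": female}, missing)
for name, (odds_ratio, p_value) in results.items():
    print(f"{name}: OR={odds_ratio:.2f}, p={p_value:.3f}")
```

Any covariate with a clearly non-null odds ratio here would be a candidate for inclusion in the final outcome model (or in the weights, if you go the inverse probability weighting route).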
