Missing data not at random - Advice needed on method

Question

I have been developing a logistic regression model based on retrospective data from a national trauma database of head injury in the UK. The key outcome is 30-day mortality (recorded as the "Survive" measure). Other measures available across the whole database, with published evidence of a significant effect on outcome in previous studies, include:

Year - Year of procedure = 1994-2013
Age - Age of patient = 16.0-101.5
ISS - Injury Severity Score = 1-75
GCS - Glasgow Coma Scale = 3-15
Sex - Gender of patient = Male or Female
inctoCran - Time from head injury to craniotomy in minutes = 0-2880 (After 2880 minutes is defined as a separate diagnosis)

There are two additional variables where the data is only available from 2004:

Pupil reactivity (1 is brisk, 2 is sluggish and 3 is unreactive)
Pupil size (continuous variable in mm - 1-10)

Based on the literature, these are significant predictors of outcome. However, out of a series of 2140 patients, 936 are missing these measures. Moreover, the data are not missing at random, having been collected only in recent years.

My questions, all concerning how to handle the full year range 1994-2013, are the following:

1) My data is heavily skewed to later years; how can I adapt the logistic regression to reduce the effect of this when assessing the effect of the year of procedure on outcome?

2) Can I exclude pupil reactivity from this analysis, since it was not collected before 2004, even though it is a strong predictor?

3) If I should include pupil reactivity, can a multivariable regression on the variables above be used to impute values for 1994-2003, given that 43% of the data is missing?

4) If that is not possible, could imputation be based on the data since 2009, where ~15% is missing?

I perform all statistical analyses exclusively with R and would be grateful if you could add known packages/formulae to execute your suggestions.

Missing Not at Random (MNAR) is by definition missingness related to the outcome, something different from what you mean, I believe. I am not sure what is a better term, truncated data? Structural missingness? Anyway, perhaps this will be helpful: http://stats.stackexchange.com/questions/48483/coding-of-semi-numerical-variables/48608. Basically, you create a dummy to capture whether a certain case contains additional data. — Maxim.K, Nov 18 '14 at 13:17
Many thanks for your thoughts and the link. All patients without pupil reactivity data (shown in green in the histogram above) will by definition have a pupil reactivity. The data was just not recorded for these patients, as the database only began recording it from 2004. The pupil reactivity variable is structured as follows: 1 is brisk, 2 is sluggish and 3 is unreactive. — Dan Fountain, Nov 18 '14 at 14:49
It doesn't matter if they had pupil reactivity or not: the data is not there anyway. By introducing the dummy you simply designate this group as special, to ensure that the effect of reactivity is estimated properly from the recent data group. See also this post: http://stats.stackexchange.com/questions/56306/time-spent-in-an-activity-as-an-independent-variable. — Maxim.K, Nov 18 '14 at 16:14
This is clear. Very helpful posts you have added here and apologies for posting on something with an answer already. — Dan Fountain, Nov 18 '14 at 18:28

Aksakal · Accepted Answer · 2015-07-16T15:19:19.667

This is not "missing not at random" unless you have a time trend. If you do have a time trend, then you could say that the presence of the data depends on time, i.e. it is correlated with a covariate.

Let's say you have a linear model: $$y=c+X\beta+Z\gamma+\varepsilon$$ where $X$ are the covariates that are always available, $Z$ are the covariates that are available only for recent years, and $c$ is the intercept.

You can adjust this equation for the period when $Z$ were not available as follows: $$y=c+d+X\beta+\varepsilon$$ where $d$ is an intercept adjustment that stands in for the mean contribution of the missing $Z$.
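
One way to read $d$, sketched under the added assumption that the unrecorded $Z$ are uncorrelated with $X$: write $\bar{Z}$ for the mean of $Z$ in the early period (a symbol introduced here, not part of the original setup) and split $Z\gamma$ into its mean and the deviation from it: $$y=c+X\beta+Z\gamma+\varepsilon=(c+\bar{Z}\gamma)+X\beta+\big[(Z-\bar{Z})\gamma+\varepsilon\big]$$ so that $d=\bar{Z}\gamma$ and the deviation $(Z-\bar{Z})\gamma$ is absorbed into the error term. If $Z$ is correlated with $X$, part of its effect is picked up by $\beta$ instead. The split is exact only for the linear model written here; for a logistic model it should be read as an approximation.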

To estimate this setup in one equation, all you need is the dummy $\delta$, which is 1 when the $Z$ are missing and 0 when they are present, with the missing $Z$ values filled in as 0 so that the $Z\gamma$ term drops out for those rows: $$y=c+\delta d+X\beta+Z\gamma+\varepsilon$$
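
Since the question asks for R, here is a minimal sketch of the dummy approach, adapted to a logistic model with `glm`. Everything in it is an assumption made for illustration: the data are simulated, the coefficients used to generate the outcome are made up, and the column names (`z_missing`, `sluggish`, `unreactive`, `pupil_obs`) are hypothetical. Year is left out of the model formula for brevity, and the unrecorded $Z$ entries are set to 0 so that the $Z\gamma$ term vanishes for those rows.

```r
set.seed(1)                                 # reproducible simulated data
n     <- 2000
year  <- sample(1994:2013, n, replace = TRUE)
age   <- runif(n, 16, 100)
gcs   <- sample(3:15, n, replace = TRUE)
pupil <- sample(1:3, n, replace = TRUE)     # 1 brisk, 2 sluggish, 3 unreactive

# Made-up coefficients, used only to generate an outcome to fit against.
lp      <- 1 - 0.03 * (age - 50) + 0.25 * (gcs - 9) - 0.8 * (pupil - 1)
survive <- rbinom(n, 1, plogis(lp))

# Mimic the database: pupil reactivity was not recorded before 2004.
pupil_obs <- ifelse(year >= 2004, pupil, NA)
dat <- data.frame(survive, year, age, gcs, pupil_obs)

# delta: 1 when Z is missing, 0 when it is present.
dat$z_missing <- as.integer(is.na(dat$pupil_obs))

# Z as indicator columns (brisk is the reference level), set to 0 where
# Z is missing so that the Z * gamma term drops out for those rows.
dat$sluggish   <- ifelse(dat$z_missing == 1, 0, as.integer(dat$pupil_obs == 2))
dat$unreactive <- ifelse(dat$z_missing == 1, 0, as.integer(dat$pupil_obs == 3))

# One equation for all years: the coefficient on z_missing plays the role of d.
fit <- glm(survive ~ age + gcs + z_missing + sluggish + unreactive,
           family = binomial, data = dat)
summary(fit)
```

In this sketch the `sluggish` and `unreactive` coefficients are estimated only from the rows where reactivity was recorded, while all rows contribute to the `age` and `gcs` coefficients. Coding reactivity as a four-level factor with an extra "not recorded" level would give an equivalent fit.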
