I have been developing a logistic regression model using retrospective data from a national trauma database of head injury in the UK. The key outcome is 30-day mortality (recorded as the variable "Survive"). Other variables available across the whole database, each with published evidence of a significant effect on outcome, include:
Year - Year of procedure = 1994-2013
Age - Age of patient = 16.0-101.5
ISS - Injury Severity Score = 1-75
GCS - Glasgow Coma Scale = 3-15
Sex - Gender of patient = Male or Female
inctoCran - Time from head injury to craniotomy in minutes = 0-2880 (craniotomy after 2880 minutes is defined as a separate diagnosis)
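To make the setup concrete, here is a minimal sketch of a model of this form in base R. The data frame `dat` is simulated purely for illustration: the distributions and coefficients are invented and not taken from the database, and only the variable names follow the list above.

```r
set.seed(1)
n <- 500

# Simulated stand-in for the database; all values are invented for illustration
dat <- data.frame(
  Year      = sample(1994:2013, n, replace = TRUE),
  Age       = runif(n, 16, 101.5),
  ISS       = sample(1:75, n, replace = TRUE),
  GCS       = sample(3:15, n, replace = TRUE),
  Sex       = factor(sample(c("Male", "Female"), n, replace = TRUE)),
  inctoCran = runif(n, 0, 2880)
)

# Invented outcome mechanism so that the model has something to fit
lp <- 1 - 0.03 * (dat$Age - 50) - 0.03 * dat$ISS + 0.2 * (dat$GCS - 9)
dat$Survive <- rbinom(n, 1, plogis(lp))

# Logistic regression of 30-day survival on the listed predictors
fit <- glm(Survive ~ Year + Age + ISS + GCS + Sex + inctoCran,
           data = dat, family = binomial)
summary(fit)
```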
Two additional variables are only available from 2004 onwards:
Pupil reactivity (1 is brisk, 2 is sluggish and 3 is unreactive)
Pupil size (continuous variable, 1-10 mm)
Based on the literature, these are significant predictors of outcome. However, out of a series of 2140 cases, they are missing for 936. Moreover, the missingness is systematic rather than random, since these measures have only been collected in recent years.
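The pattern of missingness can be tabulated by year before choosing an approach. The sketch below uses simulated data in which pupil reactivity (hypothetical column name `PupilReact`) is never recorded before 2004 and is sporadically missing afterwards. The 15% gap rate is an arbitrary choice for the simulation, not a figure from the database.

```r
set.seed(2)
n <- 1000
dat <- data.frame(Year = sample(1994:2013, n, replace = TRUE))

# Hypothetical column: 1 = brisk, 2 = sluggish, 3 = unreactive
dat$PupilReact <- sample(1:3, n, replace = TRUE)

# Not collected at all before 2004
dat$PupilReact[dat$Year < 2004] <- NA

# Sporadic gaps from 2004 onwards (arbitrary 15% for illustration)
late <- which(dat$Year >= 2004)
dat$PupilReact[sample(late, round(0.15 * length(late)))] <- NA

# Proportion missing in each year
miss_by_year <- tapply(is.na(dat$PupilReact), dat$Year, mean)
round(miss_by_year, 2)
```

A table like this shows at a glance whether missingness is determined entirely by year or also varies within the years in which the measure was collected.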
My questions, all aimed at analysing the full year range 1994-2013, are the following:
1) My data is heavily skewed towards later years; how can I adapt the logistic regression to reduce the impact of this skew when assessing the effect of year of procedure on outcome?
2) Can I exclude pupil reactivity from this analysis, since it was not collected before 2004, even though it is a strong predictor?
3) If I should include pupil reactivity, can a multivariable regression on the variables above be used to impute values for 1994-2003, given that 43% of the data is missing?
4) If that is not possible, could imputation instead be based on data since 2009, where ~15% is missing?
I perform all statistical analyses exclusively in R and would be grateful if you could name the relevant packages and formulae for carrying out your suggestions.