How to deal with missingness of dependent variable in unbalanced probit model

Question

I am trying to estimate a probit model for the probability of suicide over the next year in a population. Unfortunately for this research, suicide is a very rare event: the probability of suicide in the next year in this population is below 0.2%.

Additionally, the data are incomplete, so most of the suicide events come from individuals with almost all covariates missing (e.g. income, age, etc.). What is the best approach for these individuals with few or no covariates?

Exclude them from the regression? [Wouldn't I get a lower suicide rate if the missingness of covariates is not evenly distributed between the 0 and 1 values of my dependent variable?]
Impute the covariate values with the values of the other suicidal individuals? [Wouldn't I run into perfect separation with this approach?]

Clearly you would not want to concentrate only on those with complete data. You have to worry that suicide rates could differ for people in situations that also affect data availability (think of, e.g., people who are homeless or unemployed, migrants, students, people in retirement homes, people in the armed forces, etc.). — Björn, Oct 14 '18 at 11:18
@Björn If I don't exclude observations with incomplete data, I can barely run a probit model. If I exclude observations with incomplete data for suicide==0 and impute the covariates for suicide observations, I would bias the suicide rate. I am unsure which option is best, or least bad. — user3507584, Oct 14 '18 at 11:28
See [this page](https://stats.stackexchange.com/q/46226/28500) for further discussion about multiple imputation of outcome variables. Multiple imputation avoids problems that can arise from single imputation. — EdM, Oct 15 '18 at 19:17

Dimitris Rizopoulos · Accepted Answer · 2018-10-15T19:03:43.063

An issue here, as in all settings with missing data, is understanding the potential missing data mechanism. That is, why do subjects who died by suicide have missing covariates? Do you expect that the reasons why these data were not recorded are related to the fact that they died by suicide?

If not, then you could use a multiple imputation approach in which the relationships between the covariates and the outcome from all subjects are used to impute the missing data.
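As a rough illustration of the last step of multiple imputation, the sketch below pools one probit coefficient across imputed datasets using Rubin's rules. It assumes the probit model has already been fitted separately on each of m imputed datasets; the function name `pool_rubin` and the numbers in the usage lines are hypothetical and not taken from the question.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputed datasets with Rubin's rules.

    estimates: the coefficient estimate from each imputed dataset
    variances: the squared standard error of that coefficient in each dataset
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two imputations and matching lengths")

    pooled = mean(estimates)        # average of the per-dataset estimates
    within = mean(variances)        # average sampling variance
    between = variance(estimates)   # sample variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical coefficients and squared standard errors from m = 3 imputations
coef, total_var = pool_rubin([0.10, 0.20, 0.30], [0.01, 0.01, 0.01])
print(coef, total_var ** 0.5)  # pooled estimate and its standard error
```

The between-imputation term is what distinguishes this from single imputation: it carries the uncertainty about the missing values into the final standard error.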

If yes, then you're indeed in a more difficult situation, and you will need to go to more complicated approaches (e.g., multiple imputation under missing not at random) and sensitivity analysis.
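One simple form of sensitivity analysis under missing not at random is delta adjustment: shift the imputed values by an offset, refit the model, and see how far the conclusions move. The sketch below shows only the shifting step on toy numbers. The helper `delta_adjust`, the `income` values, and the grid of offsets are all hypothetical; in practice the probit model would be refitted and pooled for each offset.

```python
from statistics import mean


def delta_adjust(values, was_imputed, delta):
    """Shift only the imputed entries by delta; observed entries are untouched."""
    return [v + delta if imputed else v for v, imputed in zip(values, was_imputed)]


# Hypothetical covariate after imputation, with flags marking the imputed entries
income = [30.0, 42.0, 35.0, 38.0]
was_imputed = [False, True, False, True]

# Scan a grid of offsets; delta = 0 corresponds to the missing-at-random imputation
for delta in (-10.0, -5.0, 0.0, 5.0):
    shifted = delta_adjust(income, was_imputed, delta)
    print(delta, mean(shifted))
```

If the substantive conclusions hold across a plausible range of offsets, they are less dependent on the untestable missing-at-random assumption.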
