I am trying to estimate a probit model of the probability of suicide over the next year in a population. Unfortunately for this research, suicide is a very rare event: the probability of suicide in the next year in this population is below 0.2%.
Additionally, the data are incomplete: most of the suicide events come from individuals for whom almost all covariates (e.g. income, age, etc.) are missing. What is the best approach for handling these individuals with few or no covariates?
- Exclude them from the regression? [Wouldn't the suicide rate in my estimation sample be lower than the true rate if the missingness of covariates is not evenly distributed between the 0 and 1 values of my dependent variable?]
- Impute the missing covariate values using the values of the other individuals with a suicide event? [Wouldn't I run into perfect separation with this approach?]
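
To make the first concern concrete, here is a minimal sketch with made-up numbers. The population size and the two missingness shares (80% of suicide events and 10% of non-events lacking covariates) are assumptions for illustration only, not figures from the actual data. It just computes the event rate that would remain after dropping every individual with missing covariates.

```python
# All numbers below are hypothetical, chosen only to illustrate the concern.
n_total = 100_000                # assumed population size
base_rate = 0.002                # 0.2% one-year suicide probability (upper bound from the question)
share_missing_events = 0.80      # assumed share of suicide events with missing covariates
share_missing_nonevents = 0.10   # assumed share of non-events with missing covariates

# Split the population into events (y = 1) and non-events (y = 0).
n_events = round(n_total * base_rate)
n_nonevents = n_total - n_events

# Complete-case analysis keeps only individuals whose covariates are observed.
kept_events = round(n_events * (1 - share_missing_events))
kept_nonevents = round(n_nonevents * (1 - share_missing_nonevents))

# Event rate in the sample that would actually enter the probit.
complete_case_rate = kept_events / (kept_events + kept_nonevents)

print(f"full-sample rate:   {base_rate:.4%}")
print(f"complete-case rate: {complete_case_rate:.4%}")
```

Under these assumed shares, the complete-case sample has a noticeably lower event rate than the full population, which is the distortion I am worried about when missingness differs between the two outcome groups.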