
I have a dataset with missing values in both predictors and the response. As far as I know, the data are missing not at random, so I cannot simply use listwise deletion. Instead, I employed the EM algorithm for imputation. However, there are at least three ways of accomplishing this:

  1. Just run EM on the whole dataset with predictors and the response.
  2. Strip out the response, and run EM on the predictors.
  3. Throw away any record with a missing response, and run EM on the whole dataset.
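For concreteness, here is a minimal sketch of how the three options differ in terms of what is passed to the imputation step. It is an illustration under assumptions, not necessarily the EM implementation I used: `em_impute` is a hypothetical helper that fits a multivariate normal model by EM and fills each missing entry with its conditional mean, and the toy data (`x1`, `x2`, `y`, with entries deleted completely at random) are simulated only to make the example runnable.

```python
import numpy as np

def em_impute(data, n_iter=50):
    """Hypothetical helper: fill NaNs with conditional means under a
    multivariate normal model whose mean and covariance are fitted by EM."""
    X = np.array(data, dtype=float)          # work on a copy
    n, p = X.shape
    missing = np.isnan(X)
    mu = np.nanmean(X, axis=0)
    X[missing] = np.take(mu, np.where(missing)[1])   # start from column means
    sigma = np.cov(X, rowvar=False, bias=True) + 1e-6 * np.eye(p)
    for _ in range(n_iter):
        extra = np.zeros((p, p))             # sum of conditional covariances
        for i in range(n):
            m = missing[i]
            if not m.any():
                continue
            o = ~m
            if not o.any():                  # nothing observed in this row
                X[i] = mu
                extra += sigma
                continue
            s_oo = sigma[np.ix_(o, o)]
            s_mo = sigma[np.ix_(m, o)]
            w = np.linalg.solve(s_oo, s_mo.T).T      # s_mo @ inv(s_oo)
            # E-step: conditional mean of the missing part given the observed part
            X[i, m] = mu[m] + w @ (X[i, o] - mu[o])
            extra[np.ix_(m, m)] += sigma[np.ix_(m, m)] - w @ s_mo.T
        # M-step: update the mean and covariance from the completed data
        mu = X.mean(axis=0)
        centered = X - mu
        sigma = (centered.T @ centered + extra) / n
    return X

# Simulated toy data (an assumption for illustration only)
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)     # correlated predictors
y = 1.0 + 2.0 * x1 - x2 + rng.normal(scale=0.5, size=n)
data = np.column_stack([x1, x2, y])
mask = rng.random(data.shape) < 0.15         # delete about 15% of entries
data[mask] = np.nan

# Option 1: EM on predictors and response together
full = em_impute(data)

# Option 2: EM on the predictors only (missing responses remain unhandled)
preds_only = em_impute(data[:, :2])

# Option 3: drop rows with a missing response, then EM on what is left
has_y = ~np.isnan(data[:, 2])
complete_y = em_impute(data[has_y])
```

Note that option 2 still leaves the missing responses to be dealt with before the regression, whereas options 1 and 3 both return a fully completed matrix (with fewer rows in option 3).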

To add some context, this is for my final project on Regression Analysis, and I'm going to run linear regression on the imputed data. The data also exhibit a high level of multicollinearity, in case that's relevant.

Which method should I choose?

nalzok
  • What kind of algorithm do you use to impute these values? (AFAIK, EM is an optimization algorithm, like gradient descent.) Where do the missing data occur: in a few specific columns, or at random? Lastly, how big is your data (will losing a few data points "hurt" you)? – Yohanes Alfredo Dec 29 '19 at 07:18
  • See: https://stats.stackexchange.com/questions/41628/imputation-of-missing-response-variables – kjetil b halvorsen Dec 31 '19 at 15:26

0 Answers