
I have a huge dataset (4 million rows × 17 columns) with missing values. Two columns are categorical; the rest are numerical. Given the amount of data, any imputation method I try runs forever. What should I do?

I was wondering if I could train a model on a subset of the data and then use it to impute values in the full data. For example, if I were to use the mice package, I would like something like the following to exist:

> testMice <- mice(myData[1:100000,]) # runs fine  
> testTot <- predict(testMice, myData) # hypothetical

Running the imputation on the whole dataset is computationally expensive, so I would run it on only the first 100K observations and then use the output to impute the rest of the data.
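The hypothetical `predict()` step does not exist in mice, but newer versions of the package (3.12+, if I read the changelog correctly) added an `ignore` argument that comes close: rows flagged `TRUE` are imputed but do not contribute to fitting the imputation models. A minimal sketch of the idea, assuming `myData` as above:

```r
library(mice)

## Fit the imputation models on a random 100K subset, then impute everything.
## Rows with ignore = TRUE are imputed but excluded from model fitting
## (requires a mice version that supports the `ignore` argument).
n <- nrow(myData)
trainIdx <- sample(n, 100000)        # random subset, not the first 100K rows
ignoreRows <- rep(TRUE, n)
ignoreRows[trainIdx] <- FALSE

imp <- mice(myData, m = 1, ignore = ignoreRows)
completedData <- complete(imp)       # full dataset with imputed values
```

This is a sketch, not a tested recipe: check `?mice` in your installed version to confirm the `ignore` argument is available.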

Sonu Mishra
  • Why do you say running imputation on this dataset takes forever? A 4 million by 17 dataset is quite small in my opinion, and there is no reason you shouldn't be able to run any number of imputation methods on it. – StatsStudent Jul 16 '16 at 05:08
  • I ran mice; it ran for hours. – Sonu Mishra Jul 16 '16 at 05:09
  • I'm aware of mice, but don't use it. I usually use SAS's imputation procedures which run quite quickly. I'd recommend using the method of multiple imputation by chained equations (which should be built into MICE and is what the acronym stands for) in SAS. This method iterates through your data, imputing one variable at a time conditional on the others and so it's quite speedy for large datasets. If you don't have access to SAS, you might want to try parallelizing your R program using many of the packages available for doing so. – StatsStudent Jul 16 '16 at 05:13
  • If you don't want to do what I've suggested above, what you are suggesting should be okay. Before you run your analysis, be sure you take a random sample of your data instead of grabbing just the first 100K observations, in case there is some order effect that will affect the imputations. – StatsStudent Jul 16 '16 at 05:20
  • Also, have you seen this? http://stats.stackexchange.com/questions/100020/how-to-perform-imputation-of-values-in-very-large-number-of-data-points – StatsStudent Jul 16 '16 at 05:21
  • One last thought after looking at the mice documentation briefly: try changing the imputation method from the default pmm to norm. This should speed completion of the imputation substantially. – StatsStudent Jul 16 '16 at 05:25
  • And lastly, try other imputation packages like Amelia, which seems faster too. Here's an example: http://fastml.com/impute-missing-values-with-amelia/ – StatsStudent Jul 16 '16 at 05:27
  • @StatsStudent - you did good work here and seem to have solved the problem. You could write it up into an answer. – EngrStudent Oct 29 '20 at 11:34
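
Taken together, the suggestions in the comments (random sampling, switching from pmm to norm, trying Amelia) might be sketched as follows. The column names `cat1` and `cat2` are hypothetical placeholders for the two categorical variables:

```r
library(mice)
library(Amelia)

## Random sample rather than the first 100K rows, to avoid order effects
sub <- myData[sample(nrow(myData), 100000), ]

## "norm" (Bayesian linear regression) is faster than the default pmm;
## the two categorical columns keep a categorical-appropriate method
meth <- rep("norm", ncol(sub))
names(meth) <- names(sub)
meth[c("cat1", "cat2")] <- "polyreg"   # hypothetical column names

imp <- mice(sub, method = meth, m = 5)

## Alternative: Amelia, declaring the nominal (categorical) columns
a.out <- amelia(sub, m = 5, noms = c("cat1", "cat2"))
```

Both calls are untested sketches against these data; adjust the method vector and `noms` to your actual column names.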

0 Answers