I have a huge dataset (4 million rows × 17 columns) with missing values. Two columns are categorical; the rest are numerical. Given the size of the data, any imputation method I try runs forever. What should I do?
I was wondering if I could train a model on a subset of the data and then use it to impute values in the full dataset. For example, if I were to use the mice package, I would like something like the following to exist:
> library(mice)
> testMice <- mice(myData[1:100000, ]) # runs fine
> testTot <- predict(testMice, myData) # hypothetical
Running the imputation on the whole dataset is computationally expensive, so I would run it on only the first 100K observations and then use the fitted imputation models to fill in the rest of the data.
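To make the workflow concrete, here is a rough sketch of the kind of thing I have in mind. I believe recent versions of mice (3.12+) have an ignore argument that fits the imputation models only on the non-ignored rows while still imputing every row, which seems close to what I want; myData is the data frame from above, and everything else here is my own guess rather than a tested solution:

> library(mice)
> n <- nrow(myData)
> ign <- rep(TRUE, n)      # TRUE = row is NOT used when fitting the imputation models
> ign[1:100000] <- FALSE   # fit the models on the first 100K rows (a random sample may be safer)
> imp <- mice(myData, m = 1, maxit = 5, ignore = ign, printFlag = FALSE)  # still imputes all rows
> myDataImputed <- complete(imp)  # extract the completed 4M x 17 data frame

Is something along these lines sensible, or is there a better way to scale imputation to data of this size?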