
I have a question regarding imputation that I was not able to find an answer to. Any help would be greatly appreciated.

Let's suppose I have a dataset, impute missing values using the median, train a model and test it. The model has good performance. Now I put this model into production.

When I impute missing values for new records that keep getting fed into the model, should I use the median of the original dataset, or compute the median of the original dataset plus the new records?
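Schematically, the workflow I have in mind looks like this (a toy Python/scikit-learn sketch just to fix ideas; the data and names are made up, and my actual stack is different):

```python
# Toy sketch: the median is learned once, on the training data,
# and then reused for new records (illustrative data only).
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0]])

imputer = SimpleImputer(strategy="median")
imputer.fit(X_train)                     # learns the training median (3.0 here)

X_new = np.array([[np.nan], [7.0]])      # new production records
X_new_filled = imputer.transform(X_new)  # fills with the *training* median
```

The question is whether the `transform` step should keep using the training median once new records start arriving.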

Rossella
  • Use the median of the original dataset. How else would you apply your model in a sample of size 1? – AdamO Jan 10 '18 at 14:47
  • I was also leaning towards using the median of the original dataset. Using the median of the original dataset + the one sample would also be a possibility but it does not sound right. Just wanted a second opinion. – Rossella Jan 10 '18 at 15:09
  • What you will get in that case is a good sequential estimate of the median, and predictions which are difficult to interpret and apply. – AdamO Jan 10 '18 at 15:17
  • A more general comment: using the median is not good practice for imputation. Better use a Bayesian multiple imputation model, see e.g. https://stats.stackexchange.com/questions/303722/methods-to-work-around-the-problem-of-missing-data-in-machine-learning/303737#303737 – tomka Jan 10 '18 at 16:23
  • Unfortunately I am strongly limited by what technology I am allowed to use since things need to be productionized really quickly. Python and R are out, I can only use Alteryx at the moment which has none of the advanced imputation methods. – Rossella Jan 10 '18 at 16:47
  • I do not know Alteryx, but using the median is really not a good idea. A simple (though still suboptimal) option is to use conditional mean imputation. If your outcome is continuous you could use the least square linear regression estimate as imputation, which is a simple matrix projection. – tomka Jan 10 '18 at 17:06
  • NB: you need to @ people (e.g. @tomka) as otherwise they are not notified – tomka Jan 10 '18 at 17:07
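To make tomka's conditional-mean suggestion above concrete, here is a rough sketch in Python on simulated data (all names and numbers are illustrative, not a reference implementation): regress the incomplete feature on the fully observed ones using the complete cases, then fill the gaps with the fitted predictions. As with the median, the regression coefficients would be estimated once on the training data and frozen for production use.

```python
# Sketch of conditional mean imputation for one incomplete feature:
# fit a least-squares regression on the complete cases, then fill the
# missing entries with the regression predictions.
import numpy as np

rng = np.random.default_rng(0)
X_complete = rng.normal(size=(200, 3))                    # fully observed features
y_col = X_complete @ np.array([1.0, -0.5, 2.0]) + rng.normal(size=200)
y_col[rng.random(200) < 0.1] = np.nan                     # knock out ~10% of values

obs = ~np.isnan(y_col)
A = np.column_stack([np.ones(obs.sum()), X_complete[obs]])
beta, *_ = np.linalg.lstsq(A, y_col[obs], rcond=None)     # least-squares fit

A_mis = np.column_stack([np.ones((~obs).sum()), X_complete[~obs]])
y_col[~obs] = A_mis @ beta                                # conditional mean fill
```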

1 Answer


As AdamO suggested, use the median from the original dataset.

The intuition is this: the data you used to develop the model should be large enough, and representative of the production distribution. If it is not (that is, if production data is dramatically different), the whole model-building exercise is undermined anyway. And if the model-building data is representative and large, a few new production records add essentially nothing to the estimate.

Here is a concrete example. Suppose we are building a model to predict house prices, and one feature is the number of bedrooms. The model-building data (which we can split into training and testing sets to build the model) has a median of 3 bedrooms, and this number should be representative of the housing market overall (say we had 500K data points / houses to build the model).

Now assume that in production the first day is a special day: 10 luxury houses, each with around 8 bedrooms, are fed into the model. On the second day, there are 5 houses with a missing value.

What would we do intuitively? To me it is natural to use 3 for the imputation, because that number was calculated from a large, representative body of historical data; letting the day-one luxury batch update it would pull the imputed value toward an atypical handful of records.
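As a toy numeric check of that intuition (simulated counts, a sketch only):

```python
# Toy numbers for the example above: with a large history, pooling in
# the 10 luxury houses barely moves the median anyway.
import numpy as np

rng = np.random.default_rng(42)
historical = rng.choice([2, 3, 3, 3, 4], size=500_000)  # median bedrooms = 3
luxury_day1 = np.full(10, 8)                            # day-1 luxury batch

frozen = np.median(historical)                          # stays 3.0
pooled = np.median(np.concatenate([historical, luxury_day1]))
print(frozen, pooled)                                   # 3.0 3.0
```

With a large, representative history the recomputed median buys nothing, while recomputing over a small recent window would let batches like the day-one luxury houses drag the imputed value around and make successive predictions hard to compare (AdamO's point in the comments).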

Haitao Du
  • Thanks for the example. That is what I believed as well but just wanted to make sure since it is the first time I have to put a model into production. – Rossella Jan 10 '18 at 16:14