
Building a model on a large subset of your data and testing it on a holdout set is a common practice. I am interested in the extent to which this is a justifiable approach in various data-rich settings.

Consider the case where you have a large data set in which the data-generating process may change in some ways over time. Suppose we wish to build a GLM to model these data. Some practitioners in this setting have told me that it is advisable to, for example, build your GLM on all data excluding the last 12 months' worth of observations and test its performance on that holdout. This seems crazy to me, but that model is then put into production.

While I see the value of trying to get an accurate handle on the generalization error, our overall aim is to make good predictions in the coming years. It is not hard to imagine that last year's data might be quite important for capturing any recent trends that will remain relevant in the coming years. Is there a nice example to illustrate the potential downsides of this approach? It seems that predictions could suffer quite a bit.

I have been advocating building the model on the full data and estimating overfitting via a bootstrap. Is there a case I can make?
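For concreteness, what I have in mind is something like an optimism-corrected (Efron-style) bootstrap. The sketch below is only illustrative; the Poisson GLM and squared-error loss are placeholders, not my actual setup:

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor  # stand-in for whatever GLM is used
from sklearn.metrics import mean_squared_error
from sklearn.utils import resample

def optimism_corrected_error(X, y, n_boot=200, seed=0):
    """Fit on the full data, then correct the apparent error by the
    average bootstrap optimism (Efron's optimism bootstrap)."""
    rng = np.random.RandomState(seed)
    full_model = PoissonRegressor().fit(X, y)
    apparent = mean_squared_error(y, full_model.predict(X))

    optimism = []
    for _ in range(n_boot):
        Xb, yb = resample(X, y, random_state=rng)
        m = PoissonRegressor().fit(Xb, yb)
        # optimism = (error of the bootstrap fit on the original data)
        #          - (its error on the bootstrap sample it was fit to)
        optimism.append(mean_squared_error(y, m.predict(X))
                        - mean_squared_error(yb, m.predict(Xb)))

    return apparent + np.mean(optimism)
```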

Lepidopterist
  • If the holdout test accuracy is good and you're going to production, you should train on every piece of data you have. – photox Feb 28 '17 at 00:20

1 Answer


You generally use cross-validation for two purposes:

  1. Model selection and model structure selection
  2. To determine the generalization capability of the model

Once you have completed the above steps, train your model on all of the data.

As @photox commented, you are advised to use the full data (all samples) to train the final model that will be deployed in production.
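A rough sketch of that workflow (the candidate GLMs and the toy data here are hypothetical, just to show the pattern of "select by cross-validation, then refit on everything"):

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor, TweedieRegressor
from sklearn.model_selection import cross_val_score

# Toy data standing in for your design matrix X and response y.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.poisson(np.exp(0.3 * X[:, 0]))

# Hypothetical candidate model structures.
candidates = {
    "poisson": PoissonRegressor(alpha=1.0),
    "tweedie": TweedieRegressor(power=1.5, alpha=1.0),
}

# 1) Model / structure selection and a generalization estimate via cross-validation.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
best_name = max(scores, key=scores.get)

# 2) Refit the selected structure on *all* of the data before deployment.
final_model = candidates[best_name].fit(X, y)
print(best_name, scores)
```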

However, if you want to determine the generalization capability of the model and your data form a time series, you are advised to validate on time-ordered holdouts such as the last 12 months of data. This technique, often called rolling-origin cross-validation or forward chaining, is popular in the time-series domain. For details see here.
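A minimal sketch of forward chaining using scikit-learn's TimeSeriesSplit (the Poisson GLM and the toy monthly data are placeholders; the only real assumption is that the rows are in time order):

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor
from sklearn.model_selection import TimeSeriesSplit

# Toy data: 10 years of monthly observations, rows ordered by time.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = rng.poisson(np.exp(0.2 * X[:, 0]))

# Each fold trains on an expanding window of the past and validates on the
# 12 months that immediately follow it -- the forward-chaining idea.
tscv = TimeSeriesSplit(n_splits=5, test_size=12)
for train_idx, test_idx in tscv.split(X):
    model = PoissonRegressor().fit(X[train_idx], y[train_idx])
    score = model.score(X[test_idx], y[test_idx])
    print(f"train rows 0-{train_idx[-1]} | "
          f"validate rows {test_idx[0]}-{test_idx[-1]} | D^2 = {score:.3f}")
```

After this validation step, the final model would again be refit on all of the data, as described above.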

discipulus