
Many of the datasets at the company where I'm currently an intern are very large: millions of rows, and gigabytes or even terabytes of data.

While running machine learning experiments, I want to use (cross-validated) grid search to optimize the hyperparameters of the models I train. On datasets of this size, that is a very costly affair time-wise.

This made me wonder: is it a valid approach to take a smaller random (or perhaps stratified?) subsample of the dataset for hyperparameter tuning, and then use the resulting parameters to train the final model on a large portion of the dataset, or even on the dataset as a whole?
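Concretely, the workflow I have in mind looks roughly like this (a minimal sketch assuming scikit-learn; the estimator, parameter grid, and subsample fraction are placeholders, not recommendations):

```python
# Sketch: tune on a stratified subsample, then refit on the full data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Stand-in for a large dataset.
X, y = make_classification(n_samples=100_000, n_features=20, random_state=0)

# Draw a stratified 5% subsample for the expensive grid search.
X_sub, _, y_sub, _ = train_test_split(
    X, y, train_size=0.05, stratify=y, random_state=0
)

param_grid = {"n_estimators": [100, 300], "min_samples_split": [2, 10, 50]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, cv=5, n_jobs=-1
)
search.fit(X_sub, y_sub)

# Re-train on the full dataset with the parameters found on the subsample.
final_model = RandomForestClassifier(random_state=0, **search.best_params_)
final_model.fit(X, y)
```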

kjetil b halvorsen
TBZ92
  • Depending on the model that you're using, hyperparameters may not be portable between data of different sizes. One example is the minimum number of observations for a random forest to attempt a split: at small data volumes, some ranges of this number are effectively worthless, but may be productive at larger volumes. – Sycorax Oct 26 '16 at 15:18

1 Answer


This question is really broad. Depending on the data and the model, subsampling can be good practice or bad practice.

The overall idea is to think about the complexity of the data and of the model. It helps to review the bias-variance trade-off, i.e., when under-fitting and over-fitting happen and how to detect them:

How to know if a learning curve from SVM model suffers from bias or variance?
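For instance, a quick way to compute such a learning curve (a minimal sketch assuming scikit-learn; the synthetic data and the SVM estimator are placeholders for your own):

```python
# Sketch: learning-curve diagnostic for bias vs. variance.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    SVC(kernel="rbf"), X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5), n_jobs=-1,
)

print("train accuracy:", train_scores.mean(axis=1))
print("valid accuracy:", val_scores.mean(axis=1))
# Both curves low and close together -> high bias (under-fitting).
# Large, persistent gap between them -> high variance (over-fitting).
```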

To your question about tuning on samples: in general, the more complex the data, the harder it is to get a "representative" sample of limited size.

If the data is really complex and your sample is not "representative", using the sample to tune parameters is bad practice. The way to fix this is to use a larger sample, and a more complex model (such as a neural network).
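One rough way to check whether a sample at least looks representative in its marginal distributions (a minimal sketch assuming scipy; the data here are synthetic placeholders, and this says nothing about joint structure, so treat it as a sanity check only):

```python
# Sketch: per-feature two-sample Kolmogorov-Smirnov test, full data vs. subsample.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
X_full = rng.normal(size=(100_000, 5))   # stand-in for the full dataset
idx = rng.choice(len(X_full), size=5_000, replace=False)
X_sub = X_full[idx]                      # a random subsample

for j in range(X_full.shape[1]):
    stat, p = ks_2samp(X_full[:, j], X_sub[:, j])
    print(f"feature {j}: KS statistic = {stat:.3f}, p = {p:.3f}")
# Small KS statistics suggest the marginals match.
```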

You can see that my answer is vague in many places; this is because it is hard to pin down the complexity of the data and the model, or how many samples are needed to be "representative".

Haitao Du
    Thank you, this is a helpful perspective! I hadn't thought of it in this way yet. I also like your answer on the "Learning curve - bias or variance?" question. – TBZ92 Oct 27 '16 at 07:56