
I am training a machine learning model (e.g., a classifier) on a large dataset. I know I can get the same results using less data (about 30% of it), but I would like to avoid the trial-and-error process of finding the 'right' amount of data to retain from the dataset.

Of course I could create a script that automatically tries different thresholds, but I was wondering whether there is a principled way of doing this. It seems strange that nobody has tried to create a proper solution, since this seems to be a very common problem.

Some additional criteria:

  • I am subsampling a stream of data, so it would be better to find something that works in this setting
  • I would prefer to avoid training the classifier more than once, since training takes some time
  • I appreciate theoretically justified approaches
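For concreteness, the kind of stream subsampling I mean can be done with reservoir sampling, which keeps a uniform random sample of fixed size `k` from a stream of unknown length in one pass. This is just a minimal sketch of the setting (the function name and parameters are my own), not a proposed solution to the threshold-selection problem itself:

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace an existing item with probability k / (i + 1).
            j = rng.randint(0, i)  # uniform over 0..i inclusive
            if j < k:
                reservoir[j] = item
    return reservoir

subset = reservoir_sample(range(100_000), k=30_000)
```

The open question is how to choose `k` (the retention fraction) in a principled way rather than by trial and error.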

Any suggestion or reference?

