
I am training a machine learning model (e.g., a classifier) on a large dataset. I know I can get the same results using less data (about 30% of it), but I would like to avoid the trial-and-error process of finding the 'right' amount of data to retain from the dataset.

Of course I could create a script that automatically tries different thresholds, but I was wondering whether there is a principled way of doing this. It seems strange that nobody has tried to create a proper solution, since this seems to be a very common problem.

Some additional criteria:

  • I am subsampling a stream of data, so it would be better to find something that works in this setting
  • I would prefer to avoid training the classifier more than once, since training takes some time
  • I appreciate theoretically justified approaches
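For concreteness, the kind of stream subsampling I mean can be done with reservoir sampling, which keeps a uniform random sample of fixed size `k` from a stream of unknown length in one pass. This is just a minimal sketch of the setting (the function name and parameters are my own), not a proposed solution to the threshold-selection problem itself:

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace an existing item with probability k / (i + 1).
            j = rng.randint(0, i)  # uniform over 0..i inclusive
            if j < k:
                reservoir[j] = item
    return reservoir

subset = reservoir_sample(range(100_000), k=30_000)
```

The open question is how to choose `k` (the retention fraction) in a principled way rather than by trial and error.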

Any suggestion or reference?

