
When I watch presentations on work that used machine learning algorithms, the split of data between the training and validation sets often seems somewhat arbitrary. Sometimes it's 80-20, sometimes it's 90-10. My completely naive approach would be 50-50 (because, hey, that sounds like a fair division, right?).

Is there any actual math out there which demonstrates the optimal way to size the sets, or is it all based on, "This seems to work okay most of the time"?

Andrew Klaassen
    Hope this helps, http://stats.stackexchange.com/questions/81820/why-is-k-fold-cross-validation-a-better-idea-than-k-times-resampling-true-valida/81824#81824 – lennon310 Mar 10 '14 at 21:09
  • This has a good discussion of the same topic: http://stackoverflow.com/questions/13610074/is-there-a-rule-of-thumb-for-how-to-divide-a-dataset-into-training-and-validatio – Bach Apr 16 '16 at 04:03

1 Answer


There is no such math, because it would be invalid. Statistics is the crowbar applied to natural philosophy to make science, to make sense of the universe. There is no "silver bullet" that makes the universe fall into perfect understanding.

There are infinitely many cases. Some are inverses of each other, some are extreme, and some are similar to one another. For any case where a given assumption is true, there are infinitely many cases for which it is false.

Heuristics are often based on experience, but they have the advantage that they tend to work in many cases, often the more frequently observed ones. They are a generalized abstraction.

Approaches considered (a few are sketched in code after this list):

  • use train-validate-test to determine the general form of the model, then use 100% of the data for the fit.
  • leave-one-out validation
  • cross validation, where the split-train-validate process is repeated several times and the ensemble result is evaluated
  • various splits (50/50, 80/20, 90/10)
  • look at the distribution of the results, and use it to inform the split
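
As a rough sketch, here is how a few of these approaches look in code, assuming Python with scikit-learn; the toy regression dataset and the ridge model are arbitrary illustrative choices, not anything prescribed by the question.

```python
# A rough sketch, assuming scikit-learn; the toy dataset and ridge
# model are arbitrary illustrative choices.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import (
    KFold, LeaveOneOut, cross_val_score, train_test_split)

X, y = make_regression(n_samples=200, n_features=10, noise=10.0,
                       random_state=0)
model = Ridge(alpha=1.0)

# Various hold-out splits (50/50, 80/20, 90/10): the validation score
# depends on how the data happened to be divided.
for test_frac in (0.5, 0.2, 0.1):
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=test_frac, random_state=0)
    r2 = model.fit(X_tr, y_tr).score(X_val, y_val)
    print(f"hold-out {1 - test_frac:.0%}/{test_frac:.0%}: R^2 = {r2:.3f}")

# Cross-validation: repeat the split-train-validate process and look
# at the ensemble (mean and spread) of the results.
cv = cross_val_score(model, X, y,
                     cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(f"5-fold CV: R^2 = {cv.mean():.3f} +/- {cv.std():.3f}")

# Leave-one-out validation (one fit per sample, so it can be costly;
# R^2 is undefined on single-sample folds, hence MSE here).
loo = cross_val_score(model, X, y, cv=LeaveOneOut(),
                      scoring="neg_mean_squared_error")
print(f"leave-one-out: MSE = {-loo.mean():.3f}")

# Once the general form of the model is settled, refit on 100% of the data.
final_model = Ridge(alpha=1.0).fit(X, y)
```

Note that the hold-out score itself moves around with the chosen split, which is one reason the repeated, ensembled approaches are often preferred over any single ratio.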
EngrStudent