In a book I'm reading I've come across the following quote:

> Accuracy on the test set is a good performance measure only when there is a relatively uniform distribution of target categories in the training set.

If the real-world categories are distributed 10% A and 90% B, should the training categories also be 10% A and 90% B, or should they be 50% A and 50% B?
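To make the concern in the quote concrete, here is a minimal sketch in plain Python. It assumes a hypothetical 100-item test set with the 10% A / 90% B split from above and a trivial classifier that always predicts B. The helper names `accuracy` and `recall` are my own for illustration, not taken from the book or from any library.

```python
# Hypothetical test set: 10% class A, 90% class B (the split from the question).
y_true = ["A"] * 10 + ["B"] * 90

# Assumed trivial classifier that ignores its input and always predicts B.
y_pred = ["B"] * len(y_true)


def accuracy(true_labels, predicted_labels):
    """Fraction of items whose predicted label matches the true label."""
    hits = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return hits / len(true_labels)


def recall(true_labels, predicted_labels, label):
    """Fraction of items that truly belong to `label` and were predicted as `label`."""
    preds_for_label = [p for t, p in zip(true_labels, predicted_labels) if t == label]
    return sum(p == label for p in preds_for_label) / len(preds_for_label)


acc = accuracy(y_true, y_pred)
recall_a = recall(y_true, y_pred, "A")
recall_b = recall(y_true, y_pred, "B")

# Balanced accuracy: the unweighted mean of the per-class recalls.
balanced_acc = (recall_a + recall_b) / 2

print(f"accuracy={acc:.2f} recall_A={recall_a:.2f} "
      f"recall_B={recall_b:.2f} balanced_accuracy={balanced_acc:.2f}")
```

In this sketch plain accuracy looks high even though no item of class A is ever identified, while the per-class recalls expose the problem. It only illustrates why accuracy can mislead on a skewed distribution; it does not by itself say which training distribution is preferable.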

Stephan Kolassa
dotancohen
  • [Searching for "oversampling"](https://stats.stackexchange.com/search?q=oversampling) gives you many hits, e.g., [What problem does oversampling, undersampling, and SMOTE solve?](https://stats.stackexchange.com/q/285231/1352) and [When is oversampling poor practice?](https://stats.stackexchange.com/q/297887/1352) and [When is unbalanced data really a problem in Machine Learning?](https://stats.stackexchange.com/q/283170/1352) – Stephan Kolassa Nov 21 '17 at 11:13
  • In addition, [accuracy has problems even for balanced data](https://stats.stackexchange.com/a/312787/1352). – Stephan Kolassa Nov 21 '17 at 11:15
  • @Sycorax Thank you for the link. I'm going over it now, and though it is full of relevant information it really doesn't answer my question. I'll see if I can use it to either formulate my own answer or I'll add additional minutiae to the question if necessary. **Thank you**. – dotancohen Jul 25 '18 at 09:30
  • @steffen Thank you, that link is much more relevant! I'll go over it more in depth, in fact this might be a dupe of it. If so, I'll close it as such. – dotancohen Jul 25 '18 at 15:27
