Training set target categories' distribution

Question

In a book I'm reading I've come across the following quote:

> Accuracy on the test set is a good performance measure only when there is a relatively uniform distribution of target categories in the training set.

If the real-world categories are distributed 10% A and 90% B, should the training categories also be 10% A and 90% B, or should they be 50% A and 50% B?
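To make the concern behind the quote concrete, here is a minimal sketch of why accuracy can mislead at a 10% / 90% split. The labels, the set size of 1,000, and the always-predict-B "classifier" are all made up for illustration; nothing here comes from the book.

```python
# Hypothetical test set matching the 10% A / 90% B split from the question.
y_true = ["A"] * 100 + ["B"] * 900

# A trivial "classifier" that ignores its input and always predicts B.
y_pred = ["B"] * len(y_true)


def accuracy(truth, predicted):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(truth, predicted)) / len(truth)


def recall(truth, predicted, label):
    """Fraction of the true `label` cases that were predicted as `label`."""
    predictions_for_label = [p for t, p in zip(truth, predicted) if t == label]
    return sum(p == label for p in predictions_for_label) / len(predictions_for_label)


baseline_accuracy = accuracy(y_true, y_pred)
recall_a = recall(y_true, y_pred, "A")

print(f"accuracy: {baseline_accuracy:.2f}")
print(f"recall for A: {recall_a:.2f}")
```

Under these assumptions the do-nothing model scores 90% accuracy while never identifying a single A, which is the sense in which accuracy alone says little when the categories are far from uniform.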

[Searching for "oversampling"](https://stats.stackexchange.com/search?q=oversampling) gives you many hits, e.g., [What problem does oversampling, undersampling, and SMOTE solve?](https://stats.stackexchange.com/q/285231/1352) and [When is oversampling poor practice?](https://stats.stackexchange.com/q/297887/1352) and [When is unbalanced data really a problem in Machine Learning?](https://stats.stackexchange.com/q/283170/1352) — Stephan Kolassa, Nov 21 '17 at 11:13
In addition, [accuracy has problems even for balanced data](https://stats.stackexchange.com/a/312787/1352). — Stephan Kolassa, Nov 21 '17 at 11:15
@Sycorax Thank you for the link. I'm going over it now, and though it is full of relevant information it really doesn't answer my question. I'll see if I can use it to either formulate my own answer or I'll add additional minutiae to the question if necessary. **Thank you**. — dotancohen, Jul 25 '18 at 09:30
@steffen Thank you, that link is much more relevant! I'll go over it more in depth, in fact this might be a dupe of it. If so, I'll close it as such. — dotancohen, Jul 25 '18 at 15:27


0 Answers