What should the distribution of images for a classification problem look like?

Question

I know that ideally the two classes in a binary classification problem should be balanced: each class should have roughly the same number of images so that the network isn't biased toward one of them. But why is this so? Are there papers that discuss or explain this problem? I am writing my thesis and have two classes. One class has 2,000 images; the other has more than 6,000. Because I only have 2,000 images for the first class, I used only 2,000 images from the other class for training as well.

Would you say this is the correct approach? If so, why should it be done this way (with a reference to a scientific paper, if possible)? Maybe I am searching for the wrong terms, but I couldn't find any paper on this.
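
For concreteness, the subsampling described above (keeping every image of the smaller class and drawing the same number at random from the larger one) can be sketched roughly as follows. The function name `undersample`, the file names, the count of 6,000 for the larger class, and the seed are all placeholders for illustration, not the actual data or training code.

```python
import random


def undersample(class_a, class_b, seed=0):
    """Return two equally sized lists by randomly subsampling the larger class."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    n = min(len(class_a), len(class_b))
    # rng.sample draws n distinct items without replacement
    return rng.sample(class_a, n), rng.sample(class_b, n)


# Hypothetical file names standing in for the real images
class_a = [f"class_a/img_{i:04d}.png" for i in range(2000)]
class_b = [f"class_b/img_{i:04d}.png" for i in range(6000)]

balanced_a, balanced_b = undersample(class_a, class_b)
print(len(balanced_a), len(balanced_b))  # 2000 2000
```

With these placeholder counts, 4,000 images of the larger class are never seen during training, which is part of what the question is about.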

Good news! Class imbalance is not a problem!
https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he
https://www.fharrell.com/post/class-damage/
https://www.fharrell.com/post/classification/
https://stats.stackexchange.com/a/359936/247274
https://stats.stackexchange.com/questions/464636/proper-scoring-rule-when-there-is-a-decision-to-make-e-g-spam-vs-ham-email
https://twitter.com/f2harrell/status/1062424969366462473?lang=en
- Dave, Jul 07 '21 at 12:02


0 Answers