
Assume I have a couple of thousand hens that I want to classify into those that will never lay an egg and those that will lay an egg at some point in their life. Assume that this classification already works perfectly.

Now there are a few hens that do lay eggs, but at some point will stop laying for a couple of years. Those hens are a really small minority - let's say a hundred.

Now I want my network to classify such a hen - one that will lay an egg at some point, but won't do so for a couple of years - as a third class.

My intuition tells me that if I oversample the minority class, my model will simply memorize those hundred examples and fail to generalize.

However, when using class weights, my intuition tells me that my model can't memorize those samples, because it doesn't encounter them as often - kind of like how a higher learning rate leads to better generalization but worse fitting due to the coarse steps.

However, all the posts on Cross Validated actually say that oversampling works better - but why? And is that also the case for really small classes like mine?

BigBadWolf
  • Good news! Class imbalance is not a problem! Where on Cross Validated are you reading that oversampling is the way to go? https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he https://www.fharrell.com/post/class-damage/ https://www.fharrell.com/post/classification/ https://stats.stackexchange.com/a/359936/247274 https://stats.stackexchange.com/questions/464636/proper-scoring-rule-when-there-is-a-decision-to-make-e-g-spam-vs-ham-email https://twitter.com/f2harrell/status/1062424969366462473?lang=en – Dave Jun 18 '21 at 12:52
  • It seems I accidentally landed on the right page. I was reading mostly on datascience.stackexchange, now I'm going on a deep dive through those links you sent me. I do get that accuracy is a bad measure - but what would constitute a better loss function for machine learning? I can't really find that in any of those questions. – BigBadWolf Jun 18 '21 at 13:57
  • Log loss and Brier score (square loss) are the big ones. The term to search is "strictly proper scoring rule". – Dave Jun 18 '21 at 14:09
  • You're mentioning using a "network" for a hundred samples. That doesn't sound like a great idea. Neural networks flourish when you have lots of data. I'd recommend starting with a much simpler algorithm. – Tim Jun 18 '21 at 21:58
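
To make the scoring rules mentioned in the comments concrete: log loss and the Brier score are both evaluated on predicted probabilities rather than on thresholded class labels. A minimal sketch with made-up numbers, assuming scikit-learn:

```python
# Hedged sketch (toy numbers, not from the thread): the two strictly proper
# scoring rules mentioned in the comments, computed on predicted probabilities.
from sklearn.metrics import brier_score_loss, log_loss

y_true = [0, 0, 1, 1, 1]             # observed labels
p_hat = [0.1, 0.4, 0.35, 0.8, 0.9]   # predicted P(y = 1)

print(log_loss(y_true, p_hat))           # average -log(probability assigned to the true class)
print(brier_score_loss(y_true, p_hat))   # mean squared error of the probabilities
```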

1 Answer


This depends at least a little on the model being used. Most often, simple oversampling is asymptotically equivalent to using class weights: an integer weight $w$ on a data point has the same effect on the loss calculation as duplicating that data point $w$ times. Oversampling, then, is just a discrete version of class weighting, so asymptotically the two should be equivalent; and for small sample sizes it isn't clear that the discrete version should lead to consistently more or less overfitting.
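
A minimal sketch of that equivalence, assuming scikit-learn, a toy dataset, and logistic regression (my own illustration, not part of the original answer): fitting with integer sample weights and fitting on the correspondingly duplicated rows minimize the same objective, so they give essentially the same coefficients.

```python
# Sketch: an integer sample weight w has the same effect on the training loss
# as physically duplicating the row w times, shown with logistic regression.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
w = np.where(y == 1, 3, 1)          # give class 1 an integer weight of 3

# Variant 1: pass the weights into the loss.
clf_weighted = LogisticRegression().fit(X, y, sample_weight=w)

# Variant 2: oversample by repeating each row w times.
X_dup, y_dup = np.repeat(X, w, axis=0), np.repeat(y, w)
clf_oversampled = LogisticRegression().fit(X_dup, y_dup)

# Both minimize the same objective, so the coefficients agree up to solver tolerance.
print(np.abs(clf_weighted.coef_ - clf_oversampled.coef_).max())
```

In scikit-learn, `class_weight` is essentially shorthand for such per-sample weights, so the same reasoning applies there.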

If your model does any bagging, though, things change: with oversampling, each bootstrap sample is likely to include only a subset of the duplicates of a given point, whereas with weighting the subsetting happens on the original rows before the weights come into play. However, it's still not clear to me whether the final effect will be positive or negative in terms of overfitting. (Unless you're also planning on using out-of-bag scores, in which case oversampling would be quite bad, being very similar to resampling before splitting in cross-validation.)
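
To see why bagging changes things, here is a rough sketch (my own numbers, plain NumPy, not from the original answer) of how often a single rare row ends up in a bootstrap sample when it has been duplicated ten times versus kept as one weighted row:

```python
# Sketch: how bagging treats one rare row under oversampling vs. weighting.
# With 10 physical duplicates the row is in-bag in essentially every bootstrap
# sample; as a single weighted row it is in-bag only ~63% of the time, and its
# weight matters only when it is actually drawn.
import numpy as np

rng = np.random.default_rng(0)
n = 1000          # original rows; row 0 is the rare point
w = 10            # oversampling factor / weight of the rare point
n_boot = 2000     # number of simulated bootstrap samples

# Oversampled data: rows 0..9 are the copies of the rare point.
n_over = n + w - 1
p_in_bag_over = np.mean([
    np.any(rng.integers(0, n_over, size=n_over) < w) for _ in range(n_boot)
])

# Weighted data: the bootstrap is drawn from the original n rows only.
p_in_bag_weighted = np.mean([
    np.any(rng.integers(0, n, size=n) == 0) for _ in range(n_boot)
])

print(f"P(rare point in bag), oversampled: {p_in_bag_over:.3f}")      # ~1.000
print(f"P(rare point in bag), weighted:    {p_in_bag_weighted:.3f}")  # ~0.632
```

This is also the sense in which out-of-bag scores become optimistic under oversampling: the trees that are nominally "out of bag" for the rare point have almost certainly seen one of its copies.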

Ben Reiniger