How to manually balance unbalanced multi-class/multi-label data?

Question

I have a multi-class, multi-label classification problem: each sample can have more than one label associated with it, and there are M possible labels in total.

e.g.:

y[0] = [0]
y[1] = [0, 1]
y[2] = [1, 4, 3, 0]
y[3] = [0, 1]
...
y[100] = [1, 0, 3]

Counting the number of occurrences of each label, I can see that some labels are far more frequent than others. In the example above, for instance, 0 appears more often than 1, 3, and 4.

I can't figure out a smart (over-)sampling strategy to have a dataset where each label appears approximately the same number of times.

Any papers or ideas on that?

Count the frequencies of all the labels in a table and use these as weights? — user2974951, Oct 08 '18 at 06:35
Here is a paper that discusses the same problem: Giraldo-Forero, et al (2013). Managing Imbalanced Data Sets in Multi-label Problems: A Case Study with the SMOTE Algorithm, J. Ruiz-Shulcloper and G. Sanniti di Baja (Eds.): CIARP 2013, Part I, LNCS 8258, pp. 334–342 ([pdf](https://link.springer.com/content/pdf/10.1007%2F978-3-642-41822-8_42.pdf)) — nightrain, Oct 15 '18 at 15:28
Good news! Class imbalance is not a problem! https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he https://www.fharrell.com/post/class-damage/ https://www.fharrell.com/post/classification/ https://stats.stackexchange.com/a/359936/247274 https://stats.stackexchange.com/questions/464636/proper-scoring-rule-when-there-is-a-decision-to-make-e-g-spam-vs-ham-email https://twitter.com/f2harrell/status/1062424969366462473?lang=en — Dave, May 10 '21 at 03:24

score 0 · Answer 1 · answered Jul 30 '19 at 00:10

Since each sample can have more than one label associated with it, one possible solution is to train $M$ logistic regression models, one per label. The response variable for label $i$ and sample $n$ is $Y_{i,n}$, where $Y_{i,n}=1$ if sample $n$ belongs to class $i$ and $Y_{i,n}=0$ if it does not.

For a new sample $Z$, the $M$ model outputs can be interpreted as the probabilities that $Z$ belongs to each individual class.
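A rough sketch of this idea in plain Python, under stated assumptions: the feature vectors `X`, the label sets `y`, and the helpers `to_indicator`, `fit_logistic`, and `predict_proba` are all hypothetical names and toy data invented for illustration, and the fitting routine is a bare-bones batch gradient descent rather than a tuned solver.

```python
import math

def to_indicator(y, num_labels):
    """Return Y with Y[i][n] = 1 if sample n carries label i, else 0."""
    return [[1 if i in labels else 0 for labels in y] for i in range(num_labels)]

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def fit_logistic(X, targets, lr=0.5, epochs=2000):
    """Fit one binary logistic regression by batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, t in zip(X, targets):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) - t
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(models, z):
    """One probability per label, each from its own independent model."""
    return [sigmoid(sum(wj * zj for wj, zj in zip(w, z)) + b) for w, b in models]

# Toy data (made up): label 0 is present when the first feature is positive,
# label 1 when the second feature is positive.
X = [[-1.0, 2.0], [1.0, -2.0], [2.0, 1.0], [1.5, 2.0], [2.0, -1.0], [-2.0, 1.0]]
y = [[1], [0], [0, 1], [0, 1], [0], [1]]
M = 2

Y = to_indicator(y, M)                          # Y[i][n] as defined above
models = [fit_logistic(X, Y[i]) for i in range(M)]

Z = [2.0, -2.0]                                 # a new sample
probs = predict_proba(models, Z)
print([round(p, 3) for p in probs])
```

Because the $M$ models are fitted independently, the probabilities for a sample need not sum to 1, which is what allows several labels to be predicted at once. In practice a library implementation would likely replace the hand-written loop, for example scikit-learn's `OneVsRestClassifier` wrapped around `LogisticRegression`.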
