
I have a multi-class, multi-label classification problem, i.e. each sample can have more than one label associated with it, and there are M possible labels in total.

e.g.:

  • y[0] = [0]
  • y[1] = [0, 1]
  • y[2] = [1, 4, 3, 0]
  • y[3] = [0, 1]
  • ...
  • y[100] = [1, 0, 3]

Counting the number of occurrences of each label, I can see that some labels are way more frequent than others. In the example above, for instance, 0 appears more often than 1, 3 and 4.
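The count described above can be sketched as follows. This is only an illustration: the list `y` holds just the samples written out in the example (the elided ones are left out), and `label_counts` is a hypothetical helper name.

```python
from collections import Counter
from itertools import chain

def label_counts(y):
    """Count how many samples carry each label."""
    # Flatten the per-sample label lists and tally each label.
    return Counter(chain.from_iterable(y))

# Only the samples shown explicitly in the example; the elided ones are omitted.
y = [[0], [0, 1], [1, 4, 3, 0], [0, 1], [1, 0, 3]]

counts = label_counts(y)
for label, count in sorted(counts.items()):
    print(f"label {label}: {count} sample(s)")
```

On these five samples, label 0 occurs in every sample while label 4 occurs in only one.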

I can't figure out a smart (over-)sampling strategy to have a dataset where each label appears approximately the same number of times.

Any papers or ideas on that?

fsamu
  • Count the frequencies of all the labels in a table and use these as weights? – user2974951 Oct 08 '18 at 06:35
  • Here is a paper that discusses the same problem: Giraldo-Forero, et al (2013). Managing Imbalanced Data Sets in Multi-label Problems: A Case Study with the SMOTE Algorithm, J. Ruiz-Shulcloper and G. Sanniti di Baja (Eds.): CIARP 2013, Part I, LNCS 8258, pp. 334–342 ([pdf](https://link.springer.com/content/pdf/10.1007%2F978-3-642-41822-8_42.pdf)) – nightrain Oct 15 '18 at 15:28
  • Good news! Class imbalance is not a problem! https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he https://www.fharrell.com/post/class-damage/ https://www.fharrell.com/post/classification/ https://stats.stackexchange.com/a/359936/247274 https://stats.stackexchange.com/questions/464636/proper-scoring-rule-when-there-is-a-decision-to-make-e-g-spam-vs-ham-email https://twitter.com/f2harrell/status/1062424969366462473?lang=en – Dave May 10 '21 at 03:24

1 Answer


Since each sample can have more than one label associated with it, one possible solution is to train $M$ logistic regression models, one per label. For label $i$ and sample $n$, the response variable is $Y_{i,n}$, where $Y_{i,n}=1$ if sample $n$ belongs to class $i$ and $Y_{i,n}=0$ if it does not.

For a new sample $Z$, the $M$ model outputs can be interpreted as the probabilities that $Z$ belongs to each individual class.
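A minimal sketch of this one-model-per-label setup, in plain Python, might look like the following. The feature matrix `X`, the new sample `z`, the learning rate and the epoch count are all made up for illustration, and the helper names are hypothetical. In practice a library implementation of logistic regression would normally replace the hand-written gradient descent.

```python
import math

def binary_targets(y, num_labels):
    """Return Y where Y[i][n] = 1 if sample n has label i, else 0."""
    return [[1 if i in labels else 0 for labels in y] for i in range(num_labels)]

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def fit_logistic(X, targets, lr=0.1, epochs=500):
    """Fit one logistic regression by batch gradient descent; return (weights, bias)."""
    d = len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, t in zip(X, targets):
            # Gradient of the log loss w.r.t. the linear score is (prediction - target).
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) - t
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def predict_proba(models, z):
    """Return one probability per label for a new sample z."""
    return [sigmoid(sum(wj * zj for wj, zj in zip(w, z)) + b) for w, b in models]

# Label sets from the question's example (elided samples omitted); M = 5 labels.
y = [[0], [0, 1], [1, 4, 3, 0], [0, 1], [1, 0, 3]]
M = 5

# Hypothetical two-dimensional features, invented purely for this sketch.
X = [[0.1, 0.2], [0.4, 0.1], [0.9, 0.8], [0.5, 0.2], [0.7, 0.6]]

Y = binary_targets(y, M)                      # Y[i][n] as defined in the answer
models = [fit_logistic(X, Y[i]) for i in range(M)]

z = [0.8, 0.7]                                # hypothetical new sample
probs = predict_proba(models, z)
print([round(p, 3) for p in probs])
```

Each of the `M` entries in `probs` is an independent per-label probability, so they need not sum to 1.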

David Veitch