I have a dataset with 83% of positive class and 17% of negative class.

This seems to be an imbalanced dataset based on the outcome class proportions, and the features also don't show any variability that would differentiate the two classes.

So, I would like to know whether there is any standard measure that can indicate how imbalanced the data is.

For example: a value of 1 would mean the data is highly imbalanced, and a value of 0 would mean it is perfectly balanced.

Is there any way to compute this in Python?

kjetil b halvorsen
The Great
  • [Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?](https://stats.stackexchange.com/q/357466/1352) – Stephan Kolassa May 27 '21 at 15:20
  • It sounds like you are asking for a universal way to translate the clear, quantitative numbers "83% and 17%" into a qualitative expression like "highly imbalanced." That sounds like a giant step backwards. Could you explain what the purpose of this might be? – whuber May 27 '21 at 15:23
  • I am asking whether there is a standard measure (like some coefficient) that can indicate how balanced or imbalanced a dataset is. Sometimes I may think that a class proportion of 66:34 is imbalanced, while others may not consider it heavily imbalanced. So I was thinking that an objective measure of the balance of the dataset would be helpful, so that I can decide whether to oversample it or not. – The Great May 27 '21 at 15:32
  • To measure if the classes are imbalanced, why not just measure the proportion or the Bernoulli variance $p(1-p)$? Those won't put the imbalances into objective bins like "moderate imbalance" and "incontestable, severe imbalance" but such bins are not so desirable. // Regarding oversampling, please read the link in Stephan Kolassa's comment. Class imbalance is less of a problem than you think. – Dave May 27 '21 at 15:54

1 Answer

Simply represent imbalance as the percentage of examples in the minority class. It's simple and everyone understands it. Do not represent it as a ratio: ratios are highly non-linear and don't generalize to multi-class problems.

In your case, report the minority class as 17%. A perfectly balanced binary dataset would have 50% in the minority class. With 100 perfectly balanced classes, every class, including the smallest, would hold 1%, which is already a hard problem; if the minority class instead held 0.001%, we would know the data is even more imbalanced. It can also be useful to report the number of train/test examples in the minority class.
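Since the question asks for a Python approach, here is a minimal sketch of that report using only the standard library. The function name `minority_share` and the example labels are assumptions made for illustration; the labels are synthetic and simply reproduce the 83% / 17% split from the question.

```python
from collections import Counter


def minority_share(labels):
    """Return (label, count, percent) for the rarest class in `labels`.

    `minority_share` is a hypothetical helper name, not a library function.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    # Pick the class with the fewest examples.
    label, count = min(counts.items(), key=lambda kv: kv[1])
    return label, count, 100.0 * count / total


# Synthetic labels matching the split in the question (assumed data).
labels = [1] * 83 + [0] * 17

label, count, percent = minority_share(labels)
print(f"minority class {label!r}: {count} examples ({percent:.1f}%)")
```

Reporting the raw count next to the percentage matters because 17% of 100 rows and 17% of a million rows are very different situations for training and testing.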

If you can or want to report more than a single scalar, report all the class percentages, sorted (or summarize them as the minimum and median across the classes). This lets us differentiate between a dataset with one rare class and a dataset with several rare classes.
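A possible sketch of the multi-class report, again with only the standard library. The helper names `class_percentages` and `summarize`, and both example label lists, are hypothetical and chosen only to show how the minimum and median separate the two situations.

```python
from collections import Counter
from statistics import median


def class_percentages(labels):
    """Return (label, percent) pairs sorted from rarest to most common."""
    counts = Counter(labels)
    total = sum(counts.values())
    pairs = ((label, 100.0 * n / total) for label, n in counts.items())
    return sorted(pairs, key=lambda kv: kv[1])


def summarize(labels):
    """Return (min_percent, median_percent) across the classes."""
    percents = [p for _, p in class_percentages(labels)]
    return min(percents), median(percents)


# Assumed toy data: one rare class versus several rare classes.
one_rare = ["a"] * 48 + ["b"] * 48 + ["c"] * 4
several_rare = ["a"] * 92 + ["b"] * 4 + ["c"] * 4

print(class_percentages(one_rare))
print(summarize(one_rare))      # minimum is low, median is high
print(summarize(several_rare))  # minimum and median are both low
```

Both toy datasets have the same minimum percentage, so the minimum alone cannot tell them apart; the median is what reveals that the second dataset has more than one rare class.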