
In cases where there is a substantial difference in relative class frequencies, it could be that the density of the minority class is never higher than the density of the majority class anywhere in the attribute space. Here is a simple example using univariate Gaussian classes, with an imbalance ratio of 1:9.

[Figure: univariate Gaussian class densities with a 1:9 imbalance; the minority-class density is nowhere higher than the majority-class density.]

In this case, if my classifier assigns all patterns to the majority class, it is doing exactly the right thing, and there is no problem to solve.

Here we know the true data-generating process, so we know that the classifier is doing the right thing. However, in general we don't know the true distributions of the positive and negative classes, so we don't know whether the classifier is doing the right thing or not.

So my question is: in practical applications, how do we decide whether we have a class imbalance problem, or whether the classifier is just giving the correct answer to the question as posed?

Full disclosure: My intuition is that in most cases, especially when the data is not unduly scarce, the classifier is doing exactly what it should do and there is no class imbalance problem. I am primarily interested to hear how other practitioners and researchers diagnose class imbalance problems.

kjetil b halvorsen
Dikran Marsupial
  • I recently went down a really interesting rabbit hole in this area. I recommend this [question](https://stats.stackexchange.com/q/357466/105620) and this (somewhat unrelated) [answer](https://stats.stackexchange.com/a/312787/105620). – TravisJ Aug 09 '21 at 14:15
  • Yes, they are interesting; unfortunately they are rather over-stated, as class imbalance *can* cause a problem with the estimation of parameters, both for probabilistic classifiers and for discrete classifiers like the SVM, and proper scoring rules are no panacea. However, I would very much like to avoid the discussion being diverted along those lines, which have already been discussed elsewhere. – Dikran Marsupial Aug 09 '21 at 14:21
  • I do not see the problem, and you even admit that the model is doing exactly what it is supposed to do. Perhaps you could clarify what you see as the problem. (Do you just mean checking if the class imbalance is because our data are biased?) – Dave Aug 09 '21 at 14:31
  • @Dave In *this* instance it is doing the right thing, but it is not that difficult to construct cases where there is a non-trivial decision boundary and the classifier is biased against the positive class. I am asking how practitioners decide which is which in their application (or indeed if they do decide). A lot of the answers relating to class imbalance here are rather, err, imbalanced, one way or the other, and it is rather more of a nuanced issue than seems to be appreciated. – Dikran Marsupial Aug 09 '21 at 15:39
  • Could you give an example of what you mean by the decision boundary being non-trivial and the classifier (probability model...) being biased against the positive class? I do not follow. – Dave Aug 09 '21 at 15:45
  • Just move the red distribution to the right until it *just* peeks out from under the blue one. At that point there will be an area of the attribute space that the classifier should assign to the positive class, but the "class imbalance problem" may result in the classifier failing to do so. – Dikran Marsupial Aug 09 '21 at 15:51
  • You do that knowing the true populations. When we do not know the true populations, we have to rely on signal-to-noise ratio. Until that becomes very strong, what business do we have in going strongly against the prior probability that favors blue? If we have excellent estimates of both the red and blue distributions, we will be able to get red to peak through and be identified as more likely to be red. – Dave Aug 10 '21 at 13:49
  • @Dave, "Until that becomes very strong, what business do we have in going strongly against the prior probability that favors blue?" We want to make the best inferences we can based on the data we have. Relying on the prior probabilities (and ignoring the inputs) when we should not is what the class imbalance problem is about. That is the point of the question. If our estimates are biased, how do we detect the bias before taking steps to compensate for it? "If we have excellent estimates" - yes, *IF*; in statistics we can't just give up if we don't have enough data for *excellent* estimates. – Dikran Marsupial Aug 10 '21 at 14:35
  • This discussion is extremely interesting. In practical cases: 1) can't we just look at recall scores? 2) We could train two classifiers, one that neglects the imbalance and one that tries to cope with it, and see by how much they differ. These could provide two diagnostics to check the severity of the problem. We could then adjust our classification according to our needs. In some cases, you can simply neglect bad classification on the minority class(es); in other cases, you want to use dedicated strategies; yet in other cases, data simply are insufficient to answer the question. – luco00 Aug 15 '21 at 17:43
  • @luco00 the problem with recall scores is that sometimes accuracy *is* the thing we are really interested in, see the example here (https://stats.stackexchange.com/questions/312780/why-is-accuracy-not-the-best-measure-for-assessing-classification-models/538524#538524). I don't think just looking at recall can tell you if you are getting the threshold in the right place. – Dikran Marsupial Aug 16 '21 at 06:50
  • Also, if we train one model where we don't do anything about class imbalance and one where we do, how do we tell which one is doing the right thing (for many models the only thing that will change is the bias parameter)? – Dikran Marsupial Aug 16 '21 at 08:06
  • @DikranMarsupial I am not saying that recall alone is enough, I am saying that you can use it to diagnose the problem. The same is true for the competing classifiers. So, I think that the question you pose cannot be answered in general; you need a context. If you want to diagnose a rare, lethal disease, you WANT to identify instances from that class, and your intuition does not apply here. In this case you want your classifier to be biased toward positive examples. A very low recall score should ring a bell. – luco00 Aug 16 '21 at 12:01
  • Also, there are cases where the classes are heavily imbalanced but there are predictors which tell them apart very well (in your example, the C+ class would be shifted far to the right of the blue one). In such cases, (good) classifiers with and without mechanisms to cope with class imbalance would yield the same results. If you are in such a situation your concerns about class imbalance should be alleviated. Of course, if you face the opposite situation you wouldn't blindly go with the classifier coping with imbalance and/or with the highest recall. These are only attempts to diagnose the problem. – luco00 Aug 16 '21 at 12:07
  • @luco00 *how* can you use it to diagnose the problem? If false positives are more important than false negatives, then replace accuracy with expected loss, but the same problem crops up. It is a problem with thresholds that you can't diagnose by looking at recall (unless you can give a specific procedure). "A very low recall score should ring a bell." As I pointed out, a recall of **zero** is not evidence of a class imbalance problem as it may be the optimal solution - see the diagram. – Dikran Marsupial Aug 16 '21 at 12:15
  • If the classes are well separated then it is likely that there will be high accuracy / low expected loss and hence reason to think there is no class imbalance problem. The trouble is that in practical applications we can't visualise the data well and we don't know the true model, so we don't know what constitutes high accuracy or good recall. The univariate Gaussian examples are just there to show why even zero recall doesn't imply a class imbalance problem, but we need a diagnostic method for real data which is more complex. – Dikran Marsupial Aug 16 '21 at 12:22
  • As I said, it is not a definitive answer. You are asking how others do it in practice. I think that a researcher typically has some clue on the topic. Thus, the values you obtain on recall (which is just an example) may be a hint that your classifier does not behave the way you would like because of suspected class imbalance. Comparing two classifiers might be another hint. There are examples that fall within the one you gave where these measures are ridiculous to use and others where they can help to at least get a sense of the issue. – luco00 Aug 16 '21 at 12:51
  • Yes, this is entirely the problem. We may have suspicions/intuitions, perhaps based on test accuracy or recall, but what do we do to confirm or refute them? How do we know that (say) resampling isn't over-compensating for a class imbalance problem that doesn't actually exist? So if we compare the two classifiers, how do we know which is "better"? Even a zero recall doesn't mean you have a class imbalance problem. – Dikran Marsupial Aug 16 '21 at 12:55
  • I'll probably add an answer with my thoughts after the bounty period expires. – Dikran Marsupial Aug 16 '21 at 13:46

3 Answers


I challenge that there is a problem. Let's go with the scenario you described in the comments where your red graph is shifted to the right a bit.

[Figure: the same two class densities, with the red (minority) density shifted to the right so that it just peeks out from under the blue (majority) density.]

I will make up some (plausible) numbers and go through Bayes' theorem.

$$ P(\text{red}) = 0.2, \qquad P(\text{blue}) = 0.8 $$

$$ P(X>3\vert \text{red}) = 0.6, \qquad P(X>3\vert \text{blue}) = 0.05 $$

Now Bayes' theorem:

$$ P(\text{red}\vert X>3) = \dfrac{P(X>3\vert\text{red})P(\text{red})}{P(X>3)} $$

$$\begin{aligned}
P(X>3) &= P(X>3\cap\text{red}) + P(X>3\cap\text{red}^C) \\
&= P(X>3\cap\text{red}) + P(X>3\cap\text{blue}) \\
&= P(X>3\vert \text{red})P(\text{red}) + P(X>3\vert \text{blue})P(\text{blue}) \\
&= (0.6)(0.2) + (0.05)(0.8) = 0.16
\end{aligned}$$

Now let's put it all together in Bayes' theorem.

$$ P(\text{red}\vert X>3) = \dfrac{(0.6)(0.2)}{(0.16)} = 0.75 $$

That's a much larger probability of being red than the prior probability of $0.2$.

Varying the prior probability of being red reveals a similar story of consistently having a higher posterior probability of being red than prior probability of being red.

# Posterior probability of red via Bayes' theorem:
# prior = P(red), p_red = P(X > 3 | red), p_blue = P(X > 3 | blue)
posterior <- function(prior, p_red, p_blue) {
    prior * p_red / (prior * p_red + p_blue * (1 - prior))
}
prior <- seq(0, 1, 0.0001)
plot(prior, posterior(prior, 0.6, 0.05),
     xlab = "Prior of Red", ylab = "Posterior of Red", col = "red")
lines(prior, prior)  # diagonal reference: posterior equal to prior

The class imbalance does not overwhelm the posterior probability, and I have tried this with even smaller shifts of red to the right. A tiny shift results in a plot that is very close to the diagonal, but it still bends up a little bit.
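
For instance, re-using the `posterior` function above (the particular likelihood values here are illustrative assumptions, not taken from the original analysis), a much smaller separation between the classes still gives a curve that sits just above the diagonal:

# Illustrative only: likelihoods chosen to be nearly equal, mimicking a tiny shift of red
plot(prior, posterior(prior, 0.07, 0.05),
     xlab = "Prior of Red", ylab = "Posterior of Red", col = "red")
lines(prior, prior)  # diagonal reference: posterior equal to prior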

[Figure: posterior probability of red versus its prior probability for a small shift; the curve lies just above the diagonal.]

Dave
  • This does not address the question. In practice we do not know the true distributions. The class imbalance problem is an estimation problem - we don't know the true parameters of the distribution and have to estimate them from the data. If the error in those estimates cause a bias in the decision threshold that goes against the minority class, we have an example of the class imbalance problem. That cannot be demonstrated by an example where the true parameters are known. I want to know how people diagnose a class imbalance problem in operational use. – Dikran Marsupial Aug 11 '21 at 06:52
  • The point of my diagram was to show that we can't use all patterns being assigned to the majority class as an indication of a class imbalance problem because sometimes that *is* the optimal solution. So how do we rule out that possibility? How do we decide to use resampling or re-weighting? I suspect in a lot of cases these techniques are used simply because people are following a roadmap/recipe and are making their classifier worse by "correcting" an already optimal classifier. However, it may be that there is a diagnostic method I don't know about. – Dikran Marsupial Aug 11 '21 at 07:02
  • @DikranMarsupial I still do not see why the posterior probability of a well-calibrated model does not work for you. Do you just wonder if declaring a low probability of the minority class is due to class imbalance or inadequate signal? – Dave Aug 11 '21 at 15:21
  • the problem is that it won't be a well-calibrated model because of the difficulty in estimating the parameters. The fact that the model gives a low true positive rate is an indication that the probabilities cannot be well calibrated, at least around the decision boundary. The problem is that there are too few positive examples to adequately define the distribution of the positive class, and that results in bias. So how do we detect if that bias is present? – Dikran Marsupial Aug 11 '21 at 15:27
  • What would you consider the decision boundary? Proponents of proper scoring rules like Frank Harrell and @StephanKolassa oppose the idea of a decision boundary until late stages, preferring to get well-calibrated probabilities. – Dave Aug 11 '21 at 15:29
  • The decision boundary is set by the misclassification costs, 0.5 if they are equal and the priors representative. I gave a counter example on another thread (https://stats.stackexchange.com/questions/312780/why-is-accuracy-not-the-best-measure-for-assessing-classification-models/538524#538524) where a proper scoring rule selects the wrong model, so that is not a panacea. But as I said, the class imbalance problem (when it is actually present) means the probabilities are *not* well calibrated, so deferring the decision doesn't help (and indeed can make it worse, as I demonstrated). – Dikran Marsupial Aug 11 '21 at 16:30
  • @DikranMarsupial Your linked post is an interesting one, though I don't quite see the relevance to this question. // Why do you say that class imbalance results in poor probability calibration? – Dave Aug 13 '21 at 16:10
  • I didn't say it did. I said it showed that proper scoring rules can favour the wrong classifier and so it is not a panacea. The argument for using proper scoring rules is that we want a model with well-calibrated probabilities. It ought to be obvious that the class imbalance problem causes badly calibrated probabilities. If they were well-calibrated there would not be a bias in the position of the p = 0.5 decision boundary (or any other probability), but that bias is observable - it affects probabilistic classifiers as well. – Dikran Marsupial Aug 13 '21 at 16:17
  • That is not obvious, and playing around with `rms::calibrate` shows that the class imbalance can be quite a bit ($600:1$) without bad miscalibration. `set.seed(2021); N – Dave Aug 13 '21 at 16:23
  • Class imbalance is a problem when you don't have enough data to properly characterize the minority class. The imbalance itself isn't the problem, it is that imbalanced problems tend to be the ones with too little data for one of the classes. You are not going to see a class imbalance problem with that many examples. (BTW I don't speak R, so I am not completely sure I understand the code). Most estimation problems are solved by adding lots of data. – Dikran Marsupial Aug 13 '21 at 16:26
  • It fits a logistic regression to $200000$ points that have a class imbalance of over $600:1$, then plots the calibration curves the way Harrell does it. // This is the first time that I am understanding your question, which seems to be about lacking much data on the minority class rather than there being a class imbalance. That is not clear in your original post, and I think clarifying that point might result in more satisfactory answers to your (interesting!) question. – Dave Aug 13 '21 at 16:31
  • That *is* the class imbalance problem. I don't think *any* classifier has a problem with imbalanced classes if you throw enough data at it, the SVM certainly doesn't. The trouble with being overly prescriptive about what *I* think the problem is is that *I* might be missing something (or just be wrong). If someone can demonstrate a class imbalance problem with 200,000 points for any decent classifier, I want to hear about it! ;o) – Dikran Marsupial Aug 13 '21 at 16:38
  • BTW, I did hint at my understanding of the class imbalance problem when I wrote "My intuition is that in most cases, **especially when the data is not unduly scarce**, the classifier is doing exactly what it should do and there is no class imbalance problem." [**emphasis** mine]. IIRC there are answers to class imbalance questions by other contributors that make this point, so I don't think it is just me. – Dikran Marsupial Aug 13 '21 at 18:29
  • That's the part that I don't get: if you believe that the model is doing exactly what it should be doing by being skeptical about membership in the minority class (the correct behavior in your drawing), then what is the issue? // When I see someone describe class imbalance as a problem, I mostly think that they want to use accuracy, only to realize that naïve guessing based on the prior distribution (class ratio) gets them the right answer $99.9\%$ of the time, so their $98\%$ accuracy is not quite the $\text{A}+$ grade it first appears to be. – Dave Aug 13 '21 at 18:35
  • the point is that in practice we don't know whether the model is doing the right thing or not, which is why I wanted people to tell me how they diagnose the problem. The "get the right answer using the priors" point is a red herring. The bias parameter gets you that for free, but the training criterion won't be minimised by just doing that, so the rest of the model ought to learn the rest. Measuring the improvement over just using the priors would just be an affine transformation of the accuracy, which overcomes our cognitive bias about this (that optimisation algorithms do not share). – Dikran Marsupial Aug 16 '21 at 06:54
  • Part of the problem is how much data do you need to be confident that the model is doing the right thing? For simple logistic regression models, it isn't going to be very much, but it depends on the complexity of the classification problem (e.g. how high-dimensional it is, how complex are the manifolds on which the data exist). This means that "rules of thumb" are unlikely to be helpful, so we need a diagnostic test if we hope to reliably take action without making things worse rather than better. – Dikran Marsupial Aug 16 '21 at 07:36

I'm going to have a go at explaining why I think detecting a class imbalance problem is likely to be difficult: when we actually do have a problem, it is precisely because the data are scarce.

Consider a univariate normal pattern recognition task, with a 19:1 ratio of negative to positive examples (so that classifying everything as negative gives an accuracy of 95%), but where a decision boundary could be drawn giving an accuracy better than 95%. The ideal distributions and decision boundary are shown below:

[Figure: the ideal class distributions and the optimal decision boundary.]

The generalisation performance of the ideal classifier is as follows:

  • TPR = 0.318385
  • FNR = 0.681615
  • TNR = 0.993286
  • FPR = 0.006714
  • ERR = 0.040459
  • ACC = 0.959541

where TPR is the true positive rate, FNR is the false negative rate, TNR is the true negative rate, FPR is the false positive rate, ERR is the error rate and ACC = 1 - ERR is the accuracy.
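
The exact class parameters aren't stated above; as an assumption for this sketch, taking the negative class as N(0, 1), the positive class as N(2, 1) and priors of 0.95/0.05 reproduces these rates:

# Assumed setup (not given explicitly in the text): negative ~ N(0,1), positive ~ N(2,1), priors 19:1
mu_neg <- 0; mu_pos <- 2; sigma <- 1
prior_neg <- 0.95; prior_pos <- 0.05

# Bayes-optimal boundary: where prior_neg * dnorm(x, mu_neg) = prior_pos * dnorm(x, mu_pos)
boundary <- (mu_neg + mu_pos) / 2 + sigma^2 * log(prior_neg / prior_pos) / (mu_pos - mu_neg)

TPR <- 1 - pnorm(boundary, mu_pos, sigma)        # true positive rate
FPR <- 1 - pnorm(boundary, mu_neg, sigma)        # false positive rate
ERR <- prior_pos * (1 - TPR) + prior_neg * FPR   # expected error rate
c(boundary = boundary, TPR = TPR, FPR = FPR, ERR = ERR, ACC = 1 - ERR)
# boundary ~ 2.472, TPR ~ 0.3184, FPR ~ 0.0067, ERR ~ 0.0405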

Assume the variances of both classes are known, so we only need to estimate the class means. Unfortunately, if we have to estimate the means from only a small sample of data, we might be unlucky and end up with a model where the decision boundary is so far from the areas of high data density that we may as well classify everything as belonging to the majority negative class. This is an example of the class imbalance problem, because the uncertainty in estimating the parameters leads to a bias against the minority positive class. Here we have a model trained on 152 negative patterns and 8 positive patterns:

[Figure: the model estimated from 152 negative and 8 positive patterns; the estimated decision boundary lies far from the regions of high data density.]

I didn't have to work too hard to be unlucky; this is only the 21st seed of the random number generator I tried. The training set statistics are:

  • TPR = 0.00
  • FNR = 1.00
  • TNR = 1.00
  • FPR = 0.00
  • ERR = 0.05
  • ACC = 0.95

Clearly this is not very good; it is no better than classifying everything as negative.
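
Here is a rough sketch of how such a small-sample experiment could be coded; the plug-in boundary, the class parameters and the seed are assumptions rather than the exact code behind the figure, so the numbers won't match exactly, but an unlucky seed can push the estimated boundary far to the right of both classes and give a training-set TPR of zero:

# Sketch only (assumed details): same Gaussians as above, known unit variances,
# boundary estimated by plugging the sample means into the optimal-boundary formula.
set.seed(21)                         # an arbitrary "unlucky" seed
n_neg <- 152; n_pos <- 8
x_neg <- rnorm(n_neg, mean = 0)      # majority (negative) class sample
x_pos <- rnorm(n_pos, mean = 2)      # minority (positive) class sample

mu_neg_hat <- mean(x_neg)
mu_pos_hat <- mean(x_pos)
b_hat <- (mu_neg_hat + mu_pos_hat) / 2 +
         log(n_neg / n_pos) / (mu_pos_hat - mu_neg_hat)   # estimated decision boundary

# Training-set statistics for the plug-in boundary
TPR <- mean(x_pos > b_hat)
FPR <- mean(x_neg > b_hat)
ACC <- (sum(x_pos > b_hat) + sum(x_neg <= b_hat)) / (n_neg + n_pos)
c(boundary = b_hat, TPR = TPR, FNR = 1 - TPR, FPR = FPR, ACC = ACC)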

So let's see if we can detect this problem by using a validation set, again with 152 negative examples and 8 positive examples, in the same ratio as the training set:

  • TPR = 0.00
  • FNR = 1.00
  • TNR = 1.00
  • FPR = 0.00
  • ERR = 0.05
  • ACC = 0.95

Oh dear, the validation set suggests this is a case where no meaningful classification is possible. However, we know that is not true in this case, by construction. The problem is that, like the training set, it is only a small sample of data, and we have just been unlucky again. If we were to sample some more validation data, we might get a different result. However, if we could collect more data, we would use it for training the model; we would get better parameter estimates and the class imbalance problem would likely go away.

So my initial thought was to see if we could make a Bayesian test of whether it is plausible that there is a non-trivial decision to be made, given the training data we actually have. If we choose an improper flat prior, our posterior distributions for the class means are Gaussian, centred on the sample means, with standard deviations given by the standard errors of the means (in agreement with the frequentist confidence intervals). We can then perform a Monte Carlo simulation of, say, 2^20 samples (as they can be collected so cheaply in this case and I like round numbers), and estimate the posterior distribution of the decision boundary.
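
A sketch of that Monte Carlo check, continuing from the snippet above (again assuming known unit variances and a fixed 19:1 prior; the "TPR" for each draw is computed as if the drawn means were the true ones):

# Posterior for each class mean is Gaussian around the sample mean with sd = 1/sqrt(n)
n_mc <- 2^20
mu_neg_draw <- rnorm(n_mc, mu_neg_hat, 1 / sqrt(n_neg))
mu_pos_draw <- rnorm(n_mc, mu_pos_hat, 1 / sqrt(n_pos))

# Decision boundary and implied true positive rate for each posterior draw
b_draw   <- (mu_neg_draw + mu_pos_draw) / 2 + log(19) / (mu_pos_draw - mu_neg_draw)
tpr_draw <- 1 - pnorm(b_draw, mu_pos_draw, 1)

# hist(b_draw) and hist(tpr_draw) give the posterior distributions plotted below
mean(tpr_draw >= 0.05)   # proportion of draws with a "meaningful" TPR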

[Figure: posterior distribution of the decision boundary from the 2^20 Monte Carlo samples.]

About 79% of the 2^20 samples give a threshold that is in an area of high data density; the remaining 21% are so far to the right of both classes that essentially all patterns would be classified as negative. We can also look at the posterior distribution of the true positive rate:

[Figure: posterior distribution of the true positive rate.]

This suggests that there is some chance of a meaningful classification. Let's set an arbitrary threshold of 0.05 above which we might consider a true positive rate "meaningful". The proportion of Monte Carlo samples for which the TPR >= 0.05 is about 22.7%, so in this case we might diagnose the plausibility of a class imbalance problem.

However, what happens if we try it again, but this time for a problem where classifying everything as negative is more or less optimal:

[Figure: the ideal distributions for the second problem, where classifying everything as negative is more or less optimal.]

where the optimal model's generalisation performance is summarised by:

  • TPR = 0.007254
  • FNR = 0.992746
  • TNR = 0.999714
  • FPR = 0.000286
  • ERR = 0.049909
  • ACC = 0.950091
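
Again the class means aren't stated explicitly; as an assumption, moving the positive-class mean to 1 in the earlier sketch reproduces these rates almost exactly:

# Assumed setup for the second problem: positive class ~ N(1,1), everything else as before
mu_pos2   <- 1
boundary2 <- (0 + mu_pos2) / 2 + log(0.95 / 0.05) / mu_pos2
c(TPR = 1 - pnorm(boundary2, mu_pos2, 1), FPR = 1 - pnorm(boundary2, 0, 1))
# TPR ~ 0.0073, FPR ~ 0.00029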

Again we have to estimate the class means from a small dataset with 152 negative examples and 8 positive examples, and again we are unlucky:

[Figure: the model estimated from the small sample for the second problem.]

The training set performance is given by:

  • TPR = 0.25
  • FNR = 0.75
  • TNR = 1.00
  • FPR = 0.00
  • ERR = 0.0375
  • ACC = 0.9625

and the validation set performance by

  • TPR = 0.125
  • FNR = 0.875
  • TNR = 1.000
  • FPR = 0.000
  • ERR = 0.04375
  • ACC = 0.95625

In this case, the Monte Carlo simulation is very confident that a meaningful classification is plausible:

[Figure: posterior distribution of the true positive rate for the second problem.]

The proportion of Monte Carlo samples giving a TPR >= 0.05 is about 74.5%, even though of course we know by construction that assigning all patterns to the negative class is more or less optimal.

This shows that the Bayesian analysis can suggest that a meaningful classification is plausible, even when we have a classifier that ostensibly classifies all patterns as belonging to the negative class. In that situation, we may want to think about doing something to alleviate the problem. However, such a test can't tell us when we should be classifying everything as negative.

Anyway, that was the sort of answer I was hoping for, but I'd much prefer something that actually worked in practice! ;o) I may well offer a second bounty if someone can provide something substantially better than this.

Dikran Marsupial

Well, I think the lack of an answer that explains how to detect whether class imbalance is a problem in a particular application, even when a modest bounty of +50 reputation was on offer, suggests cause for concern about research on the topic of class imbalance. I suspect practitioners are frequently re-balancing or re-weighting the datasets simply because they are imbalanced, rather than because the imbalance is actually causing a problem. I further suspect that often this is just making matters worse by over-compensating (e.g. by fully balancing the dataset).

Class imbalance can cause a problem when there are too few examples of the minority class to adequately characterise its statistical distribution. When this happens, the decision boundary does tend to be unduly biased in favour of the majority class. However, as you add more data, the problem goes away. This shouldn't be a surprise. If you have a large enough neural network, it will be a universal approximator, able to implement essentially any (one-to-one or many-to-one) mapping between the input and output spaces. If it is fitted using a proper scoring rule then asymptotically it will output the true posterior probabilities of class membership. So if you have enough data, it doesn't matter how imbalanced the problem is; a complex enough model will learn the optimal decision surface.
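
As a quick illustration of the "more data makes it go away" point, re-using the univariate Gaussian toy problem sketched in my other answer (the sample sizes and seed here are arbitrary assumptions), the plug-in boundary settles down close to the optimal value of about 2.47 once the sample is large:

# Sketch only: N(0,1) vs N(2,1) with a 19:1 ratio, but now with plenty of data
set.seed(1)
x_neg <- rnorm(19000, mean = 0)
x_pos <- rnorm(1000,  mean = 2)
(mean(x_neg) + mean(x_pos)) / 2 + log(19) / (mean(x_pos) - mean(x_neg))  # ~ 2.47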

I think any means of detecting and dealing with class imbalance problems will be very tricky though. Essentially if there is a bias, you will want to re-sample or re-weight the training sample just the right amount to compensate for the bias due to the "imbalance". Exactly balancing the dataset is likely to way over-compensate and make accuracy (or expected loss) worse rather than better. The trouble is, if you don't have enough data to describe the minority class, where are you going to get the data to choose the optimal degree of bias? I suspect the best approach will be some Bayesian scheme that determines what the plausible true positive rate (for example) could be if the model were correct.

Essentially, I know from experience that class imbalance can cause estimation problems in a small-data setting, but I'm not convinced that there is a great deal we can do about it, because we don't have enough independent data to tune the compensation applied. I think we should be very wary of up/down sampling or re-weighting simply because there is an imbalance, and if we do, we need to be able to determine whether it has worked or not. This requires at least that we know what criterion is important for our application, and why it is important. No application is primarily interested in the true positive rate alone; if that were true, we would just assign everything to the positive class and go home satisfied with having done the optimal job! ;o)

Dikran Marsupial
  • This could be interesting reading: Wallace & Dahabreh (2012), "Class Probability Estimates are Unreliable for Imbalanced Data (and How to Fix Them)". – luco00 Aug 19 '21 at 18:14
  • Thanks @luco00, I'll give it a read. It seems a common misapprehension that only discrete classifiers like the SVM have problems with class imbalance; it isn't true, probabilistic classifiers have the same problems. – Dikran Marsupial Aug 19 '21 at 18:21
  • Ah, I think that is one I've seen before - I tried bagging down-sampled models, which was one of their recipes IIRC, but it made the results much worse. Again the problem is working out how much to compensate. – Dikran Marsupial Aug 19 '21 at 18:25
  • I think you should have a *careful* read of https://gking.harvard.edu/files/0s.pdf (at least to me it's confusing). Yes, the logistic regression MLE for small samples is biased (so not calibrated). However, the variance for small samples is much larger than the bias, so their actual recommendation (minimising MSE) is to use a biased estimator (see Figs. 6 and 7), not to remove the bias. Reweighting/up-down sampling are required for computational reasons, not for reducing bias. My take is that methods of variance reduction (such as cross-validated regularisation) are perhaps all that is required. – seanv507 Aug 30 '21 at 17:23
  • @seanv507 cheers, I've been reading that paper and some of the references today. In the experiments I have done, regularisation isn't sufficient to correct for the bias against the positive class. I've used differential sampling and weighting in the past, but only for situations where operational class frequencies are different from those in the training set. I'm not sure class imbalance problems can be fixed that way because there is no way to determine how much to up-sample or down-sample without over-compensating. It could be that it is just not something that can be reliably fixed. – Dikran Marsupial Aug 30 '21 at 17:34