
We have already had multiple questions about unbalanced data when using logistic regression, SVM, decision trees, bagging, and a number of other methods, which makes it a very popular topic! Unfortunately, each of those questions seems to be algorithm-specific, and I haven't found any general guidelines for dealing with unbalanced data.

Quoting one of the answers by Marc Claesen, dealing with unbalanced data

(...) heavily depends on the learning method. Most general purpose approaches have one (or several) ways to deal with this.

But when exactly should we worry about unbalanced data? Which algorithms are most affected by it, and which are able to deal with it? Which algorithms would need us to balance the data? I am aware that discussing each algorithm would be impossible on a Q&A site like this; I am rather looking for general guidelines on when it could be a problem.

Tim
  • Possible duplicate of [What is the root cause of the class imbalance problem?](https://stats.stackexchange.com/questions/247871/what-is-the-root-cause-of-the-class-imbalance-problem) – Matthew Drury Jun 02 '17 at 14:04
  • @MatthewDrury thanks, this is an interesting question, but IMHO it has a different scope. What I'm asking for is guidelines on when this is really a problem. Surely answering the *why* question leads to answering the *when* question, but I'm looking for a precise answer to the *when* question. – Tim Jun 02 '17 at 14:12
  • Fair enough! I'm with you. The "literature" on this seems to be all about how to fix a problem, without bothering to convince you that there is in fact a problem to be solved, or even telling you in what situations a problem occurs or not. One of the most frustrating parts of the subject for me. – Matthew Drury Jun 02 '17 at 14:39
  • @MatthewDrury that is *exactly* the problem! – Tim Jun 02 '17 at 14:43
  • A total survey of methods is not within the scope of an SE question. Do you want to refine the question? – AdamO Jun 07 '17 at 16:02
  • @AdamO I am asking about general guidelines rather than a total survey. – Tim Jun 07 '17 at 16:06
  • @Tim you specifically ask: "Which algorithms are mostly affected by it and which are able to deal with it? Which algorithms would need us to balance the data?" Can you rephrase this then? – AdamO Jun 07 '17 at 16:10
  • https://stats.stackexchange.com/questions/6067/does-an-unbalanced-sample-matter-when-doing-logistic-regression – kjetil b halvorsen Jul 01 '18 at 20:11

8 Answers


Not a direct answer, but it's worth noting that in the statistical literature, some of the prejudice against unbalanced data has historical roots.

Many classical models simplify neatly under the assumption of balanced data, especially for methods like ANOVA that are closely related to experimental design—a traditional / original motivation for developing statistical methods.

But the statistical / probabilistic arithmetic gets quite ugly, quite quickly, with unbalanced data. Prior to the widespread adoption of computers, the by-hand calculations were so extensive that estimating models on unbalanced data was practically impossible.

Of course, computers have basically rendered this a non-issue. Likewise, we can now estimate models on massive datasets, solve high-dimensional optimization problems, and draw samples from analytically intractable joint probability distributions, all of which were functionally impossible even fifty years ago.

It's an old problem, and academics sank a lot of time into it... meanwhile, many applied problems outpaced / obviated that research, but old habits die hard...

Edit to add:

I realize I didn't come out and just say it: there isn't a low level problem with using unbalanced data. In my experience, the advice to "avoid unbalanced data" is either algorithm-specific, or inherited wisdom. I agree with AdamO that in general, unbalanced data poses no conceptual problem to a well-specified model.

Henry
  • While I seem to get your point, your premises lack arguments backing them. Could you give some arguments and/or examples of the prejudice and of how it affected machine learning? – Tim Jun 07 '17 at 07:18
  • While what you say is mostly true, it *is* also the case that methods like ANOVA are more robust with balanced data; nonnormality is less of an issue with balanced data, for example. But I believe all this is orthogonal to the intent of this question ... – kjetil b halvorsen Jun 07 '17 at 15:47
  • I realize I didn't come out and just say it: there _isn't_ a low level problem with using unbalanced data. In my experience, the advice to "avoid unbalanced data" is either algorithm-specific, or inherited wisdom. I agree with AdamO that in general, unbalanced data poses no conceptual problem to a well-specified model. – Henry Jun 08 '17 at 04:59
  • @M.HenryL. this comment is worth adding to your answer for completeness. – Tim Jun 13 '17 at 10:38
  • This doesn't seem to answer the question... the question is about the so-called class imbalance problem (i.e., lack of balance in a categorical ***dependent variable***), whereas this answer is about lack of balance in predictors and/or grouping/clustering factors (as in ANOVA and mixed models). They are almost completely different issues – Jake Westfall Jun 27 '20 at 22:52
  • @Henry "unbalanced data poses no conceptual problem to a well-specified model." This is correct, but isn't model misspecification present in practically all cases? And if it *is* present, then data imbalance *does* pose a problem and solutions to this problem have been worked out nicely, cf. my answer below. – jhin Jul 11 '20 at 15:25

Whether unbalanced data is a problem depends on your application. If, for example, your data indicate that A happens 99.99% of the time and B happens 0.01% of the time, and you try to predict a certain result, your algorithm will probably always say A. This is of course correct! It is unlikely for your method to get better prediction accuracy than 99.99%. However, in many applications we are not interested just in the correctness of the prediction but also in why B happens sometimes. This is where unbalanced data becomes a problem, because it is hard to convince your method that it can predict better than 99.99% correct. The method is correct, but not for your question. So "solving" unbalanced data basically means intentionally biasing your data to get interesting results instead of accurate results. All methods are vulnerable, although SVM and logistic regression tend to be a little less vulnerable, while decision trees are very vulnerable.

In general there are three cases:

  1. You are purely interested in accurate prediction and you think your data is representative. In this case you do not have to correct at all. Bask in the glory of your 99.99% accurate predictions :).

  2. You are interested in prediction, and your data come from a fair sample, but somehow you lost a number of observations. If you lost observations in a completely random way, you're still fine. If you lost them in a biased way but you don't know how biased, you will need new data. However, if these observations were lost only on the basis of one characteristic (for example, you sorted results into A and B but not in any other way, and lost half of B), you can bootstrap your data.

  3. You are not interested in accurate global prediction, but only in a rare case. In this case you can inflate the data of that case by bootstrapping it or, if you have enough data, by throwing away data of the other cases. Notice that this biases your data and results, so estimated probabilities and similar quantities will be wrong!

In general, it mostly depends on what the goal is. Some goals suffer from unbalanced data, others don't. All general prediction methods suffer from it, because otherwise they would give terrible results in general.
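
A minimal sketch of case 1 above, on made-up data with a roughly 99.9 : 0.1 split: a classifier that always predicts A already reaches about 99.9% accuracy, so accuracy alone says nothing about whether the rare class B has been learned. Data, numbers, and models here are purely illustrative.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
n = 100_000
y = (rng.random(n) < 0.001).astype(int)          # ~0.1% of samples belong to class B
X = rng.normal(size=(n, 2)) + y[:, None] * 0.5   # weak signal separating B from A

majority = DummyClassifier(strategy="most_frequent").fit(X, y)  # always predicts A
logit = LogisticRegression().fit(X, y)

for name, model in [("always-predict-A", majority), ("logistic regression", logit)]:
    pred = model.predict(X)
    print(name,
          "| accuracy:", round(accuracy_score(y, pred), 4),
          "| recall on B:", round(recall_score(y, pred), 4))
```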

zen
  • How does this story change when we evaluate our models probabilistically? – Matthew Drury Jun 06 '17 at 14:25
  • @MatthewDrury The probabilities from the original model are mostly correct for cases 1 and 3. The issue is that only with very large datasets does B become correctly separable from A, and the probability of B slowly converges to its real value. The exception is that if B is very clearly separated from A, or completely randomly separated from A, the probabilities will respectively converge almost immediately or never. – zen Jun 06 '17 at 14:36
  • @zen I rather disagree that logistic regression is less vulnerable. Logistic regression is quite vulnerable to data imbalance: it creates small-sample bias, and the log odds ratios tend toward a factor of 2. Conditional logistic regression is an alternative for estimating the same ORs without bias. – AdamO Jun 07 '17 at 16:11
  • @AdamO Yes, logistic regression is still vulnerable, but for trees small cases can be completely ignored. It is not just small sample size either: even for large n, and for example an 80%-20% distribution between options, trees can still opt for the 80% option even if the fraction of the 20% option clearly increases with some variable x. If a new, more extreme observation is found, or if the number of branches is too low for any extreme point, the tree will predict the 80% option, while logistic regression will be less likely to do so. You are right about conditional logistic regression. – zen Jun 07 '17 at 21:54
  • @zen it doesn't matter if it's 99.99% to 0.01% if the overall n is 1 billion. You have 100,000 of the "rare" group, which is adequate. It's not the proportion. It's the minimum count. – AdamO Jun 07 '17 at 22:30
  • @AdamO The minimum count does matter and too little data is a problem, but so are the proportion and the parameterization of decision trees. If we have 100,000 data points but they are eclipsed in all the cases that the tree can discern, the tree will tell you nothing about that option, whereas logistic regression can at least tell you whether the odds of that case are increasing. Furthermore, if the tree has too few branches it can give very uninformative results even when there is information to be had. For continuous variables the number of branches is a decision, and can lead to overly broad categories. – zen Jun 08 '17 at 08:13
  • @AdamO You are of course correct that logistic regression is still vulnerable; I would just argue that it is less so. – zen Jun 08 '17 at 08:16
  • @MatthewDrury Stephen Senn has an excellent discussion about this point [in a paper I reread often](http://people.musc.edu/~elg26/teaching/statcomputing.2013/Lectures/Lecture27.LatexPapers/Senn.7Myths_randomization.pdf). Heuristically, the odds ratio from a 2x2 table with entries a b c d is estimated by ad/(bc), and its logarithm has approximate variance 1/a+1/b+1/c+1/d. You can sample arbitrarily few cases (a and c) and the odds ratio is still unbiased, but the variance goes to infinity. It is a precision issue. – AdamO Jun 08 '17 at 15:07
  • @AdamO Sorry for the late response. "Eclipsed" means they are in the minority in each (non-singleton) leaf the tree can make. This could happen, for example, if two similar normal distributions (with slightly different means) are generated and we draw random variables from both, but we draw from one 999 times out of 1000 draws and from the other once out of 1000. Then it is likely that the tree will never predict the second distribution and will give you no information about it, even if we draw 100000000 times. – zen Jun 13 '17 at 14:28
  • @zen OK, I take it that's terminology you use to illustrate a point rather than a formal definition: basically low prevalence. We use risk models for low-prevalence outcomes all the time, like cancer. If the prevalence of an outcome is 1 in 10,000, the tree would predict such an outcome 1 out of every 10,000 times, which is exactly what is desired. It would require many, many observations to achieve a more reliable prediction, but my answer addresses this as a power issue. – AdamO Jun 13 '17 at 15:51
  • This presumes implicitly (1) that the KPI we attempt to maximize is accuracy, and (2) that accuracy is an appropriate KPI for classification model evaluation. [It isn't.](https://stats.stackexchange.com/q/312780/1352) – Stephan Kolassa Jul 17 '18 at 05:58
  • @AdamO, isn't the bias spoken about wrt logistic regression in the coefficients/odds ratio etc., not the *probability estimates*, which is presumably what is being discussed here wrt classifier performance? – seanv507 Jan 22 '19 at 18:18

WLOG you can focus on imbalance in a single factor, rather than a more nuanced concept of "data sparsity", or small cell counts.

In statistical analyses not focused on learning, we are faced with the issue of providing adequate inference while controlling for one or more effects through adjustment, matching, or weighting. All of these approaches have similar power to, and yield similar estimates as, propensity score matching; propensity score matching will balance the covariates in the analysis set. They all end up being "the same" in terms of reducing bias and maintaining efficiency, because they block confounding effects. With imbalanced data, you may naively believe that your data are sufficiently large, but with a sparse number of people having the rarer condition, variance inflation diminishes power substantially, and it can be difficult to "control" for effects when those effects are strongly associated with the predictor and the outcome.

Therefore, at least in regression (but I suspect in all circumstances), the only problem with imbalanced data is that you effectively have a smaller sample size than the $N$ might suggest. If any method is suitable for the number of people in the rarer class, there should be no issue if the class proportions are imbalanced.
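
To make the "effective sample size" point concrete, here is a small numeric sketch using the standard 2x2-table heuristic (also mentioned in a comment above): the log odds ratio from a table with cells a, b, c, d has approximate variance 1/a + 1/b + 1/c + 1/d, so precision is driven by the smallest cells, not by the overall proportion. The cell counts below are invented for illustration.

```python
import math

def se_log_or(a, b, c, d):
    """Approximate standard error of the log odds ratio from a 2x2 table."""
    return math.sqrt(1/a + 1/b + 1/c + 1/d)

# 10 rare events per arm, 1:100 imbalance
print(se_log_or(10, 1_000, 10, 1_000))       # ~0.45
# same proportion, 100x more data in the common cells only
print(se_log_or(10, 100_000, 10, 100_000))   # still ~0.45: almost no gain in precision
# 10x more rare events, same 1:100 imbalance
print(se_log_or(100, 10_000, 100, 10_000))   # ~0.14: precision is driven by the rare counts
```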

AdamO

Let's assume we have two classes:

  • A, representing 99.99% of the population
  • B, representing 0.01% of the population

Let's assume we are interested in identifying class B elements, which could be individuals affected by a rare disease, or fraudsters.

Just by always guessing A, learners would score well on their loss functions, and the very few incorrectly classified elements might not move the needle numerically (a needle in a haystack, in this case). This example gives the intuition behind one of the "tricks" to mitigate the class imbalance problem: tweaking the cost function.
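
As a hedged illustration of the cost-function trick (not something prescribed in this answer): most learners let you reweight the classes in the loss, e.g. scikit-learn's class_weight="balanced", which upweights the rare class by the inverse of its frequency. The data below are synthetic placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

rng = np.random.default_rng(1)
n = 50_000
y = (rng.random(n) < 0.01).astype(int)           # ~1% class B
X = rng.normal(size=(n, 3)) + y[:, None] * 0.8   # modest signal for B

plain = LogisticRegression().fit(X, y)
weighted = LogisticRegression(class_weight="balanced").fit(X, y)  # upweight B in the loss

print("recall on B, unweighted:", round(recall_score(y, plain.predict(X)), 3))
print("recall on B, weighted:  ", round(recall_score(y, weighted.predict(X)), 3))
```

The weighted model typically recovers more of class B, at the price of more false alarms on class A; whether that trade-off is worth it depends on the application.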

I feel that unbalanced data is a problem when models show near-zero sensitivity and near-one specificity. See the example in this article under the section "ignoring the problem".

Problems often have a solution. Alongside the aforementioned trick, there are other options. However, they come at a price: an increase in model and computational complexity.

The question asks which models are more likely to settle on near-zero sensitivity and near-one specificity. I feel that it depends on a few dimensions:

  • Model capacity: models with less capacity struggle more, as usual.
  • The cost function: some cost functions might struggle more than others. Mean squared error (MSE) is less exposed than the Huber loss, since MSE is less benign towards incorrectly classified class-B elements.
IcannotFixThis
  • This presumes implicitly (1) that the KPI we attempt to maximize is accuracy, and (2) that accuracy is an appropriate KPI for classification model evaluation. [It isn't.](https://stats.stackexchange.com/q/312780/1352) – Stephan Kolassa Jul 17 '18 at 05:51
  • @StephanKolassa It looks like even if the KPI is not accuracy, say logistic regression's loss function is cross-entropy, then given 100 negative vs 1 positive imbalanced data, it is still likely that the model might not be able to tune the weights based on so few positive examples, hence the trained model might not give reasonable predictions for positive ones. – avocado Mar 15 '21 at 11:56
  • @avocado: both [cross entropy](https://stats.stackexchange.com/q/493912/1352) and the more common log likelihood are proper scoring rules, so a logistic regression should indeed be able to find the correct predictions. Why do you think differently? Incidentally, why do you write "reasonable prediction for *positive* ones"? Why would a probabilistic classifier output "unreasonable" predictions for only positive classes? Yes, of course the predicted probabilities will be below one, but that reflects the ex post selection, i.e., a kind of bias. – Stephan Kolassa Mar 15 '21 at 12:41
  • @StephanKolassa I think you're right that cross-entropy should allow logistic regression to learn and give correct predictions. But the unanswered question is why down-sampling negative examples is widely used in industry, e.g. in CTR prediction. – avocado Mar 15 '21 at 13:23
  • @avocado: why over-/undersampling is popular is indeed a puzzling question. Some of the comments to [my question here](https://stats.stackexchange.com/q/357466/1352) go into this; e.g., [Sycorax' comment](https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he#comment672220_357466). – Stephan Kolassa Mar 15 '21 at 13:28

If you think about it: on a perfectly separable, highly imbalanced data set, almost any algorithm will perform without errors.

Hence, it is more a problem of noise in data and less tied to a particular algorithm. And you don't know beforehand which algorithm compensates for one particular type of noise best.

In the end you just have to try different methods and decide by cross validation.
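
A minimal sketch of that suggestion, on synthetic placeholder data: compare a couple of candidate methods by cross-validation on an imbalance-aware criterion (average precision here, but pick whatever matches your goal) instead of assuming in advance which algorithm copes best.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced toy data (~2% positives).
X, y = make_classification(n_samples=20_000, n_features=10, weights=[0.98, 0.02], random_state=0)

candidates = [("logistic regression", LogisticRegression(max_iter=1000)),
              ("random forest", RandomForestClassifier(n_estimators=200, random_state=0))]

for name, model in candidates:
    # 5-fold cross-validated average precision as the comparison criterion.
    scores = cross_val_score(model, X, y, cv=5, scoring="average_precision")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```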

Gerenuk
  • I feel this comment is a bit under-appreciated. I just spent a bit of time convincing someone that class imbalance is not _always_ a problem. – RDK May 25 '18 at 20:51
  • This does not answer the question. *How* are unbalanced classes "more a problem of noise in data"? – Stephan Kolassa Jul 17 '18 at 05:46
  • @StephanKolassa It is an answer, because it says unbalanced data is _not_ (directly) a problem. Hence you cannot ask "how" it is. For the more general question "how to deal with noise problems in data analysis", the answer is that it is specific to individual data sets, and all you can do is set up validation and try whatever works. If you really would like some discussion, I believe http://www.ele.uri.edu/faculty/he/PDFfiles/ImbalancedLearning.pdf has ideas. But in the end you would do sampling/reweighting/thresholding, and it's not worth knowing what exactly happened in this data set. – Gerenuk Jul 17 '18 at 12:02

I know I'm late to the party, but: the theory behind the data imbalance problem has been beautifully worked out by Sugiyama (2000) and a huge number of highly cited papers following that, under the keyword "covariate shift adaptation". There is also a whole book devoted to this subject by Sugiyama / Kawanabe from 2012, called "Machine Learning in Non-Stationary Environments". For some reason, this branch of research is only rarely mentioned in discussions about learning from imbalanced datasets, possibly because people are unaware of it?

The gist of it is this: data imbalance is a problem if a) your model is misspecified, and b) you're either interested in good performance on a minority class or you're interested in the model itself.

The reason can be illustrated very simply: if the model does not describe reality correctly, it will minimize the deviation from the most frequently observed type of samples (figure taken from Berk et al. (2018)).

I will try to give a very brief summary of the technical main idea of Sugiyama. Suppose your training data are drawn from a distribution $p_{\mathrm{train}}(x)$, but you would like the model to perform well on data drawn from another distribution $p_{\mathrm{target}}(x)$. This is what's called "covariate shift", and it can also simply mean that you would like the model to work equally well on all regions of the data space, i.e. $p_{\mathrm{target}}(x)$ may be a uniform distribution. Then, instead of minimizing the expected loss over the training distribution

$$ \theta^* = \arg \min_\theta E[\ell(x, \theta)]_{p_{\text{train}}} \approx \arg \min_\theta \frac{1}{N}\sum_{i=1}^N \ell(x_i, \theta)$$

as one would usually do, one minimizes the expected loss over the target distribution:

$$ \theta^* = \arg \min_\theta E[\ell(x, \theta)]_{p_{\text{target}}} \\ = \arg \min_\theta E\left[\frac{p_{\text{target}}(x)}{p_{\text{train}}(x)}\ell(x, \theta)\right]_{p_{\text{train}}} \\ \approx \arg \min_\theta \frac{1}{N}\sum_{i=1}^N \underbrace{\frac{p_{\text{target}}(x_i)}{p_{\text{train}}(x_i)}}_{=w_i} \ell(x_i, \theta)$$

In practice, this amounts to simply weighting individual samples by their importance $w_i$. The key to practically implementing this is an efficient method for estimating the importance, which is generally nontrivial. This is one of the main topics of papers on this subject, and many methods can be found in the literature (keyword "Direct importance estimation").
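
As a hedged sketch (my own simplification, not taken from the papers cited above): when the shift is purely a shift in class frequencies, the importance weights reduce to the ratio of target to training class priors, and most learners accept them as sample weights. Direct importance estimation for general covariate shift is harder and not shown here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 20_000
y = (rng.random(n) < 0.02).astype(int)        # training distribution: roughly 98% A, 2% B
X = rng.normal(size=(n, 2)) + y[:, None]      # toy features

p_train = np.bincount(y, minlength=2) / n     # empirical class frequencies in the training data
p_target = np.array([0.5, 0.5])               # the distribution we actually care about (here: uniform)
w = (p_target / p_train)[y]                   # importance weight w_i for each sample

model = LogisticRegression()
model.fit(X, y, sample_weight=w)              # minimize the importance-weighted empirical loss
```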

All the oversampling / undersampling / SMOTE techniques people use are essentially just different hacks for implementing importance weighting, I believe.

jhin
  • Regarding "The reason can be illustrated very simply: if the model does not describe reality correctly, it will minimize the deviation from the most frequently observed type of samples": you mean the baseline estimator, but what about when the real distribution is 90% to 10% in a binary classifier? – Max Kleiner Mar 24 '21 at 13:15
  • @MaxKleiner I'm not sure I understand your question correctly, could you elaborate a bit? – jhin Mar 31 '21 at 16:20
  • On one side we have an actual distribution of 90 to 10. On the other side we downsample the unbalanced data (to 50 to 50) to prevent the trivial accuracy of the baseline estimator, which always has an accuracy of 90%. Q: is downsampling (rebalancing the unbalanced data) needed or not? – Max Kleiner Apr 06 '21 at 15:25
  • @MaxKleiner If the estimator should perform equally well across both groups (i.e the target distribution has a 50/50 ratio) then you should downsample. If you want to maximize accuracy on a 90/10 distribution then you should not, but then a corollary is that accuracy will mostly be optimized for the 90% group. I suspect in most examples one would want to downsample (or weight, or use SMOTE, etc.). – jhin May 06 '21 at 08:07
  • @MaxKleiner You're touching on an important point there. The covariate shift literature often makes it sound as if p_target were the actual real-world distribution you want to apply the estimator to. But p_target should be tailored to what you really want, e.g., putting equal weight on different groups or regions of the data space. – jhin May 06 '21 at 08:09
  • Reweighting to handle a different target distribution is well understood. The main use case is reducing cost by undersampling the majority class for training data. – seanv507 Jun 18 '21 at 06:48

Great answers above, and I'm not sure how much I can add here, but I feel there are three things to consider with imbalanced data, plus new trade-offs you'll face when rebalancing. I'd like to frame this in the context of predicting a minority outcome (a common task with imbalanced classes):

  1. By resampling, you may improve overall accuracy, but with severe class imbalance it is usually the case that you are actually trying to predict or otherwise describe features of the minority class. The best evaluation metrics here would then be F1 scores, precision, recall and the like (see the sketch after this list). The resampling process (whether by SMOTE, undersampling the majority class, etc.) disrupts the naturally occurring distribution of your data, and training performed on these artificially created classes will usually perform poorly when applied back to the natural distribution. This creates sampling bias, in a sense. In my own work, I've found that a random forest classifier does a somewhat better job than logistic regression, although it's pretty "data hungry" and requires enough minority samples (see point #2). It really depends on the question you're trying to answer and the types of data you have available.

  2. It's quite possible you may already have enough of the minority class to make useful predictions. Consider a class imbalance of 100:1. Would 1,000 majority samples and 10 minority samples make for a useful classifier? Of course not. But what about 1,000,000 majority and 10,000 minority samples? The model you select may then see enough of the minority outcome to make useful predictions. It's the minority count, not the relative proportion to the majority class, that is ultimately important.

  3. A more general point is that we've become obsessed with correcting class imbalances, as if this were a central problem. Left by the wayside are the far more important tasks of feature engineering and proper model selection that are necessary for predicting minority outcomes.
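
A brief sketch of the evaluation point in item 1, on synthetic stand-in data: report per-class precision, recall, and F1 on a test set that keeps the natural class ratio, rather than relying on overall accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with ~3% minority class.
X, y = make_classification(n_samples=30_000, weights=[0.97, 0.03], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Per-class precision, recall and F1 on a test set with the natural class ratio.
print(classification_report(y_te, clf.predict(X_te), digits=3))
```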

I wrote a piece about this here: Why Balancing Classes is Over-Hyped

  • Great article @Gabe Verzino. I found the pitfalls you mentioned to be spot on for the classification problem I was working on. My data has an imbalance of 4:1, and balancing the data affected the performance when the model was supplied with real-world data. I had a fair amount of data, 400k samples for the majority class and 100k for the minority class. For my use case, adding more data was better for generalization than balancing the data. – Raqib Sep 22 '21 at 18:06
  • That's awesome! Nice work. – Gabe Verzino Sep 22 '21 at 21:03

For me, the most important issue with unbalanced data is the baseline estimator. For example, suppose you have two classes with a 90% / 10% sample distribution. What does this mean for a dummy or naive classifier? You can answer that by comparing your model with a baseline's performance: you can always predict the most frequent label in the training set, so a useful model has to do better than 90% (this is the baseline)!

Typical baselines include those supported by scikit-learn's "dummy" estimators (a short sketch follows the list below):

Classification baselines:

  • "stratified": generates predictions by respecting the training set’s class distribution.
  • "most_frequent": always predicts the most frequent label in the training set.
  • "prior": always predicts the class that maximizes the class prior.
Max Kleiner