Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?

Question

TL;DR

See title.

Motivation

I am hoping for a canonical answer along the lines of "(1) No, (2) Not applicable, because (1)", which we can use to close many wrong questions about unbalanced datasets and oversampling. I would be quite as happy to be proven wrong in my preconceptions. Fabulous Bounties await the intrepid answerer.

My argument

I am baffled by the many questions we get in the unbalanced-classes tag. Unbalanced classes seem to be self-evidently bad. And oversampling the minority class(es) is quite as self-evidently seen as helping to address the self-evident problems. Many questions that carry both tags proceed to ask how to perform oversampling in some specific situation.

I understand neither what problem unbalanced classes pose, nor how oversampling is supposed to address these problems.

In my opinion, unbalanced data do not pose a problem at all. One should model class membership probabilities, and these may be small. As long as they are correct, there is no problem. One should, of course, not use accuracy as a KPI to be maximized in a classification problem. Or calculate classification thresholds. Instead, one should assess the quality of the entire predictive distribution using proper scoring-rules. Tetlock's Superforecasting serves as a wonderful and very readable introduction to predicting unbalanced classes, even if this is nowhere explicitly mentioned in the book.

What problem does oversampling, undersampling, and SMOTE solve? IMO, this question does not have a satisfactory answer. (Per my suspicion, this may be because there is no problem.)
When is unbalanced data really a problem in Machine Learning? The consensus appears to be "it isn't". I'll probably vote to close this question as a duplicate of that one.

IcannotFixThis' answer, seems to presume (1) that the KPI we attempt to maximize is accuracy, and (2) that accuracy is an appropriate KPI for classification model evaluation. It isn't. This may be one key to the entire discussion.

AdamO's answer focuses on the low precision of estimates from unbalanced factors. This is of course a valid concern and probably the answer to my titular question. But oversampling does not help here, any more than we can get more precise estimates in any run-of-the-mill regression by simply duplicating each observation ten times.

What is the root cause of the class imbalance problem? Some of the comments here echo my suspicion that there is no problem. The single answer again implicitly presumes that we use accuracy as a KPI, which I find unsatisfactory.
Are there Imbalanced learning problems where re-balancing/re-weighting demonstrably improves accuracy? is related, but presupposes accuracy as an evaluation measure. (Which I argue is not a good choice.)

Summary

The threads above can apparently be summarized as follows.

Rare classes (both in the outcome and in predictors) are a problem, because parameter estimates and predictions have high variance/low precision. This cannot be addressed through oversampling. (In the sense that it is always better to get more data that is representative of the population, and selective sampling will induce bias per my and others' simulations.)
Rare classes are a "problem" if we assess our model by accuracy. But accuracy is not a good measure for assessing classification models. (I did think about including accuracy in my simulations, but then I would have needed to set a classification threshold, which is a closely related wrong question, and the question is long enough as it is.)

An example

Let's simulate for an illustration. Specifically, we will simulate ten predictors, only a single one of which actually has an impact on a rare outcome. We will look at two algorithms that can be used for probabilistic classification: logistic-regression and random-forests.

In each case, we will apply the model either to the full dataset, or to an oversampled balanced one, which contains all the instances of the rare class and the same number of samples from the majority class (so the oversampled dataset is smaller than the full dataset).

For the logistic regression, we will assess whether each model actually recovers the original coefficients used to generate the data. In addition, for both methods, we will calculate probabilistic class membership predictions and assess these on holdout data generated using the same data generating process as the original training data. Whether the predictions actually match the outcomes will be assessed using the Brier score, one of the most common proper scoring rules.

We will run 100 simulations. (Cranking this up only makes the beanplots more cramped and makes the simulation run longer than one cup of coffee.) Each simulation contains $n=10,000$ samples. The predictors form a $10,000\times 10$ matrix with entries uniformly distributed in $[0,1]$. Only the first predictor actually has an impact; the true DGP is

$$ \text{logit}(p_i) = -7+5x_{i1}. $$

This makes for simulated incidences for the minority TRUE class between 2 and 3%:

Let's run the simulations. Feeding the full dataset into a logistic regression, we (unsurprisingly) get unbiased parameter estimates (the true parameter values are indicated by the red diamonds):

However, if we feed the oversampled dataset to the logistic regression, the intercept parameter is heavily biased:

Let's compare the Brier scores between models fitted to the "raw" and the oversampled datasets, for both the logistic regression and the Random Forest. Remember that smaller is better:

In each case, the predictive distributions derived from the full dataset are much better than those derived from an oversampled one.

I conclude that unbalanced classes are not a problem, and that oversampling does not alleviate this non-problem, but gratuitously introduces bias and worse predictions.

Where is my error?

A caveat

I'll happily concede that oversampling has one application: if

we are dealing with a rare outcome, and
assessing the outcome is easy or cheap, but
assessing the predictors is hard or expensive

A prime example would be genome-wide association studies (GWAS) of rare diseases. Testing whether one suffers from a particular disease can be far easier than genotyping their blood. (I have been involved with a few GWAS of PTSD.) If budgets are limited, it may make sense to screen based on the outcome and ensure that there are "enough" of the rarer cases in the sample.

However, then one needs to balance the monetary savings against the losses illustrated above - and my point is that the questions on unbalanced datasets at CV do not mention such a tradeoff, but treat unbalanced classes as a self-evident evil, completely apart from any costs of sample collection.

R code

    library(randomForest)
    library(beanplot)
    
    nn_train <- nn_test <- 1e4
    n_sims <- 1e2
    
    true_coefficients <- c(-7, 5, rep(0, 9))
    
    incidence_train <- rep(NA, n_sims)
    model_logistic_coefficients <- 
         model_logistic_oversampled_coefficients <- 
         matrix(NA, nrow=n_sims, ncol=length(true_coefficients))
    
    brier_score_logistic <- brier_score_logistic_oversampled <- 
      brier_score_randomForest <- 
    brier_score_randomForest_oversampled <- 
      rep(NA, n_sims)
    
    pb <- winProgressBar(max=n_sims)
    for ( ii in 1:n_sims ) {
        setWinProgressBar(pb,ii,paste(ii,"of",n_sims))
        set.seed(ii)
        while ( TRUE ) {    # make sure we even have the minority 
                            # class
            predictors_train <- matrix(
              runif(nn_train*(length(true_coefficients) - 1)), 
                  nrow=nn_train)
            logit_train <- 
             cbind(1, predictors_train)%*%true_coefficients
            probability_train <- 1/(1+exp(-logit_train))
            outcome_train <- factor(runif(nn_train) <= 
                     probability_train)
            if ( sum(incidence_train[ii] <- 
               sum(outcome_train==TRUE))>0 ) break
        }
        dataset_train <- data.frame(outcome=outcome_train, 
                          predictors_train)
        
        index <- c(which(outcome_train==TRUE),  
          sample(which(outcome_train==FALSE),   
                sum(outcome_train==TRUE)))
        
        model_logistic <- glm(outcome~., dataset_train, 
                    family="binomial")
        model_logistic_oversampled <- glm(outcome~., 
              dataset_train[index, ], family="binomial")
        
        model_logistic_coefficients[ii, ] <- 
               coefficients(model_logistic)
        model_logistic_oversampled_coefficients[ii, ] <- 
          coefficients(model_logistic_oversampled)
        
        model_randomForest <- randomForest(outcome~., dataset_train)
        model_randomForest_oversampled <- 
          randomForest(outcome~., dataset_train, subset=index)
        
        predictors_test <- matrix(runif(nn_test * 
            (length(true_coefficients) - 1)), nrow=nn_test)
        logit_test <- cbind(1, predictors_test)%*%true_coefficients
        probability_test <- 1/(1+exp(-logit_test))
        outcome_test <- factor(runif(nn_test)<=probability_test)
        dataset_test <- data.frame(outcome=outcome_test, 
                         predictors_test)
    
        prediction_logistic <- predict(model_logistic, dataset_test, 
                                        type="response")
        brier_score_logistic[ii] <- mean((prediction_logistic - 
               (outcome_test==TRUE))^2)
    
        prediction_logistic_oversampled <-      
               predict(model_logistic_oversampled, dataset_test, 
                        type="response")
        brier_score_logistic_oversampled[ii] <- 
          mean((prediction_logistic_oversampled - 
                (outcome_test==TRUE))^2)
        
        prediction_randomForest <- predict(model_randomForest, 
            dataset_test, type="prob")
        brier_score_randomForest[ii] <-
          mean((prediction_randomForest[,2]-(outcome_test==TRUE))^2)
    
        prediction_randomForest_oversampled <-   
                         predict(model_randomForest_oversampled, 
                                  dataset_test, type="prob")
        brier_score_randomForest_oversampled[ii] <- 
          mean((prediction_randomForest_oversampled[, 2] - 
                (outcome_test==TRUE))^2)
    }
    close(pb)
    
    hist(incidence_train, breaks=seq(min(incidence_train)-.5, 
            max(incidence_train) + .5),
      col="lightgray",
      main=paste("Minority class incidence out of", 
                    nn_train,"training samples"), xlab="")
    
    ylim <- range(c(model_logistic_coefficients, 
                   model_logistic_oversampled_coefficients))
    beanplot(data.frame(model_logistic_coefficients),
      what=c(0,1,0,0), col="lightgray", xaxt="n", ylim=ylim,
      main="Logistic regression: estimated coefficients")
    axis(1, at=seq_along(true_coefficients),
      c("Intercept", paste("Predictor", 1:(length(true_coefficients) 
             - 1))), las=3)
    points(true_coefficients, pch=23, bg="red")
    
    beanplot(data.frame(model_logistic_oversampled_coefficients),
      what=c(0, 1, 0, 0), col="lightgray", xaxt="n", ylim=ylim,
      main="Logistic regression (oversampled): estimated 
              coefficients")
    axis(1, at=seq_along(true_coefficients),
      c("Intercept", paste("Predictor", 1:(length(true_coefficients) 
             - 1))), las=3)
    points(true_coefficients, pch=23, bg="red")
    
    beanplot(data.frame(Raw=brier_score_logistic, 
            Oversampled=brier_score_logistic_oversampled),
      what=c(0,1,0,0), col="lightgray", main="Logistic regression: 
             Brier scores")
    beanplot(data.frame(Raw=brier_score_randomForest, 
      Oversampled=brier_score_randomForest_oversampled),
      what=c(0,1,0,0), col="lightgray", 
              main="Random Forest: Brier scores")

I've got more or less the same question: https://stats.stackexchange.com/questions/285231/what-problem-does-oversampling-undersampling-and-smote-solve ! — Matthew Drury, Jul 16 '18 at 21:41
I've also ran the same simulation, with an even wider selection of models, and a wider range of prior class probabilities, and observed the same results. Additionally, if you measure the AUC of your models, you'll notice that they are all the same, regardless of the class balance of your training data. I wonder about the source of this wide conception on the evils of class balance, where did it come from, how did we get to this point? — Matthew Drury, Jul 16 '18 at 21:44
And: the question isn't how we got to this point, but **how do we get away from it???** — Stephan Kolassa, Jul 16 '18 at 21:45
Agreed, but I still think the "how did we get here" problem is interesting! — Matthew Drury, Jul 16 '18 at 21:46
I just see that I had already upvoted your question. And [Tim's question](https://stats.stackexchange.com/q/283170/1352) you link to. I am getting old. Or it may be the alcohol. — Stephan Kolassa, Jul 16 '18 at 21:49
Honestly, knowing there is someone else out there that is mystified by the endless class balancing questions is comforting. — Matthew Drury, Jul 16 '18 at 21:52
"How did we get here?" is a great question. I don't know the definitive answer. But my hunch is that this all started when the machine learning community was only concerned with *accuracy*. Eventually someone pointed out that stupidly high accuracy can be achieved if (1) your classes are severely imbalanced and (2) you predict the majority class. Instead of measuring model quality with a metric other than accuracy, oversampling/SMOTE/etc were all invented to "solve" this problem. This isn't a history, just a story I made up based on my impressions and observable evidence. — Sycorax, Jul 16 '18 at 21:55
@Sycorax That is also my take on the tragedy. Combined with a lot of inherited wisdom digested without reflection. — Matthew Drury, Jul 16 '18 at 21:57
Is "how did we get here" a great question in the "Yes, ask that question on CV" sense, or in the, "I idly wonder about that also" sense? — Matthew Drury, Jul 16 '18 at 21:58
I imagine that unbalance data can screw up estimates for variance or dispersion. But maybe I am thinking with the wrong, non-machine-learning, perspective. So I look up the wikipedia https://en.wikipedia.org/wiki/Oversampling_and_undersampling_in_data_analysis and don't they describe a case which is more like 'if your predictor1 would not be sampled from the true population in an unbalanced (possibly biased) way'? — Sextus Empiricus, Jul 16 '18 at 21:58
The fixation on accuracy is reflected in some software, too, like Breiman's `randomForest` having built-in method to measure OOB accuracy but no other metric. This has some depressing consequences: people will tune the number of trees in a random forest for no reason other than they've been systematically mislead into thinking that doing so is meaningful. More discussion on this point wrt random forest: https://stats.stackexchange.com/questions/348245/do-we-have-to-tune-the-number-of-trees-in-a-random-forest — Sycorax, Jul 16 '18 at 22:01
I think a large part of this comes from "big data." For rare events, you need a lot of data, and perhaps before (say, 20 years ago), we saw less class imbalance because you'd have laughably few positive examples in your dataset, hence wouldn't even try using it. Nowedays you might easily have a dataset with millions of rows and say, a few hundred positive examples. — Alex R., Jul 16 '18 at 22:04
@StephanKolassa It is a bit difficult to find it in the code and the text, but are you comparing in those plots unbalanced data versus oversampled unbalanced data, or are you comparing balanced data versus oversampled unbalanced data? — Sextus Empiricus, Jul 16 '18 at 22:09
@Sycorax Don't get me started on the sklearn's developers decision to map the `predict` method on models to the hard decision rule thresholding the probabilities at `0.5`. — Matthew Drury, Jul 16 '18 at 22:11
@MartijnWeterings: I'm comparing an unbalanced sample of $n=10^4$ against a balanced subsample, which I get by taking all the (2-3% minority classes) plus an equal number sampled from the majority class. — Stephan Kolassa, Jul 16 '18 at 22:11
@MatthewDrury: For even more hair-pulling, try obtaining confidence intervals for a logistic regression in sklearn (hint: you can't). — Alex R., Jul 16 '18 at 22:13
@MatthewDrury: R's `predict.randomForest()` does the same by default, though you can at least specify `type="prob"`. — Stephan Kolassa, Jul 16 '18 at 22:13
In my experiments I build some samplers where I could adjust the prior class probabilities. I sampled datasets at 25 values of the prior class probabilities from 0.5 up to 0.99. Then I split into train and test, fit models to the train, and evaluated on the test. Using the AUC, there was no degradation in performance. I selected the AUC since it has the same baseline value regardless of prior class probability (while say, log-loss, changes baseline). I did this over many parameters in the sampler, which changed the structure of the data considerable, and for many different models. — Matthew Drury, Jul 16 '18 at 22:14
Your caveats are definitely prime examples of the need for over/under sampling. A good example is running word embeddings (say, Word2Vec), where there are massive class imbalances and rare occurances, which would otherwise get washed away without correcting for sampling. Keep in mind that oversampling doesn't only improve models in terms of their accuracy, but more importantly it *speeds up* model training especially for non-convex optimizations. — Alex R., Jul 16 '18 at 22:15
@StephanKolassa but is that really balancing when 2-3% minority class is the true representation of the population? Is balanced vs unbalanced about whether or not the groups are all equal numbered or whether the groups are representing the population? The more interesting case would be to see what happens when you increase the weights of the majority class in the under-sampled samples in order to get back to a proper representation. — Sextus Empiricus, Jul 16 '18 at 22:27
@MartijnWeterings To the vast majority of users of this site asking about class balancing, it's about having the positive and negative classes equally represented, which takes them AWAY from the population representation. — Matthew Drury, Jul 16 '18 at 22:29
This is really a duplicate of @MatthewDrury's question. However, that one did not get a satisfactory answer so +1. Maybe you should answer each other's questions with your simulations :-) Stephan, regarding your simulation: I don't understand where the bias in the oversampled results comes from. Why is the intercept estimate biased and why is the Brier score worse? I'd naively expect oversampling to not matter on average, at least in this setting. — amoeba, Jul 16 '18 at 22:31
@amoeba: Why wouldn't it change it? If anything oversampling distorts the original data distribution so it is "expected" that the baseline (i.e. the intercept) is shifted. If anything when reading the post I immediately thought "Yeah, obvious... Tell me about Predictor 1" — usεr11852, Jul 16 '18 at 22:43
@amoeba, you are changing the bias if you shift representation of the groups. The predictors are supposed to represent some probability for the one or other class (which should be 2-3% versus 97-98% and not 50-50%). To me this is a false idea of balancing. Balancing *is* correct when *done* correct. If one thing then this example actually shows how unbalanced (not 50-50, but instead unbalanced from that different interpretation) are indeed problematic because they create bias. — Sextus Empiricus, Jul 16 '18 at 22:46
If you increase the one class from it's true 2-3% to 50% then certainly the baseline may go up. — Sextus Empiricus, Jul 16 '18 at 22:49
@usεr11852 You are right. I was too quick to post my comment. However, here is what I had in mind: it is indeed no surprise that Brier score after oversampling is worse and also that the coefficients are wrong. As you say, this is because oversampling explicitly changes the baseline probability. But if somebody is doing oversampling then they are not interested in the correct probabilistic predictions. They are probably interested in accuracy. So, a question to Stephan: does the accuracy (conditioned on class 1 and 2) become lower for `model_logistic_oversampled` compared to `model_logistic`? — amoeba, Jul 17 '18 at 05:10
@amoeba: for the sake of this question I'd argue that one of the problems with accuracy (besides not being a proper scoring rule) is that there isn't *one* accuracy in the sense that unless sensitivity and specificity happen to be the equal, accuracy depends on the relative class frequencies. We thus get a "self-fullfilling prophecy": someone who oversamples for training will typically not look at accuracy at the natural relative frequencies but at accuracy for similarly oversmapled data. Thus, training and verifying a model that proper validation would deem irrelevant. — cbeleites unhappy with SX, Sep 24 '19 at 08:45
@StephanKolassa Thanks for this post, but I want to add a few more caveats to your point about oversampling. In some cases, the collected sample is not representative of the population; we may care more about the minority class (anomaly detection); or we simply _may not wish to replicate the population_. An example of the latter that comes to mind is the incident with Amazon's hiring AI which turned out to be sexist, presumably because it was trained on their _already sexist_ database of employees. In such cases, it makes all the sense to rebalance your dataset to train your algorithms. — Michael, Jan 23 '21 at 13:11
What does this sentence mean "_One should model class membership probabilities, and these may be small_" ? — Minsky, Jan 25 '21 at 19:20
@Minsky: in a two-class problem, what we should be interested in is the probability for an instance to belong to class A or B, conditional on the predictor values for that instance. In an "unbalanced" problem, these probabilities are small. But they may still be influenced by predictors. For example, the probability of defaulting on a loan may be 0.01 overall, but if a particular applicant has low income, low assets, no job and a history of defaulting, this probability may be 0.3 for this particular instance. (And 0.001 for someone with better characteristics.) ... — Stephan Kolassa, Jan 26 '21 at 07:04
... Note that even the "high risk" may have a risk lower than 0.5! Next, we need to make decisions based on these probabilities. We may offer the good risk better conditions, or not offer a loan to the bad risk at all. This decision should depend on the predicted probabilities of defaulting, and also the costs involved. (More precisely, we should model the probability of defaulting after a certain time, when some of the loan has already been repaid, so we have already had some of the principal and interest repaid.) — Stephan Kolassa, Jan 26 '21 at 07:07
I struggle to see how it is not an issue at some points - take e.g a Random Forest with Gini as splitting criteria. Imbalanced data here really can mess up the splits, due to the definition. I read this post/answer as "imbalanced data isn't a problem when handled correctly" which of course is a trivial statement, which can be applied to everything. The issue is often that the fitting of most classifiers are written for balanced data - and I wonder how that it is not an issue (if not handled)? — CutePoison, Jul 10 '21 at 17:31
@CutePoison: you are completely right that it's trivial that "unbalanced" data are not a problem when treated correctly - and also that many classifiers work only for balanced data. (Most of these IMO written by people with little statistical understanding.) Which causes no end of issues, mainly if classifiers are evaluated using accuracy and similar - witness almost daily questions here on CV. And yes, the answer is simple: don't use inappropriate models, or KPIs. This is not rocket science, but it's still apparently less well known than it should be. — Stephan Kolassa, Jul 12 '21 at 08:27
I was wondering the same thing. I had great accuracy since my stroke target was imbalanced 95/5 with the [0, no stroke] outcome as the majority. My baseline as instructed by my university prof was 95% and I cringed at the thought of trying to build a model to surpass 95%. Originally I downsampled the majority but still couldn't get better than baseline so I switched to upsampling the minority. After that I was able to get 98% using CART, KKN. But the whole time I felt like a fraud because of balancing and aiming for near 100% which is obviously insane. — Edison, Aug 29 '21 at 05:34
@Edison: [per this thread](https://stats.stackexchange.com/questions/312780/why-is-accuracy-not-the-best-measure-for-assessing-classification-models), don't use accuracy. Also: [Is accuracy an improper scoring rule in a binary classification setting?](https://stats.stackexchange.com/q/359909/1352) and [Classification probability threshold](https://stats.stackexchange.com/q/312119/1352). Instead, use probabilistic classifications, and evaluate these using [proper scoring rules](https://stats.stackexchange.com/tags/scoring-rules/info). — Stephan Kolassa, Aug 29 '21 at 08:46
@StephanKolassa Thanks for that. Btw, for binary classification, can we use recall and precision as mentioned in other threads if we don't want to use Scoring Rules? And what if I have already upsampled my minority outcome? Is it then ok to use accuracy or recall or precision? — Edison, Aug 29 '21 at 12:41
@Edison: you *can* use them, as in "you *can* choose to shoot yourself in the foot". It's still not a good idea, because optimizing them will give you biased estimates and predictions, completely analogously to how optimizing accuracy will (as in: if you have no useful predictors, then optimizing accuracy will automatically lead you to always predicting the majority class). — Stephan Kolassa, Aug 29 '21 at 12:52
Just to be clear because I'm still a student, you are saying I **should not use** `(accuracy, recall, precision)` for binary classification **in any circumstance** *even if I have a balanced target or have balanced the target myself?* In binary classification **only** use Scoring Rules? Would using recall or precision be better than using accuracy or would it be just as detrimental? — Edison, Aug 29 '21 at 13:07
@Edison: yes, that is exactly what I am recommending. [This thread may be helpful.](https://stats.stackexchange.com/a/368979/1352) I also do not see why recall or precision should be preferable to accuracy. (My nagging suspicion is that someone noticed problems with accuracy and looked for *some* other KPI that looked like adding it to the mix would improve matters - without digging deeply enough and noticing that *the problem lies with hard 0-1 classifications in the first place*.) — Stephan Kolassa, Aug 29 '21 at 13:16
Thanks. I'm so glad we are being taught to use accuracy for binary classification in my graduate program ;) — Edison, Aug 29 '21 at 13:50
@Sycorax: I think you are right. It might have been started with the ML community. However to me it's not about accuracy but convergence. Despite having this conversation (and other great ones about proper metrics and synthetic data) in mind I tried some class weights in my latest "Deep NN" experiment. I haven't completely sorted it out, but class weights seems to have an impact on convergence. Sometimes it leads to faster convergence, sometimes it helps avoid local minima and get better performance. It also seems to allievate 'seed-dependance' phenomenon. — lcrmorin, Sep 08 '21 at 10:16
All this to say that there might be some empirical evidence from the ML community in favor of rebalancing classes. And the point where it would be usefull (reducing convergence time, reducing dependance to initialisation, avoiding local minima traps) are not covered with the logistic example above. — lcrmorin, Sep 08 '21 at 10:21
@lcrmorin This paper may be of interest [What is the Effect of Importance Weighting in Deep Learning?](http://proceedings.mlr.press/v97/byrd19a.html) by Jonathon Byrd, Zachary Lipton. — Sycorax, Sep 08 '21 at 13:31
Thank you for sharing. Somehow I wasn't able to find it when I was looking for litterature. It does seem to confirm there are positive impacts of weighting for DL. In short: weigthing implies earlier convergence, interacts with L2 regularisation, batch norm but not drop out. — lcrmorin, Sep 08 '21 at 14:23
This should probably be reformatted so the bulk of the question becomes an answer. Right now it can't be used as a target when flagging duplicates — Hong Ooi, Sep 16 '21 at 13:23
@Sycorax another element of "how did we get here" might be: there's tons of recent academic papers about SMOTE and its close relatives, but I cannot find a single good reference paper that I can point people to and that explains the issue and the "correct" way of solving these problems clearly. — jhin, Nov 11 '21 at 19:53
General comment about statistical analysis: any method that disrespects the original sample size and how the sample came about is bogus. — Frank Harrell, Jan 05 '22 at 13:42
I only just noticed this, but could you please remove the rm(list=ls()) line from your code? — Dave, Jan 05 '22 at 16:31
@Dave: to be honest, I would rather keep it in. I have too often seen "reproducible" code that only ran because the R workspace contained something the poster forgot to define in their code. I'll take it out because you asked so nicely... — Stephan Kolassa, Jan 05 '22 at 16:59

Dikran Marsupial · Accepted Answer · 2022-01-08T09:30:22.427

I'd like to start by seconding a statement in the question:

... my point is that the questions on unbalanced datasets at CV do not mention such a tradeoff, but treat unbalanced classes as a self-evident evil, completely apart from any costs of sample collection.

I also have the same concern, my questions here and here are intended to invite counter-evidence that it is a "self-evident evil" the lack of answers (even with a bounty) suggests it isn't. A lot of blog posts and academic papers don't make this clear either. Classifiers can have a problem with imbalanced datasets, but only where the dataset is very small, so my answer is concerned with exceptional cases, and does not justify resampling the dataset in general.

There is a class imbalance problem, but it is not caused by the imbalance per se, but because there are too few examples of the minority class to adequately describe it's statistical distribution. As mentioned in the question, this means that the parameter estimates can have high variance, which is true, but that can give rise to a bias in favour of the majority class (rather than affecting both classes equally). In the case of logistic regression, this is discussed by King and Zeng,

3 Gary King and Langche Zeng. 2001. “Logistic Regression in Rare Events Data.” Political Analysis, 9, Pp. 137–163. https://j.mp/2oSEnmf

[In my experiments I have found that sometimes there can be a bias in favour of the minority class, but that is caused by wild over-fitting where the class-overlap dissapears due to random sampling, so that doesn't really count and (Bayesian) regularisation ought to fix that]

The good thing is that MLE is asymptotically unbiased, so we can expect this bias against the minority class to go away as the overall size of the dataset increases, regardless of the imbalance.

As this is an estimation problem, anything that makes estimation more difficult (e.g. high dimensionality) seems likely to make the class imbalance problem worse.

Note that probabilistic classifiers (such as logistic regression) and proper scoring rules will not solve this problem as "popular statistical procedures, such as logistic regression, can sharply underestimate the probability of rare events" 3. This means that your probability estimates will not be well calibrated, so you will have to do things like adjust the threshold (which is equivalent to re-sampling or re-weighting the data).

So if we look at a logistic regression model with 10,000 samples, we should not expect to see an imbalance problem as adding more data tends to fix most estimation problems.

So an imbalance might be problematic, if you have an extreme imbalance and the dataset is small (and/or high dimensional etc.), but in that case it may be difficult to do much about it (as you don't have enough data to estimate how big a correction to the sampling is needed to correct the bias). If you have lots of data, the only reason to resample is because operational class frequencies are different to those in the training set or different misclassification costs etc. (if either are unknown or variable, your really ought to use a probabilistic classifier).

This is mostly a stub, I hope to be able to add more to it later.

Thank you, I am looking forward to your expanding this. If I understand you correctly, the class imbalance problem you see is high variance of parameter estimates, right? It seems to me that oversampling etc. would not address this, correct? — Stephan Kolassa, Jan 05 '22 at 14:00
@StephanKolassa in principle it can, but you need to know the right amount of oversampling to apply (or equivalently reweighting or threshold adjustment), which is going to be difficult if you don't have enough data to estimate the model in the first place. King takes the threshold adjustment approach, as it can be analytically approximated for logistic regression, but it is not clear it has great practical utility. — Dikran Marsupial, Jan 05 '22 at 14:05
@StephanKolassa I think I didn't give a very direct answer to your question. I think what is happening is that the variance in the parameter estimates causes the undue bias against the minority class because of the structure of the problem. Reducing the variance ought to reduce the bias, but resampling (or better re-weighting the data in the loss) can address the bias directly. However, it will be difficult to estimate how much regularisation or re-sampling/re-weighting is required in practice. — Dikran Marsupial, Jan 08 '22 at 09:26
I just finished reading the King & Zeng paper, thank you for drawing my attention to it! It was most illuminating, and I have learned something today. (To add to your comments, they add one additional reason in favor of nonrandom sampling: the costs of collecting data.) — Stephan Kolassa, Jan 09 '22 at 13:09
My intuition is that logistic regression is likely to be more robust to this sort of bias than most, and that some classifiers may be more susceptible to problems in practical applications (but still with small datasets) and that is perhaps why there is some perception that class imbalance is a problem. — Dikran Marsupial, Jan 09 '22 at 13:15