I've been trying to build a binary classification model using multivariate logistic regression with the caret package in R. My dataset consists of around 20,000 observations, of which more than 99% belong to the X class and less than 1% to the Y class, so it is an unbalanced dataset.

According to the book by Max Kuhn and Kjell Johnson (Applied Predictive Modeling, Springer 2013), class imbalance can be managed by either downsampling the majority class or upsampling the minority class before training the model. I decided to test both solutions on the same training dataset to compare the results.

The downsampled dataset consisted of 822 observations (411 in each class) and the upsampled dataset of 45,272 observations (22,636 in each class). Both datasets are now "balanced", but I'm not sure which approach to choose. Below I show the models' performance on the training dataset (10-fold CV repeated 5 times).
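For reference, this is roughly how I built the resampled sets and ran the cross-validation (a sketch only; `preds` and `outcome` are placeholders for my predictor data frame and two-level class factor):

    library(caret)

    # Sketch: `preds` = data frame of predictors, `outcome` = factor with
    # levels "X" and "Y". Both names are placeholders.
    down <- downSample(x = preds, y = outcome, yname = "class")
    up   <- upSample(x = preds, y = outcome, yname = "class")

    ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5,
                         classProbs = TRUE, summaryFunction = twoClassSummary)

    fit_down <- train(class ~ ., data = down, method = "glm",
                      family = binomial, metric = "ROC", trControl = ctrl)
    fit_up   <- train(class ~ ., data = up,   method = "glm",
                      family = binomial, metric = "ROC", trControl = ctrl)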

In terms of sensitivity and specificity, both options (upsampling and downsampling) gave me similar results, although the standard deviations of the metrics were roughly 10-fold greater in the downsampled case:

UPSAMPLING results:

    ROC        Sens       Spec      ROC SD       Sens SD     Spec SD
    0.7711678  0.7011926  0.697951  0.005977932  0.00834598  0.009579698

DOWNSAMPLING results:

    ROC        Sens       Spec       ROC SD     Sens SD     Spec SD
    0.7781663  0.7212311  0.4199943  0.0445779  0.06813285  0.0724861

However, in terms of the predictors' significance, only four predictors were significant in the downsampled case:

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -6.561091   1.507289  -4.353 1.34e-05 ***
sexFemale   -0.002311   0.217136  -0.011    0.992    
age          0.044491   0.006457   6.890 5.57e-12 ***
smokingSi    0.004606   0.234458   0.020    0.984    
drinkingSi   0.017497   0.185291   0.094    0.925    
diabHistSi   0.732457   0.163528   4.479 7.50e-06 ***
htDXSi       0.010499   0.222508   0.047    0.962    
height      -0.007022   0.007923  -0.886    0.375    
waist        0.022091   0.005598   3.947 7.93e-05 ***
aveSP        0.024395   0.005420   4.501 6.77e-06 ***

In contrast, for the upsampled case, all of the predictors were significant:

              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -6.0755790  0.2305014 -26.358  < 2e-16 ***
sexFemale   -0.1409143  0.0302186  -4.663 3.11e-06 ***
age          0.0304018  0.0008032  37.849  < 2e-16 ***
smokingSi   -0.0691276  0.0309232  -2.235  0.02539 *  
drinkingSi   0.0538318  0.0243686   2.209  0.02717 *  
diabHistSi   0.6493554  0.0214752  30.238  < 2e-16 ***
htDXSi      -0.0809236  0.0281700  -2.873  0.00407 ** 
height      -0.0105632  0.0012510  -8.444  < 2e-16 ***
waist        0.0312969  0.0008087  38.699  < 2e-16 ***
aveSP        0.0237232  0.0006948  34.146  < 2e-16 ***
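One thing I noticed: duplicating rows mechanically inflates the apparent sample size, which shrinks the standard errors by roughly the square root of the duplication factor, so the across-the-board significance in the upsampled fit may be partly an artifact. A quick sketch on synthetic data (all names made up) reproduces the effect:

    set.seed(1)
    n <- 200
    d <- data.frame(x = rnorm(n))
    d$y <- rbinom(n, 1, plogis(-2 + d$x))

    fit    <- glm(y ~ x, family = binomial, data = d)
    d_up   <- d[rep(seq_len(n), each = 10), ]  # every row duplicated 10 times
    fit_up <- glm(y ~ x, family = binomial, data = d_up)

    summary(fit)$coefficients[, "Std. Error"]
    summary(fit_up)$coefficients[, "Std. Error"]  # about 1/sqrt(10) as large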

Which one do you think is better? On the one hand, by downsampling the dataset I'm discarding almost 20,000 observations from the majority class. On the other hand, by upsampling the minority class I'm duplicating the same 400 observations several times...

I know that I could instead pick a different classification threshold on the ROC curve to deal with the original unbalanced dataset, but I've tried that and I'm not getting good results.
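(For example, this is roughly what I tried with pROC; `truth` and `probs` are placeholders for the observed classes and the predicted probabilities of class Y:)

    library(pROC)

    # Placeholders: `truth` = observed classes, `probs` = predicted P(class = "Y")
    roc_obj <- roc(truth, probs, levels = c("X", "Y"))
    coords(roc_obj, "best", best.method = "youden",
           ret = c("threshold", "sensitivity", "specificity"))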

I also know that other methods like support vector machines can use a cost function in order to identify cases of the minority class, but I need the model to be interpretable and "user friendly". That's why I'm using logistic regression.

Gerardo Felix

1 Answer

NEVER use downsampling to make a method work. If the method is any good it will work under imbalance. Removal of samples is not scientific. Logistic regression works well under extreme imbalance. Also (1) logistic regression is not a classification method, (2) make sure you use proper accuracy scoring rules, and (3) logistic regression is not a multivariate (multiple dependent variables) method. It is a multivariable regression method.
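For concreteness, a minimal sketch of fitting on the full, imbalanced data and judging the fit with proper indexes (the data frame and outcome names are assumptions; predictor names are taken from the question's output):

    library(rms)

    # Sketch with assumed names: `dat` = the full (imbalanced) dataset,
    # `outcome` = its two-level response. Printing the lrm fit reports the
    # Brier score, the c-index (ROC area), and likelihood-based indexes,
    # rather than classification accuracy.
    fit <- lrm(outcome ~ sex + age + smoking + drinking + diabHist +
                 htDX + height + waist + aveSP, data = dat)
    fit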

Frank Harrell
  • Thank you! I agree with you. I don't like removing samples (I'm not that comfortable duplicating samples either), but I'm getting 0 sensitivity and 1 specificity, with an accuracy of 98%. I've been using the "Kappa" metric to pick the best model, but I get the same results. Would upsampling or the SMOTE algorithm be better? – Gerardo Felix Mar 01 '16 at 14:57
  • If you insist on using discontinuous improper accuracy scoring rules you will continue to see such anomalies. – Frank Harrell Mar 01 '16 at 16:43
  • What scoring rules should I be using then? Can you elaborate a little? I'd appreciate it... – Gerardo Felix Mar 01 '16 at 22:20
  • See Section 10.6 (Chapter 10) of the Course Notes at http://biostat.mc.vanderbilt.edu/rms – Frank Harrell Mar 02 '16 at 02:39
  • Downsampling is widely used by tech companies; I'm not sure why you advise so strongly against it. – avocado Jan 19 '21 at 13:03
  • You think that the fact that tech companies do something exceedingly stupid makes it OK? **ANY** method that requires you to delete data is ill-advised. And the link I provided above has been updated to https://hbiostat.org/rms – Frank Harrell Jan 19 '21 at 13:16
  • @avocado Do you have an example of it being used in a situation where proper statistical methods would be less effective? (Does the community expect someone to respond that no such example exists?) – Dave Jun 17 '21 at 14:16
  • I don't have an idea of what that question means. Please re-phrase. – Frank Harrell Jun 17 '21 at 20:13
  • Tech companies use downsampling because their models are low-risk. So whether you throw away data, use a bad scoring rule, or misclassify people... it hardly matters. What matters is that they make automated predictions at large scale, not the quality of individual predictions: quantity > quality. – JTH Jun 17 '21 at 21:30
  • This is terrible statistical practice and will result in suboptimal decisions. Don't they care about that? Probability models work fine with extremely imbalanced data, and reflect the reality that when an outcome is very rare you can't predict outcomes, you can only predict tendencies. And you still need a utility/loss/cost function separate from all that to get optimal decisions. https://hbiostat.org/post/classification – Frank Harrell Jun 17 '21 at 22:26
  • @JTH Thanks for your comment, but I can hardly agree with the statement *quantity > quality*. Could you give an example of what quality means here? As a counter-example, it's widely known in the ads industry that predictions of clicks or purchases should be as accurate as possible, since advertisers are charged through a bidding formula that relies on those predictions; if the predictions are not high-*quality*, advertisers end up wasting/winning on the ads. – avocado Jun 19 '21 at 09:30
  • @FrankHarrell Appreciate your patience in the comments! I actually didn't mean that *the fact that tech companies do something exceedingly stupid makes it OK*; what I was trying to ask is why they do it, since there has to be a reason for something so widespread. I agree with you that *probability models work fine with extremely imbalanced data, and reflect the reality that when an outcome is very rare you can't predict outcomes*, but you also mentioned that *you can only predict tendencies*. Could you please elaborate on the difference between probabilities and *tendencies*? – avocado Jun 19 '21 at 09:44
  • Tendency = probability – Frank Harrell Jun 19 '21 at 12:09
  • Returning to @GerardoFelix's question about what to use: I certainly don't want to put words in @FrankHarrell's mouth, but when I was writing an [ML paper](https://arxiv.org/abs/2105.09379) with unbalanced classes, I decided to use the Brier score. If you are using `caret`, maybe this Stack Exchange answer will help: https://stackoverflow.com/questions/61014688/r-caret-package-brier-score/67117137#67117137 – Avraham Dec 22 '21 at 16:48
  • Brier works fine in that setting. To better interpret it you might randomly permute the outcome variable and recompute Brier to get the worst case (highest). – Frank Harrell Dec 22 '21 at 16:58
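A minimal sketch of that check (hypothetical objects: `p` = the model's predicted probabilities, `y` = the observed 0/1 outcome):

    # Brier score of the fitted probabilities, plus the permuted-outcome
    # reference suggested above (an estimate of the worst case).
    brier <- mean((p - y)^2)

    set.seed(42)
    brier_worst <- mean((p - sample(y))^2)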