
I have a huge dataset, say around 100M data points, with a class imbalance of 1 positive for every 100 negatives. It is very difficult to train on the entire dataset, so I tend to undersample the negatives such that the training data becomes balanced (1:1). But the test set remains imbalanced to reflect the real life nature of the data. FYI, I use a simple feedforward neural network

How do I go about training in such a combination? I'd use class weights during training but I'm thinking it might overpredict on the test set? Moreover how do I evaluate this model with AUC and AUPRC, do I need to use some form of weighting?

kjetil b halvorsen
HMK
  • Unbalanced classes are almost certainly not a problem, and oversampling, undersampling or weighting will not solve this non-problem: [Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?](https://stats.stackexchange.com/q/357466/1352) – Stephan Kolassa Dec 07 '21 at 10:35
  • Don't use accuracy, precision, recall, sensitivity, specificity, or the F1 score. Every criticism at the following threads applies equally to all of these, and indeed to all evaluation metrics that rely on hard classifications: [Why is accuracy not the best measure for assessing classification models?](https://stats.stackexchange.com/q/312780/1352) [Is accuracy an improper scoring rule in a binary classification setting?](https://stats.stackexchange.com/q/359909/1352) [Classification probability threshold](https://stats.stackexchange.com/q/312119/1352) – Stephan Kolassa Dec 07 '21 at 10:35
  • Instead, use probabilistic classifications, and evaluate these using [proper scoring rules](https://stats.stackexchange.com/tags/scoring-rules/info). – Stephan Kolassa Dec 07 '21 at 10:35
  • @StephanKolassa, I interpreted "It is very difficult to train on the entire dataset" to mean because of the very large size, rather than a statement of trying to "fix" the imbalance. – Ben Reiniger Dec 07 '21 at 23:18
  • @BenReiniger: guilty as charged, of using boilerplate comments, which I have in a txt file because imbalance/oversampling questions crop up here *regularly*. I unfortunately don't have the time to tailor answers to every one of these questions and believe that these comments may still be helpful, even if they don't address the exact question. In the present case, I would recommend sampling the training data "as is", without over-/undersampling a class. – Stephan Kolassa Dec 08 '21 at 07:53
  • The choice of model performance statistic depends on the aim of the analysis or the needs of the application. Proper scoring rules are a good idea for model selection, but model selection and performance evaluation are not the same thing. If you are using AUC, that implies you are primarily interested in the ranking of the patterns, in which case the imbalance is likely to be irrelevant and you don't need to do anything. – Dikran Marsupial Dec 08 '21 at 15:37
  • Whatever you do, don't undersample unless you need to in order to make the problem computationally tractable. First you need to work out what performance metric is important for your application, and why, and then work out what to do. The answer is not necessarily a proper scoring rule: https://stats.stackexchange.com/a/538524/887 – Dikran Marsupial Dec 08 '21 at 15:38
  • Sadly the answer to @StephanKolassa's question https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he is not correct (or at least is not a good answer to the question). Class imbalance problems can arise if you don't have enough data to characterize the distribution of the minority class. You will not observe a class imbalance problem if you provide a large amount of data. Most estimation problems go away if you throw enough data at them. – Dikran Marsupial Dec 08 '21 at 15:44
  • @DikranMarsupial: I confess I am a bit surprised. "You will not observe a class imbalance problem if you provide a large amount of data." Can you explain what you mean by that, perhaps add a few references? – Stephan Kolassa Dec 08 '21 at 15:47
  • I'm working on a tutorial on this. One way of looking at it is that the maximum likelihood estimators have a bias; the paper by Gary King mentions this: https://gking.harvard.edu/files/0s.pdf . The class imbalance problem does exist, but as far as I can see, there isn't actually anything you can do about it (as there isn't enough data to work out how much correction to apply). It is not clear that it can even be diagnosed: https://stats.stackexchange.com/questions/539638/how-do-you-know-that-your-classifier-is-suffering-from-class-imbalance – Dikran Marsupial Dec 08 '21 at 15:51
  • I have discussed this in some of my answers on related questions, but I can't find it immediately. Essentially the problem isn't the imbalance per se, but that if you have an imbalanced dataset, then it needs to be very large in order to provide sufficient data to characterise the distribution of the minority class properly, so it is effectively a small-sample estimation problem. – Dikran Marsupial Dec 08 '21 at 15:54
  • Just to add: MLE has good asymptotic properties, so as you add more data, the bias rapidly becomes too small to matter. – Dikran Marsupial Dec 08 '21 at 15:57
  • I disagree that Stephan Kolassa gives a poor answer there. I think it just addresses a different issue. Dikran, your point a couple of times has been that class imbalance is a problem when it means that you lack data on the minority class, but that doesn't make class imbalance the issue. It means that lack of data is the issue. – Dave Dec 08 '21 at 16:07
  • Yes, if we have $1000$ observations divided up $990$ and $10$, we probably can't characterize the distribution of the minority class like we could with $500$ and $500$, but if we have $1000000$ observations divided up $990000$ and $10000$, we still have the same class imbalance but lack the issue of having too few observations to characterize the distribution of the minority class. – Dave Dec 08 '21 at 16:07
  • @Dave, that *is* the class imbalance problem. There is no other problem with imbalanced classes. AFAIK, if you can suggest another problem with them, I'd be keen to hear about it. BTW my experience with SVMs suggest they don't have a problem with imbalance in large datasets either. Also I didn't say it was a poor answer, I said it wasn't a good one. My choice of words was deliberate and careful. – Dikran Marsupial Dec 08 '21 at 16:24
  • I don't see class imbalance as a problem, but think about how many approaches there are to dealing with class imbalance. People will use sensitivity, specificity, $F_1$ score since accuracy can be an impressive-looking $99\%$ even though the prior probability of being in one of the classes is $0.995$. People downsample their data to get the classes to be balanced. People run [SMOTE](https://twitter.com/f2harrell/status/1062424969366462473?lang=en) to balance the data. *Someone* thinks it's an issue not to have a 50/50 split, even if we disagree with that. – Dave Dec 08 '21 at 16:50
  • Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/132159/discussion-between-dikran-marsupial-and-dave). – Dikran Marsupial Dec 08 '21 at 17:14

1 Answer


The primary effect of downsampling like this on a model is a shift in the predicted log-odds. This is shown rigorously for logistic regression (see https://stats.stackexchange.com/a/68726/232706); for other models I've observed the same effect (though I don't do a lot of neural nets). Assuming that really holds in your case, you can "fix" the probability estimates by adding the adjustment term given in the above link. Note too that such a monotonic adjustment should not affect AUROC or AUPRC at all.
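As a concrete sketch of that correction (the function name and the 1:1 / 1-in-101 rates here are illustrative, not from the question), the intercept adjustment from the linked answer can be applied directly to the model's predicted probabilities:

```python
import numpy as np

def adjust_probs(p_model, train_pos_rate, true_pos_rate):
    """Shift predictions from the balanced training prior back to the
    true prior: logit(p_adj) = logit(p_model)
                              + logit(true_pos_rate) - logit(train_pos_rate)."""
    logit = np.log(p_model) - np.log1p(-p_model)
    correction = (np.log(true_pos_rate) - np.log1p(-true_pos_rate)
                  - np.log(train_pos_rate) + np.log1p(-train_pos_rate))
    return 1.0 / (1.0 + np.exp(-(logit + correction)))

# trained on 1:1 data, deployed where positives are 1 in 101
p_adj = adjust_probs(np.array([0.5, 0.9]), 0.5, 1 / 101)
# p_adj[0] == 1/101: a "coin-flip" score maps back to the base rate
```

Because the correction is just a constant shift in log-odds, it is strictly monotone and leaves the ranking of the predictions (and hence AUROC/AUPRC) untouched.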

Using class weights during training instead should produce a very similar effect; see https://datascience.stackexchange.com/a/58899/55122.
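A minimal sketch of that alternative, using scikit-learn's "balanced" class weighting in place of resampling (a logistic regression and toy data stand in for your feedforward net; most NN frameworks accept per-class or per-sample loss weights in the same spirit):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# toy data with a 1:100 imbalance, purely for illustration
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (10_000, 2)),   # negatives
               rng.normal(1.0, 1.0, (100, 2))])     # positives
y = np.r_[np.zeros(10_000), np.ones(100)]

# "balanced" weights each class inversely to its frequency,
# mimicking 1:1 undersampling without discarding any negatives
w = compute_class_weight("balanced", classes=np.array([0.0, 1.0]), y=y)
clf = LogisticRegression(class_weight="balanced").fit(X, y)
proba = clf.predict_proba(X)[:, 1]
```

Note that, just like undersampling, this inflates the predicted probabilities relative to the true base rate, so the same log-odds correction applies if you need calibrated probabilities.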

You have "plenty" of positive examples (difficult to say without more context, but 1M is a lot), so @StephanKolassa's suggestion in a comment to sample the training data as-is, without altering the class balance, may also be fine. In other contexts, where the positive class is so small that you wouldn't want to throw away any information about it, I think downsampling the giant negative class is fine (and note that Scortchi, in the first link, addresses exactly this case).
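On evaluation: no weighting is needed for AUROC or AUPRC; just compute them on the untouched, imbalanced test set. A quick sketch with simulated scores (the data here is made up purely to demonstrate that the monotone prior correction changes neither metric):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(1)
# simulated imbalanced test set: roughly 1 positive per 100 negatives
y_test = (rng.random(20_000) < 1 / 101).astype(int)
raw = np.clip(rng.normal(0.3 + 0.3 * y_test, 0.15), 1e-6, 1 - 1e-6)

auc = roc_auc_score(y_test, raw)              # on the imbalanced data as-is
auprc = average_precision_score(y_test, raw)

# the prior correction is a constant shift in log-odds: rank-preserving
shifted = 1 / (1 + np.exp(-(np.log(raw / (1 - raw)) - np.log(100))))
```

Both metrics are functions of the score ranking only, so `auc` and `auprc` come out identical whether you evaluate `raw` or `shifted`.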

Ben Reiniger