
There are both theoretical and computational aspects to this question.

I was trying to use SMOTE to reduce class imbalance in a rather large dataset of about 8 million rows. The data has a binary outcome variable and 5 categorical variables. I was using the Python imbalanced-learn package, but it consumed essentially all 64 GB of my RAM and kept crashing without producing a result. That is an understandable outcome, since operations on an 8-million-row matrix, such as computing nearest neighbors or generating synthetic rows, are computationally expensive.

So I was trying to figure out strategies to handle the computation better. Since SMOTE, ADASYN, and other similar tools rely on nearest-neighbor matches, is there a way to break the dataset down into pieces, run the algorithm on each piece, and then reconstruct the full dataset? I have not seen any articles on anything like this. I can think of a few different ways to do it, but I am not sure whether there has been any experimentation along these lines.
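
For concreteness, here is a rough sketch of the kind of chunk-and-recombine scheme I have in mind, using plain SMOTE from imbalanced-learn on numerically encoded features (the chunk count, the function name, and the assumption that every chunk retains enough minority examples are arbitrary choices of the sketch, not anything the package provides):

    import numpy as np
    from imblearn.over_sampling import SMOTE

    def smote_in_chunks(X, y, n_chunks=20, random_state=0):
        """Oversample each row-wise chunk independently, then stack the results.

        Assumes X is a 2-D numeric array and that every chunk still contains
        at least k_neighbors + 1 minority examples; otherwise SMOTE will fail.
        """
        rng = np.random.default_rng(random_state)
        order = rng.permutation(len(X))            # shuffle so chunks mix both classes
        X_parts, y_parts = [], []
        for rows in np.array_split(order, n_chunks):
            sm = SMOTE(random_state=random_state)  # nearest neighbors computed within the chunk only
            X_res, y_res = sm.fit_resample(X[rows], y[rows])
            X_parts.append(X_res)
            y_parts.append(y_res)
        return np.vstack(X_parts), np.concatenate(y_parts)

The obvious caveat is that neighbors are only ever found within a chunk, so the synthetic points are not the same ones a full-dataset SMOTE would produce.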

krishnab
  • Why do you think unbalance is a problem? See https://stats.stackexchange.com/questions/283170/when-is-unbalanced-data-really-a-problem-in-machine-learning, https://stats.stackexchange.com/questions/235808/binary-classification-with-strongly-unbalanced-classes, https://stats.stackexchange.com/questions/6067/does-an-unbalanced-sample-matter-when-doing-logistic-regression – kjetil b halvorsen Dec 22 '19 at 13:43
  • I mean the imbalance problem is pretty obvious when I run the data. The training output reads 99.999% accuracy while the false negative rate is 100%, meaning that the model gets every single true case wrong. Resampling actually helped with this problem. – krishnab Dec 22 '19 at 16:55
  • Which method do you use for classification? Please read those linked posts more carefully, and this one: https://stats.stackexchange.com/questions/404960/imbalanced-data-set-rare-class-v-s-rare-events/404962#404962 Accuracy is not a proper scoring rule, and should not be used! – kjetil b halvorsen Dec 22 '19 at 23:57
  • Were you able to find a solution to using SMOTE for very large data? It is a problem that I am currently having and I would like to know if you have any solutions. – Bruno Manuel Cavagnaro Olcese Mar 06 '21 at 16:43
  • @BrunoManuelCavagnaroOlcese I basically split each batch into, say, 2 chunks. For the first chunk I just randomly select examples as usual. For the second chunk, I sample only the low-frequency examples. That way, every batch always contains examples of the uncommon cases, so the loss cannot be driven down by simply predicting that all examples belong to the same class; a rough sketch of this is shown after the comments. – krishnab Mar 07 '21 at 15:22
  • @krishnab "meaning that it gets every single True case wrong." That is not evidence of a class imbalance *problem*. For some learning tasks that is the optimal solution, as the density of minority-class patterns is never higher than that of majority-class patterns anywhere in the input space. This seems to be a difficult problem to diagnose: https://stats.stackexchange.com/questions/539638/how-do-you-know-that-your-classifier-is-suffering-from-class-imbalance – Dikran Marsupial Aug 19 '21 at 16:48
  • What’s wrong with getting $99.999\%$ accuracy? // I’d be interested in the ROCAUC of that model. With a gigantic class imbalance, it might be that the minority class never gets a predicted probability above the default cutoff threshold of $0.5$. ROCAUC is not a strictly proper scoring rule like log loss or Brier score, but it does not depend on a single threshold like accuracy or false negative rate. – Dave Dec 19 '21 at 05:15
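
A rough sketch of the batching idea krishnab describes in the comment above (not his actual code; the function name, batch size, and 50/50 split are illustrative assumptions): half of each batch is drawn uniformly at random, the other half only from the minority class, so every batch contains rare-class examples.

    import numpy as np

    def mixed_batches(y, batch_size=256, minority_label=1, random_state=0):
        """Yield index batches: one half drawn uniformly, one half minority-only.

        Assumes y is a 1-D NumPy array of class labels.
        """
        rng = np.random.default_rng(random_state)
        all_idx = np.arange(len(y))
        minority_idx = all_idx[y == minority_label]
        half = batch_size // 2
        while True:
            uniform_part = rng.choice(all_idx, size=half, replace=False)      # ordinary random draw
            minority_part = rng.choice(minority_idx, size=half, replace=True)  # rare class only
            yield np.concatenate([uniform_part, minority_part])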

1 Answer


Regarding the computational question, I'd look into parallel computing. Maybe there's a way to split up the task and let every core run a part of the algorithm, though this task combined with the size of the dataset is definitely not meant for a casual home PC.
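
As a rough sketch only (the chunking scheme and the use of joblib are my assumptions, not something imbalanced-learn ships out of the box), the per-chunk resampling could be farmed out to worker processes like this; note that each worker still holds a copy of its chunk, so this trades memory for time rather than eliminating the memory cost:

    import numpy as np
    from joblib import Parallel, delayed
    from imblearn.over_sampling import SMOTE

    def _resample_chunk(X_chunk, y_chunk, random_state=0):
        # Each worker runs SMOTE on its own small chunk.
        return SMOTE(random_state=random_state).fit_resample(X_chunk, y_chunk)

    def parallel_smote(X, y, n_chunks=32, n_jobs=8, random_state=0):
        rng = np.random.default_rng(random_state)
        chunks = np.array_split(rng.permutation(len(X)), n_chunks)  # shuffled row indices
        results = Parallel(n_jobs=n_jobs)(
            delayed(_resample_chunk)(X[rows], y[rows]) for rows in chunks
        )
        X_parts, y_parts = zip(*results)
        return np.vstack(X_parts), np.concatenate(y_parts)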

Regarding the theoretical question (besides using resampling techniques like SMOTE, ROSE, ADASYN, and many others), I'd look into cost-sensitive learning and into switching to performance metrics other than accuracy, which is definitely not what you want to use for imbalanced-data classification. Rather, use the AUC, F1 score, precision & recall, etc.
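
For instance, a minimal scikit-learn sketch (the synthetic data, the logistic regression model, and the class ratio are placeholders, not a recommendation for the 8-million-row problem): class_weight="balanced" gives a simple form of cost-sensitive training, and roc_auc_score / f1_score replace plain accuracy.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import f1_score, roc_auc_score
    from sklearn.model_selection import train_test_split

    # Toy imbalanced data: roughly 0.5% positives.
    X, y = make_classification(n_samples=100_000, weights=[0.995], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    # Cost-sensitive learning: errors on the rare class are weighted more heavily.
    clf = LogisticRegression(class_weight="balanced", max_iter=1000)
    clf.fit(X_tr, y_tr)

    proba = clf.predict_proba(X_te)[:, 1]
    print("ROC AUC:", roc_auc_score(y_te, proba))        # threshold-free ranking metric
    print("F1     :", f1_score(y_te, clf.predict(X_te)))  # balances precision and recall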

RazorLazor