I’m working on a classification problem where the dataset is extremely imbalanced (roughly 13,000 “zero” and 100 “one” responses).

As a first step, I trained a logistic regression and, by changing the cutoff probability, managed to predict most of the “one” responses correctly, but a considerable number of “zero” responses were incorrectly classified as “one”.
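For reference, a minimal sketch of that approach, assuming scikit-learn and a synthetic stand-in dataset (the cutoff value is purely illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# synthetic stand-in for the ~13,000 "zero" vs. ~100 "one" dataset
X, y = make_classification(n_samples=13100, weights=[0.992], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# lower the cutoff from the default 0.5 to catch more "one" responses;
# 0.1 is illustrative and should be tuned on a validation set
cutoff = 0.1
y_pred = (clf.predict_proba(X_te)[:, 1] >= cutoff).astype(int)
```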

So I would like to know: what are good algorithms that can properly handle imbalanced datasets?

Thanks,

P.S. I’m looking for algorithms that are available in scikit-learn or as an R package.

Upul
  • Check this question: http://stats.stackexchange.com/questions/131255/class-imbalance-in-supervised-machine-learning/133385 – Lucas Gallindo Jan 28 '15 at 11:47
  • 2
    Many algorithms support imbalanced data sets using weighting. For example, SVM (e.g. http://scikit-learn.org/stable/auto_examples/svm/plot_separating_hyperplane_unbalanced.html) and random forest (e.g. http://stackoverflow.com/questions/20082674/unbalanced-classification-using-randomforestclassifier-in-sklearn). – etov Jan 28 '15 at 11:49
  • There are many reviews on this topic, including [this one](http://dl.acm.org/citation.cfm?id=2907070). In general, my experience has been that ensemble methods with minority-class oversampling perform well, but there is no free lunch. – Krrr Sep 01 '17 at 05:59
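A minimal sketch of the class weighting mentioned in the comments, assuming scikit-learn's `class_weight` parameter and a synthetic stand-in dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# synthetic stand-in for an imbalanced dataset
X, y = make_classification(n_samples=13100, weights=[0.992], random_state=0)

# 'balanced' reweights each class inversely to its frequency in y,
# so errors on the rare "one" class are penalized more heavily
svm = SVC(kernel="linear", class_weight="balanced").fit(X, y)
rf = RandomForestClassifier(n_estimators=200, class_weight="balanced").fit(X, y)
```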

2 Answers

I recommend re-sampling techniques to balance the training dataset. They can be divided into four categories: undersampling the majority class, oversampling the minority class, combining over- and undersampling, and creating an ensemble of balanced datasets.

The above methods, and more, are implemented in the imbalanced-learn library in Python, which interfaces with scikit-learn. I recommend trying a combined method such as SMOTE + Tomek links to see whether classification accuracy improves on a balanced dataset.

See the IPython notebook for an example.
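A minimal sketch of that combined method, assuming imbalanced-learn's `SMOTETomek` and a synthetic stand-in dataset:

```python
from collections import Counter

from imblearn.combine import SMOTETomek
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=13100, weights=[0.992], random_state=0)
print(Counter(y))            # heavily imbalanced, roughly 13,000 vs. 100

# SMOTE oversamples the minority class with synthetic k-NN interpolations;
# Tomek links then remove overlapping majority/minority pairs
smt = SMOTETomek(random_state=0)
X_res, y_res = smt.fit_resample(X, y)
print(Counter(y_res))        # roughly balanced after resampling
```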

Vadim Smolyakov
  • Why do you recommend these? What problem do they solve? – Matthew Drury Aug 01 '17 at 00:12
  • Most classification algorithms perform best when the number of samples in each class is roughly the same, so one can use re-sampling to arrive at a more accurate decision boundary. E.g. SMOTE generates synthetic minority samples using k-NN, while Tomek links remove unwanted overlap between classes. See http://ieeexplore.ieee.org/document/5128907/ for a review. – Vadim Smolyakov Aug 01 '17 at 17:44
  • I don't believe that is true, and it is not the consensus in the questions on this site: https://stats.stackexchange.com/questions/283170/when-is-unbalanced-data-really-a-problem-in-machine-learning – Matthew Drury Aug 01 '17 at 20:06
  • Yeah, I found little discussion of which algorithms are affected most by imbalanced datasets. I can imagine imbalanced data being a problem for a simple online learning algorithm like the perceptron, where the order of points matters in updating the classification boundary: the boundary will look different if the classes are roughly balanced than if the algorithm mostly sees majority-class labels. Also, in decision trees, with imbalanced data it's easy to achieve high accuracy just by always predicting the majority class label. – Vadim Smolyakov Aug 01 '17 at 21:08
  • @MatthewDrury While the imbalance may not be a problem in theory, it makes inference harder technically/numerically. For one, it distorts "conventional" classification performance metrics like AUROC, accuracy, etc., so one needs to reach for something "exotic" like F1, the Jaccard index, Kappa, or the like. Also, during model selection with cross-validation or bootstrapping, some splits may miss the minority class entirely if the imbalance is heavy enough. – ayorgo Sep 21 '19 at 17:48
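To make the metrics point in the last comment concrete, a minimal sketch assuming scikit-learn's metrics module (the labels and predictions below are purely hypothetical):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, f1_score, jaccard_score

# hypothetical labels and predictions on a small imbalanced test set:
# 95 true zeros (2 predicted as one), 5 true ones (2 caught, 3 missed)
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.array([0] * 93 + [1, 1] + [1, 1, 0, 0, 0])

print(f1_score(y_true, y_pred))           # harmonic mean of precision and recall
print(jaccard_score(y_true, y_pred))      # overlap of predicted and true "one"s
print(cohen_kappa_score(y_true, y_pred))  # agreement corrected for chance
```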

I would use stratified sampling methods (e.g. http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators-with-stratification-based-on-class-labels). Sorry for not being able to give a precise strategy or an explanation of what exactly these algorithms do in the background, but I hope this helps nonetheless.
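A minimal sketch of stratified splitting, assuming scikit-learn's `StratifiedKFold` (the data is a synthetic stand-in):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.random.rand(100, 5)        # hypothetical feature matrix
y = np.array([0] * 95 + [1] * 5)  # heavily imbalanced labels

# each fold preserves the overall ~95:5 class ratio, so no split
# is left without minority-class examples
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    print(np.bincount(y[train_idx]), np.bincount(y[test_idx]))
```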