Opinions about Oversampling in general, and the SMOTE algorithm in particular

Question

What is your opinion about oversampling in classification in general, and the SMOTE algorithm in particular? Why would we not just apply a cost/penalty to adjust for imbalance in class data and any unbalanced cost of errors? For my purposes, accuracy of prediction to a future set of experimental units is the ultimate measure.

For reference, the SMOTE paper: http://www.jair.org/papers/paper953.html

One problem with oversampling a minority class in an imbalanced dataset is that you end up learning too much of the specifics of the few examples, and that won't generalize well. SMOTE is supposed to learn the topological properties of the neighborhood of those points in the minority class, so you are less likely to overfit. — horaceT, Sep 08 '16 at 16:44
This is a great topic for a question, but could you make it a bit more focused? "What is your opinion?" invites endless discussion but we tend to focus more sharply on a question/answer format. — Sycorax, Sep 08 '16 at 17:07

Franck Dernoncourt · Answer 1 · 2020-12-27T21:48:46.557

{1} gives a list of advantages and disadvantages of cost-sensitive learning vs. sampling:

2.2 Sampling

Oversampling and undersampling can be used to alter the class distribution of the training data and both methods have been used to deal with class imbalance [1, 2, 3, 6, 10, 11]. The reason that altering the class distribution of the training data aids learning with highly-skewed data sets is that it effectively imposes non-uniform misclassification costs. For example, if one alters the class distribution of the training set so that the ratio of positive to negative examples goes from 1:1 to 2:1, then one has effectively assigned a misclassification cost ratio of 2:1. This equivalency between altering the class distribution of the training data and altering the misclassification cost ratio is well known and was formally described by Elkan [9].
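To make the equivalence in the quoted passage concrete, here is a minimal sketch (not from the paper). It assumes a learner that minimizes a sum of per-example losses, here an unregularized logistic loss, and uses made-up toy data and arbitrary, unfitted parameters. Under those assumptions, duplicating every positive example once gives the same training objective as keeping the data unchanged and weighting positives by 2.

```python
import math

def total_log_loss(params, X, y, sample_weight=None):
    """Sum of (optionally weighted) logistic losses for a linear model."""
    if sample_weight is None:
        sample_weight = [1.0] * len(X)
    total = 0.0
    for x_i, y_i, w_i in zip(X, y, sample_weight):
        z = sum(p * v for p, v in zip(params, x_i))
        prob = 1.0 / (1.0 + math.exp(-z))
        total += -w_i * (y_i * math.log(prob) + (1 - y_i) * math.log(1.0 - prob))
    return total

# Hypothetical toy data: three negatives, one positive (first column is a bias term).
X = [[1.0, 0.2], [1.0, -0.5], [1.0, 0.1], [1.0, 1.5]]
y = [0, 0, 0, 1]
params = [-0.3, 0.8]  # arbitrary fixed parameters, not fitted

# Option A: oversample by duplicating each positive example once.
X_dup = X + [x for x, label in zip(X, y) if label == 1]
y_dup = y + [label for label in y if label == 1]
loss_oversampled = total_log_loss(params, X_dup, y_dup)

# Option B: keep the data as-is and give positives a cost weight of 2.
cost = [2.0 if label == 1 else 1.0 for label in y]
loss_weighted = total_log_loss(params, X, y, sample_weight=cost)

# The two objectives agree up to floating-point rounding.
print(loss_oversampled, loss_weighted)
```

Because the two objectives agree at any parameter values, they share the same minimizer. In practice the match can break down, for example with regularization terms that are not rescaled, stochastic training, or learners such as decision trees whose stopping rules depend on raw example counts.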

There are known disadvantages associated with the use of sampling to implement cost-sensitive learning. The disadvantage with undersampling is that it discards potentially useful data. The main disadvantage with oversampling, from our perspective, is that by making exact copies of existing examples, it makes overfitting likely. In fact, with oversampling it is quite common for a learner to generate a classification rule to cover a single, replicated, example. A second disadvantage of oversampling is that it increases the number of training examples, thus increasing the learning time.

2.3 Why Use Sampling?

Given the disadvantages with sampling, it is worth asking why anyone would use it rather than a cost-sensitive learning algorithm for dealing with data with a skewed class distribution and non-uniform misclassification costs. There are several reasons for this. The most obvious reason is there are not cost-sensitive implementations of all learning algorithms and therefore a wrapper-based approach using sampling is the only option. While this is certainly less true today than in the past, many learning algorithms (e.g., C4.5) still do not directly handle costs in the learning process.

A second reason for using sampling is that many highly skewed data sets are enormous and the size of the training set must be reduced in order for learning to be feasible. In this case, undersampling seems to be a reasonable, and valid, strategy. In this paper we do not consider the need to reduce the training set size. We would point out, however, that if one needs to discard some training data, it still might be beneficial to discard some of the majority class examples in order to reduce the training set size to the required size, and then also employ a cost-sensitive learning algorithm, so that the amount of discarded training data is minimized.

A final reason that may have contributed to the use of sampling rather than a cost-sensitive learning algorithm is that misclassification costs are often unknown. However, this is not a valid reason for using sampling over a cost-sensitive learning algorithm, since the analogous issue arises with sampling—what should the class distribution of the final training data be? If this cost information is not known, a measure such as the area under the ROC curve could be used to measure classifier performance and both approaches could then empirically determine the proper cost ratio/class distribution.
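The empirical selection suggested in the last sentence of the quote could look roughly like the sketch below (again, not from the paper). It computes the area under the ROC curve as the fraction of positive/negative pairs that are ranked correctly, then picks the candidate cost ratio with the highest value. The validation labels and scores are invented for illustration; in a real experiment each score list would come from a model trained with that cost ratio (or the matching class distribution) and evaluated on held-out data.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win (the Mann-Whitney formulation).
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical validation labels and model scores, keyed by candidate cost ratio.
val_labels = [0, 0, 0, 0, 1, 1]
val_scores_by_ratio = {
    1: [0.10, 0.40, 0.35, 0.80, 0.30, 0.90],
    5: [0.20, 0.30, 0.40, 0.60, 0.70, 0.90],
}

auc_by_ratio = {
    ratio: roc_auc(scores, val_labels)
    for ratio, scores in val_scores_by_ratio.items()
}
best_ratio = max(auc_by_ratio, key=auc_by_ratio.get)
print(auc_by_ratio, best_ratio)
```

The pairwise loop is quadratic in the number of examples, which is fine for a sketch but would normally be replaced by a rank-based computation on larger validation sets.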

They also ran a series of experiments, which were inconclusive:

Based on the results from all of the data sets, there is no definitive winner between cost-sensitive learning, oversampling and undersampling

They then try to identify which characteristics of a dataset may indicate which technique is better suited to it.

They also remark that SMOTE may bring some enhancements:

There are a variety of enhancements that people have made to improve the effectiveness of sampling. Some of these enhancements include introducing new “synthetic” examples when oversampling [5 -> SMOTE], deleting less useful majority-class examples when undersampling [11] and using multiple sub-samples when undersampling such that each example is used in at least one sub-sample [3]. While these techniques have been compared to oversampling and undersampling, they generally have not been compared to cost-sensitive learning algorithms. This would be worth studying in the future.
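For readers unfamiliar with what "synthetic" examples means here, the following is a simplified sketch of the interpolation idea behind SMOTE, not the reference implementation. It assumes numeric features and Euclidean distance, and it picks the base point at random rather than looping over every minority example as the original algorithm does. The function name `smote_sketch` and the toy points are hypothetical.

```python
import random

def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority examples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        # Place the new point at a random position on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Hypothetical minority-class points (corners of the unit square).
minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sketch(minority_points, n_new=5)
print(new_points)
```

Unlike exact duplication, each new point lies between two existing minority examples, which is the property that is meant to reduce the overfitting risk described earlier in the quoted text.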

{2} is also worth reading:

In this study, we systematically investigate the impact of class imbalance on the classification performance of convolutional neural networks (CNNs) and compare frequently used methods to address the issue. Class imbalance is a common problem that has been comprehensively studied in classical machine learning, yet very limited systematic research is available in the context of deep learning.

References:

{1} Weiss, Gary M., Kate McCarthy, and Bibi Zabar. "Cost-sensitive learning vs. sampling: Which is best for handling unbalanced classes with unequal error costs?" DMIN 7 (2007): 35-41. https://scholar.google.com/scholar?cluster=10779872536070567255&hl=en&as_sdt=0,22 ; https://pdfs.semanticscholar.org/9908/404807bf6b63e05e5345f02bcb23cc739ebd.pdf
{2} Buda, Mateusz, Atsuto Maki, and Maciej A. Mazurowski. "A systematic study of the class imbalance problem in convolutional neural networks." Neural Networks 106 (2018): 249-259. https://arxiv.org/abs/1710.05381

When you say "cost sensitive learning algorithm", should my brain think "penalize classes with high frequencies of occurrence and possibly assign more importance to classes with low frequencies"? Is this concept equivalent to assigning class weights? — Jarad, Apr 20 '18 at 20:10
