Questions tagged [oversampling]

Sampling cases with differential probability, so that classes that occur rarely in the population occur more often in the training data. Does *not* address the problems caused by unbalanced classes.

Unbalanced classes do pose problems, but contrary to common misunderstandings, these are merely due to low sample size (high variance of the predictors), not the unbalancedness per se. As such, oversampling will not help.

See "Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?" and links there.
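The tag description's central point, that duplicating minority cases adds no new information, can be illustrated with a short stdlib-only Python sketch (the dataset and all counts are hypothetical):

```python
import random

random.seed(0)

# Toy unbalanced dataset (hypothetical numbers): 95 negatives, 5 positives.
data = [(random.gauss(0, 1), 0) for _ in range(95)] + \
       [(random.gauss(1, 1), 1) for _ in range(5)]

# Random oversampling: duplicate minority rows with replacement until balanced.
minority = [row for row in data if row[1] == 1]
oversampled = data + random.choices(minority, k=90)

n_neg = sum(1 for _, y in oversampled if y == 0)
n_pos = sum(1 for _, y in oversampled if y == 1)
distinct_pos = len({row for row in oversampled if row[1] == 1})

# The classes are now balanced, but every added row is an exact copy of
# one of the original 5 minority cases: no information was gained.
print(n_neg, n_pos, distinct_pos)   # 95 95 5
```

Class weights, or simply a decision threshold chosen for the actual costs, achieve the same reweighting without pretending to have more data.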

94 questions
66
votes
1 answer

Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?

TL;DR See title. Motivation I am hoping for a canonical answer along the lines of "(1) No, (2) Not applicable, because (1)", which we can use to close many wrong questions about unbalanced datasets and oversampling. I would be quite as happy to be…
Stephan Kolassa
  • 95,027
  • 13
  • 197
  • 357
26
votes
2 answers

Testing Classification on Oversampled Imbalance Data

I am working on severely imbalanced data. In literature, several methods are used to re-balance the data using re-sampling (over- or under-sampling). Two good approaches are: SMOTE: Synthetic Minority Over-sampling TEchnique (SMOTE) ADASYN:…
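For readers unfamiliar with SMOTE, its core step can be sketched in a few lines of stdlib-only Python (simplified: real SMOTE interpolates toward one of the k nearest minority neighbours, not only the single nearest; the sample points here are made up):

```python
import random

random.seed(1)

# Hypothetical 2-D minority-class samples.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

def smote_point(points):
    """One synthetic sample: pick a minority point, find its nearest minority
    neighbour, and interpolate a new point somewhere on the segment between
    them. (Real SMOTE picks among the k nearest neighbours, not just one.)"""
    a = random.choice(points)
    b = min((p for p in points if p != a),
            key=lambda p: (p[0] - a[0]) ** 2 + (p[1] - a[1]) ** 2)
    t = random.random()
    return (a[0] + t * (b[0] - a[0]), a[1] + t * (b[1] - a[1]))

synthetic = [smote_point(minority) for _ in range(3)]
print(synthetic)   # three new points inside the unit square
```

ADASYN follows the same interpolation idea but generates more synthetic points near minority cases that are hard to classify.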
22
votes
1 answer

Opinions about Oversampling in general, and the SMOTE algorithm in particular

What is your opinion about oversampling in classification in general, and the SMOTE algorithm in particular? Why would we not just apply a cost/penalty to adjust for imbalance in class data and any unbalanced cost of errors? For my purposes,…
Dave Cummins
  • 229
  • 1
  • 2
  • 3
13
votes
1 answer

ROSE and SMOTE oversampling methods

Can somebody give me a brief explanation of the differences between these two resampling methods: ROSE and SMOTE?
Martin
  • 301
  • 1
  • 2
  • 8
12
votes
2 answers

Sampling with replacement in R randomForest

The randomForest implementation does not allow sampling beyond the number of observations, even when sampling with replacement. Why is this? Works fine: rf <- randomForest(Species ~ ., iris, sampsize=c(1, 1, 1), replace=TRUE) rf <-…
cohoz
  • 618
  • 5
  • 16
11
votes
1 answer

Oversampling with categorical variables

I would like to perform a combination of oversampling and undersampling in order to balance my dataset of roughly 4000 customers divided into two groups, where one of the groups has a proportion of roughly 15%. I've looked into SMOTE…
pir
  • 4,626
  • 10
  • 38
  • 73
10
votes
1 answer

SMOTE throws error for multi class imbalance problem

I am trying to use SMOTE to correct imbalance in my multi-class classification problem. Although SMOTE works perfectly on the iris dataset as per the SMOTE help document, it does not work on a similar dataset. Here is how my data looks. Note it has…
tan
  • 527
  • 2
  • 5
  • 13
9
votes
1 answer

Normalization/standardization: Should one do this before oversampling/undersampling the data or after?

When working with imbalanced datasets, should one do one-hot encoding and data standardization before or after sampling techniques (such as oversampling or undersampling)?
9
votes
1 answer

Oversampling: whole set or training set

I have a rather small dataset of 4,000 points (140 features) to feed to a NN binary classifier. The problem is that only ~700 of them represent the second class. Is it more common to resample the whole data set and then split, or to first split and then…
9
votes
0 answers

After oversampling/undersampling is it always appropriate to adjust probabilities using the odds ratio regardless of the sampling method used?

I have an imbalanced dataset where the target class is <1% of the sample. I apply oversampling or undersampling, e.g. https://github.com/scikit-learn-contrib/imbalanced-learn. I run random forest on the resampled data. I adjust probabilities back to the…
simon
  • 349
  • 1
  • 9
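The standard correction asked about in the question above, for the case of undersampling the negatives, can be written as a small Python function (the function name and argument names are illustrative):

```python
def adjust_probability(p_resampled, neg_keep_rate):
    """Map a probability estimated on negative-undersampled data back to the
    original scale.

    p_resampled   -- probability from a model trained on data where all
                     positives were kept but only a fraction `neg_keep_rate`
                     of the negatives was retained.
    Undersampling negatives by a factor beta multiplies the odds of the
    positive class by 1/beta, so we multiply the fitted odds by beta
    to undo it.
    """
    odds = p_resampled / (1 - p_resampled)
    adjusted_odds = odds * neg_keep_rate
    return adjusted_odds / (1 + adjusted_odds)

# A model trained on rebalanced data predicts 0.5; if only 10% of the
# negatives were kept, the implied probability on the original scale is 1/11.
print(adjust_probability(0.5, 0.1))
```

This correction is exact for a well-specified logistic regression, where it amounts to an intercept shift; for models such as random forests, as in the question, it is only a heuristic.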
7
votes
1 answer

Oversampling in logistic regression

I was trying to find out whether oversampling can really make a model better. This blog page says that it can improve a decision tree, but that it shouldn't improve a logistic regression. Quotation below: Standard statistical techniques are…
Tomek Tarczynski
  • 3,854
  • 7
  • 29
  • 37
6
votes
1 answer

Oversampling correction for multinomial logistic regression

When modeling rare events with logistic regression, oversampling is a common method to reduce computational complexity (i.e., keep all the rare positive cases but only a subsample of negative cases). After model fitting, adding an offset to the…
6
votes
1 answer

Classification of Huge number of classes

I have a dataset of samples belonging to >100 classes. I want to classify and/or cluster these classes. I have the following questions: 1) Is one classifier efficient for such a problem, or should there be one classifier per class or per subset of classes? (From my point…
Abbas
  • 485
  • 1
  • 4
  • 12
6
votes
1 answer

During oversampling of rare events, why are the beta coefficients of the independent variables not affected, but only the intercept?

I have followed the King and Zeng paper and understand the consistency of the prior correction after oversampling in logistic regression. But I am trying to understand why the beta coefficients of the independent variables are not affected by the…
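The core of the King and Zeng result can be checked numerically: under case-control sampling, the log-odds shift induced by keeping all positives but only a fraction of the negatives is the same constant at every x, so only the intercept absorbs it and the slopes are untouched. A stdlib-only sketch with a made-up population model:

```python
import math

b0, b1 = -4.0, 1.5   # made-up population model: logit P(y=1|x) = b0 + b1*x
beta = 0.05          # fraction of negatives retained by the sampling design

def pop_p(x):
    """P(y=1 | x) in the original population."""
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

shifts = []
for x in (-1.0, 0.0, 2.0, 5.0):
    p = pop_p(x)
    # Bayes' rule: keeping all positives and a fraction beta of negatives
    # changes P(y=1 | x) from p to p / (p + beta*(1-p)).
    p_s = p / (p + beta * (1 - p))
    shifts.append(math.log(p_s / (1 - p_s)) - math.log(p / (1 - p)))

print(shifts)   # ln(1/beta) ~= 2.996 at every x: the slope b1 never changes
```

Because the shift is constant in x, subtracting ln(1/beta) from the fitted intercept recovers the population model, which is exactly the prior correction.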
5
votes
1 answer

Problem with classifier after using SMOTE to balance the data

We've run into a problem while training a classifier on an unbalanced data set. The response is binary, with 0 indicating 'non-defaulter' and 1 indicating 'defaulter' (it's a credit scoring task). The defaulters account for only 0.47 % (233…