Questions tagged [oversampling]

Sampling cases with differential probability, so that classes that occur rarely in the population occur more often in the training data. Does *not* address the problems caused by unbalanced classes.

Unbalanced classes do pose problems, but contrary to common misunderstandings, these are merely due to low sample size (high variance of the predictors), not the unbalancedness per se. As such, oversampling will not help.

See "Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?" and links there.
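The tag description's central point, that duplicating minority cases adds no new information, can be illustrated with a short stdlib-only Python sketch (the dataset and all counts are hypothetical):

```python
import random

random.seed(0)

# Toy unbalanced dataset (hypothetical numbers): 95 negatives, 5 positives.
data = [(random.gauss(0, 1), 0) for _ in range(95)] + \
       [(random.gauss(1, 1), 1) for _ in range(5)]

# Random oversampling: duplicate minority rows with replacement until balanced.
minority = [row for row in data if row[1] == 1]
oversampled = data + random.choices(minority, k=90)

n_neg = sum(1 for _, y in oversampled if y == 0)
n_pos = sum(1 for _, y in oversampled if y == 1)
distinct_pos = len({row for row in oversampled if row[1] == 1})

# The classes are now balanced, but every added row is an exact copy of
# one of the original 5 minority cases: no information was gained.
print(n_neg, n_pos, distinct_pos)   # 95 95 5
```

Class weights, or simply a decision threshold chosen for the actual costs, achieve the same reweighting without pretending to have more data.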

94 questions
66
votes
1 answer

Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?

TL;DR See title. Motivation I am hoping for a canonical answer along the lines of "(1) No, (2) Not applicable, because (1)", which we can use to close many wrong questions about unbalanced datasets and oversampling. I would be quite as happy to be…
Stephan Kolassa
  • 95,027
  • 13
  • 197
  • 357
26
votes
2 answers

Testing Classification on Oversampled Imbalance Data

I am working on severely imbalanced data. In literature, several methods are used to re-balance the data using re-sampling (over- or under-sampling). Two good approaches are: SMOTE: Synthetic Minority Over-sampling TEchnique (SMOTE) ADASYN:…
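For readers unfamiliar with SMOTE, its core step can be sketched in a few lines of stdlib-only Python (simplified: real SMOTE interpolates toward one of the k nearest minority neighbours, not only the single nearest; the sample points here are made up):

```python
import random

random.seed(1)

# Hypothetical 2-D minority-class samples.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

def smote_point(points):
    """One synthetic sample: pick a minority point, find its nearest minority
    neighbour, and interpolate a new point somewhere on the segment between
    them. (Real SMOTE picks among the k nearest neighbours, not just one.)"""
    a = random.choice(points)
    b = min((p for p in points if p != a),
            key=lambda p: (p[0] - a[0]) ** 2 + (p[1] - a[1]) ** 2)
    t = random.random()
    return (a[0] + t * (b[0] - a[0]), a[1] + t * (b[1] - a[1]))

synthetic = [smote_point(minority) for _ in range(3)]
print(synthetic)   # three new points inside the unit square
```

ADASYN follows the same interpolation idea but generates more synthetic points near minority cases that are hard to classify.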
22
votes
1 answer

Opinions about Oversampling in general, and the SMOTE algorithm in particular

What is your opinion about oversampling in classification in general, and the SMOTE algorithm in particular? Why would we not just apply a cost/penalty to adjust for imbalance in class data and any unbalanced cost of errors? For my purposes,…
Dave Cummins
  • 229
  • 1
  • 2
  • 3
13
votes
1 answer

ROSE and SMOTE oversampling methods

Can somebody give me a brief explanation of the differences between these two resampling methods: ROSE and SMOTE?
Martin
  • 301
  • 1
  • 2
  • 8
12
votes
2 answers

Sampling with replacement in R randomForest

The randomForest implementation does not allow sampling beyond the number of observations, even when sampling with replacement. Why is this? Works fine: rf <- randomForest(Species ~ ., iris, sampsize=c(1, 1, 1), replace=TRUE) rf <-…
cohoz
  • 618
  • 5
  • 16
11
votes
1 answer

Oversampling with categorical variables

I would like to perform a combination of oversampling and undersampling in order to balance my dataset of roughly 4000 customers divided into two groups, where one of the groups has a proportion of roughly 15%. I've looked into SMOTE…
pir
  • 4,626
  • 10
  • 38
  • 73
10
votes
1 answer

SMOTE throws error for multi class imbalance problem

I am trying to use SMOTE to correct imbalance in my multi-class classification problem. Although SMOTE works perfectly on the iris dataset as per the SMOTE help document, it does not work on a similar dataset. Here is how my data looks. Note it has…
tan
  • 527
  • 2
  • 5
  • 13
9
votes
1 answer

Normalization/standardization: Should one do this before oversampling/undersampling the data or after?

When working with imbalanced datasets, should one do one-hot encoding and data standardization before or after sampling techniques (such as oversampling or undersampling)?
9
votes
1 answer

Oversampling: whole set or training set

I have a rather small dataset of 4,000 points (140 features) to feed to a NN binary classifier. The problem is that only ~700 of them represent the second class. Is it more common to resample the whole data set and then split, or to first split and then…
9
votes
0 answers

After oversampling/undersampling is it always appropriate to adjust probabilities using the odds ratio regardless of the sampling method used?

I have an imbalanced dataset where the target class is <1% of the sample. I apply oversampling or undersampling, e.g. https://github.com/scikit-learn-contrib/imbalanced-learn. I run random forest on the resampled data. I adjust probabilities back to the…
simon
  • 349
  • 1
  • 9
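The standard correction asked about in the question above, for the case of undersampling the negatives, can be written as a small Python function (the function name and argument names are illustrative):

```python
def adjust_probability(p_resampled, neg_keep_rate):
    """Map a probability estimated on negative-undersampled data back to the
    original scale.

    p_resampled   -- probability from a model trained on data where all
                     positives were kept but only a fraction `neg_keep_rate`
                     of the negatives was retained.
    Undersampling negatives by a factor beta multiplies the odds of the
    positive class by 1/beta, so we multiply the fitted odds by beta
    to undo it.
    """
    odds = p_resampled / (1 - p_resampled)
    adjusted_odds = odds * neg_keep_rate
    return adjusted_odds / (1 + adjusted_odds)

# A model trained on rebalanced data predicts 0.5; if only 10% of the
# negatives were kept, the implied probability on the original scale is 1/11.
print(adjust_probability(0.5, 0.1))
```

This correction is exact for a well-specified logistic regression, where it amounts to an intercept shift; for models such as random forests, as in the question, it is only a heuristic.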
7
votes
1 answer

Oversampling in logistic regression

I was trying to find out whether oversampling can really make a model better. This blog page says that it can improve a decision tree, but that it shouldn't improve a logistic regression. Quotation below: Standard statistical techniques are…
Tomek Tarczynski
  • 3,854
  • 7
  • 29
  • 37
6
votes
1 answer

Oversampling correction for multinomial logistic regression

When modeling rare events with logistic regression, oversampling is a common method to reduce computational complexity (i.e., keep all the rare positive cases but only a subsample of negative cases). After model fitting, adding an offset to the…
6
votes
1 answer

Classification of Huge number of classes

I have a dataset of samples belonging to >100 classes. I want to classify and/or cluster these classes. I have the following questions: 1) Is one classifier efficient for such a problem, or should there be one classifier per class or per subset of classes? (From my point…
Abbas
  • 485
  • 1
  • 4
  • 12
6
votes
1 answer

During oversampling of rare events, why are the beta coefficients of the independent variables not affected, but only the intercept?

I have followed the King and Zeng paper and understand the consistency of the prior correction after oversampling in logistic regression. But I am trying to understand why the beta coefficients of the independent variables are not affected by the…
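The core of the King and Zeng result can be checked numerically: under case-control sampling, the log-odds shift induced by keeping all positives but only a fraction of the negatives is the same constant at every x, so only the intercept absorbs it and the slopes are untouched. A stdlib-only sketch with a made-up population model:

```python
import math

b0, b1 = -4.0, 1.5   # made-up population model: logit P(y=1|x) = b0 + b1*x
beta = 0.05          # fraction of negatives retained by the sampling design

def pop_p(x):
    """P(y=1 | x) in the original population."""
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

shifts = []
for x in (-1.0, 0.0, 2.0, 5.0):
    p = pop_p(x)
    # Bayes' rule: keeping all positives and a fraction beta of negatives
    # changes P(y=1 | x) from p to p / (p + beta*(1-p)).
    p_s = p / (p + beta * (1 - p))
    shifts.append(math.log(p_s / (1 - p_s)) - math.log(p / (1 - p)))

print(shifts)   # ln(1/beta) ~= 2.996 at every x: the slope b1 never changes
```

Because the shift is constant in x, subtracting ln(1/beta) from the fitted intercept recovers the population model, which is exactly the prior correction.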
5
votes
1 answer

Problem with classifier after using SMOTE to balance the data

We've run into a problem while training a classifier on an unbalanced data set. The response is binary, with 0 indicating 'non-defaulter' and 1 indicating 'defaulter' (it's a credit scoring task). The defaulters account for only 0.47 % (233…