Questions tagged [smote]

SMOTE stands for "Synthetic Minority Over-sampling Technique". It is a method for dealing with imbalanced data by oversampling the minority class in a typical classification problem.

SMOTE assumes the feature space of the minority class is continuous. To oversample, take a sample from the minority class and find its k nearest neighbors in feature space. To create a synthetic data point, take the vector from the current data point to one of those k neighbors, multiply it by a random number x between 0 and 1, and add the result to the current data point. The synthetic point therefore lies somewhere on the line segment between the sample and its chosen neighbor.
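A minimal sketch of this interpolation step, assuming NumPy and scikit-learn's NearestNeighbors; the array minority and the function name are illustrative:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def smote_sample(minority, k=5, rng=None):
        """Generate one synthetic point from an (n, d) array of minority samples."""
        if rng is None:
            rng = np.random.default_rng()
        # Ask for k+1 neighbors because each point is its own nearest neighbor.
        nn = NearestNeighbors(n_neighbors=k + 1).fit(minority)
        i = rng.integers(len(minority))              # pick a minority point
        _, idx = nn.kneighbors(minority[i].reshape(1, -1))
        j = rng.choice(idx[0][1:])                   # one of its k neighbors
        x = rng.random()                             # random number in [0, 1)
        return minority[i] + x * (minority[j] - minority[i])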

54 questions
13 votes, 1 answer

ROSE and SMOTE oversampling methods

Can somebody give me a brief explanation of the differences between these two resampling methods: ROSE and SMOTE?
Martin
6 votes, 1 answer

Train on balanced datasets, use on imbalanced datasets?

We usually train a model using balanced datasets. Even when we do not have a balanced dataset, we use methods such as SMOTE to create one for training. The question is: how reliable will the trained model be when it is…
Stuart Peterson
5 votes, 1 answer

Python / Keras: SMOTE and validation_split

I am trying to train an MLP with an imbalanced dataset. I'd like to use SMOTE to balance my classes; as highlighted here (https://beckernick.github.io/oversampling-modeling/), class rebalancing should always be done after splitting into train / test…
Requin
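The ordering recommended in that post can be sketched as follows, assuming scikit-learn and imbalanced-learn; the data here is synthetic and only illustrative:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from imblearn.over_sampling import SMOTE

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

    # Split first, then oversample the training portion only, so the test
    # set keeps its true class ratio.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)
    X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)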
5 votes, 1 answer

SMOTE data balancing - before or during cross-validation?

I'm using Random Forest in the caret package to predict a binary outcome with a 1/10 class ratio, so I need to balance the dataset. I know two ways: use SMOTE as a stand-alone function and then pass the result to training, or use sampling='smote' inside caret's…
Riddle-Master
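The question is about caret in R, but the underlying point carries over; a minimal Python sketch of the second approach, using an imbalanced-learn Pipeline so that SMOTE is applied inside each cross-validation fold (data and parameters are illustrative):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from imblearn.pipeline import Pipeline
    from imblearn.over_sampling import SMOTE

    X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

    # SMOTE runs inside each fold: only that fold's training split is resampled.
    pipe = Pipeline([("smote", SMOTE(random_state=0)),
                     ("rf", RandomForestClassifier(random_state=0))])
    print(cross_val_score(pipe, X, y, cv=5, scoring="f1").mean())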
4 votes, 0 answers

PCA, SMOTE and cross-validation - how to combine them?

I have been reading a lot recently about PCA and cross-validation, and it seems that the majority call it malpractice to do PCA before cross-validation. I would also like to perform SMOTE, but opinion is split between those who perform SMOTE before or…
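One common resolution is to put both steps inside the cross-validation loop, so neither PCA nor SMOTE ever sees held-out data; a minimal sketch assuming scikit-learn and imbalanced-learn (the step order shown is one reasonable choice, not the only one):

    from sklearn.datasets import make_classification
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from imblearn.pipeline import Pipeline
    from imblearn.over_sampling import SMOTE

    X, y = make_classification(n_samples=2000, n_features=30,
                               weights=[0.9, 0.1], random_state=0)

    # Both PCA and SMOTE are re-fit on the training portion of every fold.
    pipe = Pipeline([("pca", PCA(n_components=10)),
                     ("smote", SMOTE(random_state=0)),
                     ("clf", LogisticRegression(max_iter=1000))])
    print(cross_val_score(pipe, X, y, cv=5, scoring="f1").mean())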
3 votes, 2 answers

Run time of SMOTE function in package DMwR

I have a dataframe with 930,000 rows and 220 variables. The objective is binary classification, but my response classes are imbalanced (88% / 12%). I want to use SMOTE to artificially create observations for the rare event, but the function takes…
LeGossler
3 votes, 3 answers

Running XGBoost with *highly* imbalanced data returns near 0% true positive rate. Tried SMOTE and it did not improve much. What else can I do?

I'm using XGBoost on a dataset of ~2.8M records of hard drive failures, where fewer than 200 are tagged as failures. After cleaning, there are 11 features in this dataset. Below is my R code, as well as a link to the dataset I uploaded to my S3…
Ray
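One answer often suggested for this degree of imbalance is XGBoost's built-in class weighting rather than resampling; a minimal sketch assuming the xgboost Python package (the question itself uses R, and the data here is synthetic):

    from sklearn.datasets import make_classification
    from xgboost import XGBClassifier

    X, y = make_classification(n_samples=10000, weights=[0.99, 0.01],
                               random_state=0)

    # scale_pos_weight ~ (#negatives / #positives) upweights the rare class.
    ratio = float((y == 0).sum()) / max((y == 1).sum(), 1)
    clf = XGBClassifier(scale_pos_weight=ratio, eval_metric="aucpr")
    clf.fit(X, y)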
3 votes, 0 answers

Should SMOTE oversampling be done before or after holdout validation's training/testing split?

Originally, without SMOTE, my ML steps go like this: feature vectorization; split the data into X_train, X_test, y_train, and y_test; use X_train and y_train for machine learning; predict/test on X_test and y_test. I think there are two spots…
2 votes, 1 answer

Can oversampling be moved outside stratified k-fold CV?

In a binary classification task, I am using imbalanced-learn's implementation of SMOTENC to oversample the positive class of a very imbalanced dataset. The total number of examples is very high, so the oversampling takes quite a while. I would…
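A minimal sketch of SMOTENC, whose categorical_features argument marks the columns that should not be interpolated (the data and column index here are illustrative):

    import numpy as np
    from imblearn.over_sampling import SMOTENC

    rng = np.random.default_rng(0)
    # Two continuous columns plus one categorical column at index 2.
    X = np.column_stack([rng.normal(size=200),
                         rng.normal(size=200),
                         rng.integers(0, 3, size=200)])
    y = np.r_[np.zeros(180), np.ones(20)]   # 10% minority class

    X_res, y_res = SMOTENC(categorical_features=[2],
                           random_state=0).fit_resample(X, y)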
2 votes, 1 answer

Oversampling methods for numerical data (regression)

There are many oversampling methods for categorical labels (for example SMOTE, ROSE, etc.). But are there oversampling methods for numerical labels (the target I want to predict with my features), in the sense that they apply something…
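One published answer is SMOTE for regression (SMOTER, Torgo et al.), which interpolates the target value along with the features. A hand-rolled sketch of that idea, assuming NumPy and scikit-learn; X_rare and y_rare are hypothetical arrays of "rare" examples:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def smoter_sample(X_rare, y_rare, k=5, rng=None):
        if rng is None:
            rng = np.random.default_rng()
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X_rare)
        i = rng.integers(len(X_rare))
        _, idx = nn.kneighbors(X_rare[i].reshape(1, -1))
        j = rng.choice(idx[0][1:])
        gap = rng.random()
        x_new = X_rare[i] + gap * (X_rare[j] - X_rare[i])
        y_new = y_rare[i] + gap * (y_rare[j] - y_rare[i])  # interpolated target
        return x_new, y_new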
2 votes, 3 answers

Binary Classification in Imbalanced Data; Oversampling and Imputation

Together with two friends, I am taking part in a university course on data mining in R, and we chose the topic of bankruptcy prediction. We started with some "clean" data found in an "in class" Kaggle competition, and compared to the leaderboard our…
1 vote, 1 answer

SMOTE algorithm

When our dataset has 5 or more attributes, how does the SMOTE algorithm produce a new sample? How is the Euclidean distance calculated with 5 or more attributes?
user346917
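Euclidean distance extends to any number of attributes: d(a, b) = sqrt(sum over i of (a_i - b_i)^2). A quick check with two 5-dimensional points, assuming NumPy:

    import numpy as np

    a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    b = np.array([2.0, 2.0, 1.0, 4.0, 7.0])

    # sqrt(1 + 0 + 4 + 0 + 4) = sqrt(9) = 3.0
    print(np.sqrt(np.sum((a - b) ** 2)))   # 3.0
    print(np.linalg.norm(a - b))           # same result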
1 vote, 1 answer

Getting different results when running SMOTE

I have this code, which runs SMOTE and then computes roc_auc_score. The issue is that every time I run the code on the same dataset, I get different results. How can I fix this? I need the same sample and the same results when running my code. The ROC…
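The usual fix is to pin every source of randomness with a fixed random_state; a minimal sketch assuming scikit-learn and imbalanced-learn (synthetic data for illustration):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split
    from imblearn.over_sampling import SMOTE

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

    # Fix random_state wherever randomness enters: the split, SMOTE, the model.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)
    X_res, y_res = SMOTE(random_state=42).fit_resample(X_tr, y_tr)
    clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
    print(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))  # now reproducible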
1 vote, 0 answers

Designing an experiment to compare how multiple SMOTE variants affect multiple classification models on multiple datasets

For a university paper, I want to test the hypothesis that one particular SMOTE variant outperforms two other SMOTE variants. By 'outperforms' I mean a higher F1 measure. I want to test this using multiple datasets (say, 20 datasets)…
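A skeleton of such an experiment, assuming imbalanced-learn and scikit-learn; the three variants, two models, and tiny synthetic datasets below are placeholders for the paper's actual choices:

    from itertools import product
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_score
    from imblearn.pipeline import Pipeline
    from imblearn.over_sampling import SMOTE, BorderlineSMOTE, SVMSMOTE

    datasets = {f"ds{i}": make_classification(n_samples=500, weights=[0.9, 0.1],
                                              random_state=i) for i in range(3)}
    variants = {"smote": SMOTE(random_state=0),
                "borderline": BorderlineSMOTE(random_state=0),
                "svmsmote": SVMSMOTE(random_state=0)}
    models = {"logreg": LogisticRegression(max_iter=1000),
              "tree": DecisionTreeClassifier(random_state=0)}

    # One cross-validated F1 score per (dataset, variant, model) cell; the
    # hypothesis test is then run over these scores.
    for (dn, (X, y)), (vn, v), (mn, m) in product(
            datasets.items(), variants.items(), models.items()):
        pipe = Pipeline([("sampler", v), ("model", m)])
        f1 = cross_val_score(pipe, X, y, cv=5, scoring="f1").mean()
        print(dn, vn, mn, round(f1, 3))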
1 vote, 1 answer

A question about logistic regression classifier performance (with and without resampling)

I am working on a dataset with 20 independent variables and 41,188 instances. The task is binary classification, where the target variable has 36,548 no's and 4,640 yes's. I have used a logistic regression model with 10 folds of cross…
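Before resampling, it is worth comparing against logistic regression's built-in class weighting; a minimal sketch assuming scikit-learn, with synthetic data at roughly the question's class ratio:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=5000, n_features=20,
                               weights=[0.89, 0.11], random_state=0)

    # Plain vs class-weighted logistic regression under 10-fold CV.
    for cw in (None, "balanced"):
        clf = LogisticRegression(max_iter=1000, class_weight=cw)
        f1 = cross_val_score(clf, X, y, cv=10, scoring="f1").mean()
        print(cw, round(f1, 3))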