Questions tagged [smote]

SMOTE stands for "Synthetic Minority Over-sampling Technique". It is a method for dealing with imbalanced data by oversampling the minority class in a typical classification problem.

SMOTE assumes the feature space of the minority class is continuous. To oversample, take a sample from the minority class and find its k nearest neighbors in feature space. To create a synthetic data point, take the vector from the current data point to one of those k neighbors, multiply it by a random number x between 0 and 1, and add the result to the current data point. The synthetic point therefore lies somewhere on the line segment between the sample and its chosen neighbor.
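A minimal sketch of this interpolation step, assuming NumPy and scikit-learn's NearestNeighbors; the array minority and the function name are illustrative:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def smote_sample(minority, k=5, rng=None):
        """Generate one synthetic point from an (n, d) array of minority samples."""
        if rng is None:
            rng = np.random.default_rng()
        # Ask for k+1 neighbors because each point is its own nearest neighbor.
        nn = NearestNeighbors(n_neighbors=k + 1).fit(minority)
        i = rng.integers(len(minority))              # pick a minority point
        _, idx = nn.kneighbors(minority[i].reshape(1, -1))
        j = rng.choice(idx[0][1:])                   # one of its k neighbors
        x = rng.random()                             # random number in [0, 1)
        return minority[i] + x * (minority[j] - minority[i])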

54 questions
13 votes, 1 answer

ROSE and SMOTE oversampling methods

Can somebody give me a brief explanation of the differences between these two resampling methods: ROSE and SMOTE?
Martin
6 votes, 1 answer

Train on balanced datasets, use on imbalanced datasets?

We usually train a model using balanced datasets. Even when we do not have a balanced dataset, we use methods such as SMOTE to create one for training. The question is: how reliable will the trained model be when it is…
Stuart Peterson
5 votes, 1 answer

Python / Keras: SMOTE and validation_split

I am trying to train an MLP with an imbalanced dataset. I'd like to use SMOTE to balance my classes; as highlighted here (https://beckernick.github.io/oversampling-modeling/), class rebalancing should always be done after splitting into train / test…
Requin
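The ordering recommended in that post can be sketched as follows, assuming scikit-learn and imbalanced-learn; the data here is synthetic and only illustrative:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from imblearn.over_sampling import SMOTE

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

    # Split first, then oversample the training portion only, so the test
    # set keeps its true class ratio.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)
    X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)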
5 votes, 1 answer

SMOTE data balancing - before or during cross-validation?

I'm using Random Forest in the caret package to predict a binary outcome with a 1/10 class ratio, so I need to balance the dataset. I know two ways: use SMOTE as a stand-alone function and then pass the result to training, or use sampling='smote' inside caret's…
Riddle-Master
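The question is about caret in R, but the underlying point carries over; a minimal Python sketch of the second approach, using an imbalanced-learn Pipeline so that SMOTE is applied inside each cross-validation fold (data and parameters are illustrative):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from imblearn.pipeline import Pipeline
    from imblearn.over_sampling import SMOTE

    X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

    # SMOTE runs inside each fold: only that fold's training split is resampled.
    pipe = Pipeline([("smote", SMOTE(random_state=0)),
                     ("rf", RandomForestClassifier(random_state=0))])
    print(cross_val_score(pipe, X, y, cv=5, scoring="f1").mean())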
4 votes, 0 answers

PCA, SMOTE and cross-validation - how to combine them?

I have been reading a lot recently about PCA and cross-validation, and it seems that the majority call it malpractice to do PCA before cross-validation. I would also like to perform SMOTE, but opinion is split between those who perform SMOTE before or…
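One common resolution is to put both steps inside the cross-validation loop, so neither PCA nor SMOTE ever sees held-out data; a minimal sketch assuming scikit-learn and imbalanced-learn (the step order shown is one reasonable choice, not the only one):

    from sklearn.datasets import make_classification
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from imblearn.pipeline import Pipeline
    from imblearn.over_sampling import SMOTE

    X, y = make_classification(n_samples=2000, n_features=30,
                               weights=[0.9, 0.1], random_state=0)

    # Both PCA and SMOTE are re-fit on the training portion of every fold.
    pipe = Pipeline([("pca", PCA(n_components=10)),
                     ("smote", SMOTE(random_state=0)),
                     ("clf", LogisticRegression(max_iter=1000))])
    print(cross_val_score(pipe, X, y, cv=5, scoring="f1").mean())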
3 votes, 2 answers

Run time of SMOTE function in package DMwR

I have a dataframe with 930,000 rows and 220 variables. The objective is binary classification, but my response classes are imbalanced (88% / 12%). I want to use SMOTE to artificially create observations for the rare event, but the function takes…
LeGossler
3 votes, 3 answers

Running XGBoost with *highly* imbalanced data returns near 0% true positive rate. Tried SMOTE and it did not improve much. What else can I do?

I'm using XGBoost on a dataset of ~2.8M records of hard drive failures, where fewer than 200 are tagged as failures. After cleaning, there are 11 features in this dataset. Below is my R code, as well as a link to the dataset I uploaded to my S3…
Ray
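One answer often suggested for this degree of imbalance is XGBoost's built-in class weighting rather than resampling; a minimal sketch assuming the xgboost Python package (the question itself uses R, and the data here is synthetic):

    from sklearn.datasets import make_classification
    from xgboost import XGBClassifier

    X, y = make_classification(n_samples=10000, weights=[0.99, 0.01],
                               random_state=0)

    # scale_pos_weight ~ (#negatives / #positives) upweights the rare class.
    ratio = float((y == 0).sum()) / max((y == 1).sum(), 1)
    clf = XGBClassifier(scale_pos_weight=ratio, eval_metric="aucpr")
    clf.fit(X, y)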
3 votes, 0 answers

Should SMOTE oversampling be done before or after holdout validation's training/testing split?

Originally, without SMOTE, my ML steps go like this: feature vectorization; split the data into X_train, X_test, y_train, and y_test; use X_train and y_train for machine learning; predict/test on X_test and y_test. I think there are two spots…
2 votes, 1 answer

Can oversampling be moved outside stratified k-fold CV?

In a binary classification task, I am using imbalanced-learn's implementation of SMOTENC to oversample the positive class of a very imbalanced dataset. The total number of examples is very high, so the oversampling takes quite a while. I would…
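A minimal sketch of SMOTENC, whose categorical_features argument marks the columns that should not be interpolated (the data and column index here are illustrative):

    import numpy as np
    from imblearn.over_sampling import SMOTENC

    rng = np.random.default_rng(0)
    # Two continuous columns plus one categorical column at index 2.
    X = np.column_stack([rng.normal(size=200),
                         rng.normal(size=200),
                         rng.integers(0, 3, size=200)])
    y = np.r_[np.zeros(180), np.ones(20)]   # 10% minority class

    X_res, y_res = SMOTENC(categorical_features=[2],
                           random_state=0).fit_resample(X, y)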
2 votes, 1 answer

Oversampling methods for numerical data (regression)

There are many oversampling methods for categorical labels (for example SMOTE, ROSE, etc.). But are there oversampling methods for numerical labels (the target I want to predict with my features), in the sense that they apply something…
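One published answer is SMOTE for regression (SMOTER, Torgo et al.), which interpolates the target value along with the features. A hand-rolled sketch of that idea, assuming NumPy and scikit-learn; X_rare and y_rare are hypothetical arrays of "rare" examples:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def smoter_sample(X_rare, y_rare, k=5, rng=None):
        if rng is None:
            rng = np.random.default_rng()
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X_rare)
        i = rng.integers(len(X_rare))
        _, idx = nn.kneighbors(X_rare[i].reshape(1, -1))
        j = rng.choice(idx[0][1:])
        gap = rng.random()
        x_new = X_rare[i] + gap * (X_rare[j] - X_rare[i])
        y_new = y_rare[i] + gap * (y_rare[j] - y_rare[i])  # interpolated target
        return x_new, y_new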
2 votes, 3 answers

Binary Classification in Imbalanced Data; Oversampling and Imputation

Together with two friends, I am taking part in a university course on data mining in R, and we chose the topic of bankruptcy prediction. We started with some "clean" data found in an "in class" Kaggle competition, and compared to the leaderboard our…
1 vote, 1 answer

SMOTE algorithm

When our dataset has 5 or more attributes, how does the SMOTE algorithm produce a new sample? How is the Euclidean distance calculated with 5 or more attributes?
user346917
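Euclidean distance extends to any number of attributes: d(a, b) = sqrt(sum over i of (a_i - b_i)^2). A quick check with two 5-dimensional points, assuming NumPy:

    import numpy as np

    a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    b = np.array([2.0, 2.0, 1.0, 4.0, 7.0])

    # sqrt(1 + 0 + 4 + 0 + 4) = sqrt(9) = 3.0
    print(np.sqrt(np.sum((a - b) ** 2)))   # 3.0
    print(np.linalg.norm(a - b))           # same result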
1 vote, 1 answer

Getting different results when running SMOTE

I have this code, which runs SMOTE and then computes roc_auc_score. The issue is that every time I run the code on the same dataset, I get different results. How can I fix this? I need the same sample and the same results when running my code. The ROC…
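The usual fix is to pin every source of randomness with a fixed random_state; a minimal sketch assuming scikit-learn and imbalanced-learn (synthetic data for illustration):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split
    from imblearn.over_sampling import SMOTE

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

    # Fix random_state wherever randomness enters: the split, SMOTE, the model.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)
    X_res, y_res = SMOTE(random_state=42).fit_resample(X_tr, y_tr)
    clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
    print(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))  # now reproducible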
1 vote, 0 answers

Designing an experiment to compare how multiple SMOTE variants affect multiple classification models on multiple datasets

For a university paper, I want to test the hypothesis that one particular SMOTE variant outperforms two other SMOTE variants. By 'outperforms' I mean a higher F1 measure. I want to test this using multiple datasets (say, 20 datasets)…
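A skeleton of such an experiment, assuming imbalanced-learn and scikit-learn; the three variants, two models, and tiny synthetic datasets below are placeholders for the paper's actual choices:

    from itertools import product
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_score
    from imblearn.pipeline import Pipeline
    from imblearn.over_sampling import SMOTE, BorderlineSMOTE, SVMSMOTE

    datasets = {f"ds{i}": make_classification(n_samples=500, weights=[0.9, 0.1],
                                              random_state=i) for i in range(3)}
    variants = {"smote": SMOTE(random_state=0),
                "borderline": BorderlineSMOTE(random_state=0),
                "svmsmote": SVMSMOTE(random_state=0)}
    models = {"logreg": LogisticRegression(max_iter=1000),
              "tree": DecisionTreeClassifier(random_state=0)}

    # One cross-validated F1 score per (dataset, variant, model) cell; the
    # hypothesis test is then run over these scores.
    for (dn, (X, y)), (vn, v), (mn, m) in product(
            datasets.items(), variants.items(), models.items()):
        pipe = Pipeline([("sampler", v), ("model", m)])
        f1 = cross_val_score(pipe, X, y, cv=5, scoring="f1").mean()
        print(dn, vn, mn, round(f1, 3))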
1 vote, 1 answer

A question about logistic regression classifier performance (with and without resampling)

I am working on a dataset with 20 independent variables and 41,188 instances. The task is binary classification, where the target variable has 36,548 no's and 4,640 yes's. I have used a logistic regression model with 10 folds of cross…
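Before resampling, it is worth comparing against logistic regression's built-in class weighting; a minimal sketch assuming scikit-learn, with synthetic data at roughly the question's class ratio:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=5000, n_features=20,
                               weights=[0.89, 0.11], random_state=0)

    # Plain vs class-weighted logistic regression under 10-fold CV.
    for cw in (None, "balanced"):
        clf = LogisticRegression(max_iter=1000, class_weight=cw)
        f1 = cross_val_score(clf, X, y, cv=10, scoring="f1").mean()
        print(cw, round(f1, 3))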