Questions tagged [under-sampling]

12 questions
1
vote
0 answers

Undersampling of datasets and training the model using early stopping

I need some clarification on the undersampling of datasets. I have 3 datasets: undersampled training data, undersampled validation data, and a test dataset that is not undersampled and is a true representation of the population. My questions are: I…
RH1
  • 21
  • 1
1
vote
1 answer

How does R's randomForest sampsize work?

I am working on a predictive model (imbalanced data) and trying to undersample the majority class. I wanted to get a representative sample of my majority class and came across R's randomForest, which has a parameter "sampsize".…
0
votes
0 answers

Threshold / Ratio to consider undersampling / oversampling

I have a classification task (predicting DNA methylation) with a somewhat unbalanced dataset - 38% of values are in the minority class, and the other 62% in the majority class. I have read that one way to work with unbalanced data is to do…
charelf
  • 171
  • 4
0
votes
0 answers

Undersampling a multi-label dataset

I have a multi-label dataset whose label distribution looks something like this, with the label on the x-axis and the number of rows in which it occurs on the y-axis. ## imports import numpy as np import pandas as pd %matplotlib inline from sklearn.datasets…
0
votes
1 answer

Should models built using under-sampled data be evaluated against the population

I have a dataset of 11 million rows with a 1:10 ratio between minority and majority classes. To train a model, I have selected all the minority class members and 1/3 of the majority class. The ratio is now 3:10 and the sample data is comprised of…
0
votes
1 answer

Coefficient estimates of logistic regression after downsampling majority class

I used a binary logit model with a lasso regularization term on an unbalanced dataset, where I undersampled the majority class (the minority class makes up 2% of observations) to get a 50/50 split of the classes. Now I want to estimate the model coefficients,…
0
votes
0 answers

Best (quality/time) undersampling technique

I am working on a very unbalanced dataset (90% to 10%) with around 350,000 records, and am trying various classification methods. I began with SMOTE, which was quite fast and improved performance on tree classifiers (CART) but made it worse with all…
Mauro
  • 11
  • 3
0
votes
1 answer

Unbalanced dataset classification problem

I have a binary classification problem and I'm working with an unbalanced dataset. The count for each class looks like this. Training set: Class 0: 29 cases, Class 1: 6246 cases. Test set: Class 0: 2678 cases, Class 1: 12 cases. I…
0
votes
1 answer

What are some "not so common" methods for dealing with unbalanced data?

When we talk about unbalanced data, we usually think about SMOTE, resampling and so on, that is, the methods mentioned here: https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets. What other methods have you seen that are not…
Dumb ML
  • 197
  • 6
0
votes
0 answers

What's a range of good F1 scores?

I have watched a lot of videos on machine learning, and they all say different things about F1 scores. One video says that an F1 score of 0.8 is bad, but another says an F1 score of 0.4 is excellent. What's up with this? I ran my model with Random Forest…
0
votes
0 answers

Limits of oversampling

I have a dataset with an event rate of less than 0.3 percent. To improve the modeling results, I did some oversampling using SMOTE. I initially oversampled so that the event rate increases 10 times to 3 percent. But that doesn't feel right. Are…
0
votes
0 answers

What causes the high OOB-error for randomForest() in R?

I'm trying to fit a random forest in R on a dataset with 16364 observations (after undersampling), using the function randomForest(). But my results look really weird: What could have caused this? My data was very unbalanced at first, which is why I…
AnnieFrannie
  • 139
  • 9