Questions tagged [unbalanced-classes]

Data organized into discrete categories or *classes* may present problems for certain analyses if the number of observations ($n$) belonging to each class is not constant across classes. Classes with unequal $n$ are *unbalanced*.

This tag should be used for questions about datasets with subsamples of unequal size, where imbalanced distributions across categorical factors are of concern.
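
To make the definition above concrete, here is a minimal sketch that counts observations per class and reports the ratio between the largest and smallest class. It assumes the labels are held in a plain, non-empty Python list; the function name `class_balance` and the 90/10 example labels are hypothetical, chosen only for illustration.

```python
from collections import Counter


def class_balance(labels):
    """Return per-class counts and the ratio of largest to smallest class size.

    A ratio of 1.0 means every class has the same n (balanced);
    larger values mean the classes are more unbalanced.
    Assumes `labels` is non-empty.
    """
    counts = Counter(labels)
    largest = max(counts.values())
    smallest = min(counts.values())
    return dict(counts), largest / smallest


# Hypothetical data: 90 observations of class 0 and 10 of class 1.
labels = [0] * 90 + [1] * 10
counts, ratio = class_balance(labels)
print(counts)  # {0: 90, 1: 10}
print(ratio)   # 9.0
```

A ratio well above 1 only shows that the classes are unbalanced; whether that matters depends on the analysis being run.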

Analyses with known, non-negligible sensitivity to unbalanced classes include (but are not limited to):

Reference

Howell, D. C. (2009). Unequal sample sizes do matter. University of Vermont. Retrieved from http://www.uvm.edu/~dhowell/StatPages/More_Stuff/Unequal-ns/unequal-ns.html.

915 questions
107 votes, 3 answers

Does an unbalanced sample matter when doing logistic regression?

Okay, so I think I have a decent enough sample, taking into account the 20:1 rule of thumb: a fairly large sample (N=374) for a total of 7 candidate predictor variables. My problem is the following: whatever set of predictor variables I use, the…
Michiel
88 votes, 8 answers

When is unbalanced data really a problem in Machine Learning?

We already had multiple questions about unbalanced data when using logistic regression, SVM, decision trees, bagging and a number of other similar questions, which makes it a very popular topic! Unfortunately, each of the questions seems to be…
Tim
66 votes, 1 answer

Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?

TL;DR: See title. Motivation: I am hoping for a canonical answer along the lines of "(1) No, (2) Not applicable, because (1)", which we can use to close many wrong questions about unbalanced datasets and oversampling. I would be quite as happy to be…
Stephan Kolassa
59 votes, 7 answers

Binary classification with strongly unbalanced classes

I have a data set in the form of (features, binary output 0 or 1), but 1 happens pretty rarely, so just by always predicting 0, I get accuracy between 70% and 90% (depending on the particular data I look at). The ML methods give me about the same…
56 votes, 5 answers

Training a decision tree against unbalanced data

I'm new to data mining and I'm trying to train a decision tree against a data set which is highly unbalanced. However, I'm having problems with poor predictive accuracy. The data consists of students studying courses, and the class variable is the…
chrisb
56 votes, 4 answers

What is the proper usage of scale_pos_weight in xgboost for imbalanced datasets?

I have a very imbalanced dataset. I'm trying to follow the tuning advice and use scale_pos_weight, but I'm not sure how I should tune it. I can see that RegLossObj.GetGradient does: if (info.labels[i] == 1.0f) w *= param_.scale_pos_weight so a gradient…
ihadanny
53 votes, 4 answers

Class imbalance in Supervised Machine Learning

This is a general question, not specific to any method or data set. How do we deal with a class imbalance problem in supervised machine learning where around 90% of the observations in your dataset are 0 and around 10% are 1? How do we optimally…
NG_21
50 votes, 3 answers

What is the root cause of the class imbalance problem?

I've been thinking a lot about the "class imbalance problem" in machine/statistical learning lately, and am being drawn ever deeper into a feeling that I just don't understand what is going on. First let me define (or attempt to define) my terms: The…
43 votes, 4 answers

When should I balance classes in a training data set?

I took an online course where I learned that unbalanced classes in the training data might lead to problems, because classification algorithms go for the majority rule, as it gives good results if the imbalance is too large. In an assignment one had…
41 votes, 1 answer

Does down-sampling change logistic regression coefficients?

If I have a dataset with a very rare positive class, and I down-sample the negative class, then perform a logistic regression, do I need to adjust the regression coefficients to reflect the fact that I changed the prevalence of the positive…
Zach
35 votes, 6 answers

Sampling for Imbalanced Data in Regression

There have been good questions on handling imbalanced data in the classification context, but I am wondering what people do to sample for regression. Say the problem domain is very sensitive to the sign but only somewhat sensitive to the magnitude…
someben
35 votes, 3 answers

Classification/evaluation metrics for highly imbalanced data

I deal with a fraud detection (credit-scoring-like) problem. As such there is a highly imbalanced relation between fraudulent and non-fraudulent observations. http://blog.revolutionanalytics.com/2016/03/com_class_eval_metrics_r.html provides a great…
33 votes, 4 answers

Optimising for Precision-Recall curves under class imbalance

I have a classification task where I have a number of predictors (one of which is the most informative), and I am using the MARS model to construct my classifier (I am interested in any simple model, and using glms for illustrative purposes would be…
32 votes, 5 answers

What problem does oversampling, undersampling, and SMOTE solve?

In a recent, well-received question, Tim asks "When is unbalanced data really a problem in Machine Learning?" The premise of the question is that there is a lot of machine learning literature discussing class balance and the problem of imbalanced…
32 votes, 6 answers

Sample size for logistic regression?

I want to make a logistic model from my survey data. It is a small survey of four residential colonies in which only 154 respondents were interviewed. My dependent variable is "satisfactory transition to work". I found that, of the 154 respondents,…