
I would like to build an ensemble classifier (possibly boosting) on a huge training dataset (>> 1e7 examples) where the proportion of positive examples is around 5%. What I am interested in are the recall and precision of the positive class.

If I train on the whole dataset, does the small proportion of positive examples hurt the performance of the classifier? I'm not sure, because the absolute number of positive examples is still large. Do I need to randomly subsample the negative examples so that their count is roughly the same as the number of positive examples?

S.Wang
  • A 5% imbalance will cause no problems for machine learning algorithms that attempt to predict probabilities (logistic regression, random forest, gradient boosting, neural networks). You will want to use one of these models, and then tune a classification threshold on the resulting probabilities to achieve the goals for your decision rule. – Matthew Drury Jun 15 '18 at 22:44
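Below is a minimal sketch of the threshold-tuning workflow this comment describes, using scikit-learn on synthetic data; the model choice, sample size, and 0.8 precision target are illustrative assumptions, not part of the original discussion.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset: ~5% positives.
X, y = make_classification(n_samples=100_000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

# Any probability-producing model works here; gradient boosting is one option.
model = GradientBoostingClassifier().fit(X_tr, y_tr)
proba = model.predict_proba(X_val)[:, 1]   # predicted P(positive)

# Sweep all candidate thresholds; among those meeting the (assumed)
# precision target of 0.8, keep the one with the highest recall.
precision, recall, thresholds = precision_recall_curve(y_val, proba)
ok = precision[:-1] >= 0.8                 # thresholds is one element shorter
threshold = thresholds[ok][np.argmax(recall[:-1][ok])] if ok.any() else 0.5
y_pred = (proba >= threshold).astype(int)
```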

1 Answer


The simplest way to achieve this is to clone each positive example roughly 19 times (negatives outnumber positives 19:1 at a 5% positive rate), so that you end up with about the same number of positive and negative examples. Depending on the algorithm you use for training, you may achieve the same behavior by adding more weight to the positive examples, so that the algorithm penalizes false negatives more heavily than false positives.
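As a concrete illustration of the cloning step, here is a minimal NumPy sketch; the feature matrix and labels are random placeholders standing in for the real data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 20))             # placeholder features
y = (rng.random(100_000) < 0.05).astype(int)   # ~5% positives, as in the question

pos = y == 1
ratio = int(round((~pos).sum() / pos.sum()))   # ~19 at a 5% positive rate

# Repeat each positive row `ratio` times so the classes come out roughly balanced.
X_bal = np.concatenate([X[~pos], np.repeat(X[pos], ratio, axis=0)])
y_bal = np.concatenate([y[~pos], np.repeat(y[pos], ratio)])
```

Shuffle X_bal and y_bal (together) before training if the algorithm is order-sensitive.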

Ivan Kovtun
    Why not instead leave the data as is and change the classification threshold to achieve the desired decision rule metrics? What's the reason you recommend a more complicated procedure in this case? – Matthew Drury Jun 15 '18 at 22:46
  • Sometimes just changing the threshold helps, but in general you get better results with weighted training data, because there you state clearly what you want to optimize. The question is which function your training algorithm optimizes. If it optimizes the number of misclassified training examples, then you get results biased toward the bigger class, and changing the threshold in that case does not guarantee that the result becomes optimal. In your case, if the algorithm decides to assign everything to the negative class, it already gets 95% accuracy. Final results with a changed threshold may be essentially random. – Ivan Kovtun Jun 16 '18 at 04:42
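For completeness, here is a minimal sketch of the weighting alternative described in the answer and in this comment: rather than cloning rows, pass per-example weights so that each positive counts roughly 19 times as much. The logistic-regression model and random placeholder data are assumptions for illustration; many libraries expose an equivalent knob under a different name.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 20))             # placeholder features
y = (rng.random(100_000) < 0.05).astype(int)   # ~5% positives

# Weight each positive example so the weighted class totals match (~19x here).
w = np.where(y == 1, (y == 0).sum() / (y == 1).sum(), 1.0)
clf = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=w)
```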