
I have an unbalanced dataset, say 500 positives and 50,000 negatives. Can I deal with this by randomly choosing 300 of the 500 positives and 300 of the 50,000 negatives? Does this approach suffer from the problem of implicitly selecting on Y so that the chosen Ys have different Xs, as described in this article: https://gking.harvard.edu/files/0s.pdf
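For concreteness, the subsampling scheme described above could be sketched as follows. This is only an illustration under stated assumptions: the label vector `y` is synthetic, and the helper names `balanced_subsample` and `prior_corrected_intercept` are hypothetical. The second helper applies the prior correction to a logistic regression intercept as I understand it from the linked article, subtracting ln[((1 - tau)/tau) * (ybar/(1 - ybar))], where tau is the fraction of positives in the full data and ybar is the fraction in the subsample.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Synthetic labels with the counts from the question (assumption: no real data here).
y = np.concatenate([np.ones(500, dtype=int), np.zeros(50_000, dtype=int)])


def balanced_subsample(y, n_per_class, rng):
    """Return indices of n_per_class positives and n_per_class negatives,
    each drawn uniformly at random without replacement."""
    pos = rng.choice(np.flatnonzero(y == 1), size=n_per_class, replace=False)
    neg = rng.choice(np.flatnonzero(y == 0), size=n_per_class, replace=False)
    return np.concatenate([pos, neg])


def prior_corrected_intercept(intercept, tau, ybar):
    """Shift a logistic regression intercept fitted on the subsample.

    tau  : fraction of positives in the full data
    ybar : fraction of positives in the subsample
    """
    return intercept - np.log(((1 - tau) / tau) * (ybar / (1 - ybar)))


idx = balanced_subsample(y, 300, rng)
tau = y.mean()        # 500 / 50,500 in the full data
ybar = y[idx].mean()  # 0.5 in the balanced subsample

# With a placeholder intercept of 0.0, the result is just the shift itself.
offset = prior_corrected_intercept(0.0, tau, ybar)
```

Because the subsample is drawn on the outcome, the fraction of positives in it (0.5) no longer matches the full data, so predicted probabilities from a model fitted on it would be too high unless the intercept is shifted in some such way. Whether the slope estimates are also affected is the question being asked, and this sketch does not settle it.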

TimD
  • Please tell us why you think unbalanced classes are a problem, and read https://stats.stackexchange.com/questions/283170/when-is-unbalanced-data-really-a-problem-in-machine-learning – kjetil b halvorsen Oct 23 '19 at 10:00
  • My problem is that I get 70% accuracy on the minority class but only 50% accuracy on the majority class. Overall, that means I expect 500*0.7 + 49,500*0.5 = 25,100 correct predictions out of 50,000, which seems no better than random. I'm trying to fix this by subsampling. – TimD Oct 23 '19 at 13:13
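The arithmetic in the last comment can be checked with a few lines of Python. This is a small sketch using the counts from that comment (500 positives and 49,500 negatives, which differ slightly from the 50,000 negatives stated in the question); the function name `overall_accuracy` is made up for illustration. It also reports the unweighted mean of the two per-class accuracies, often called balanced accuracy, to show how strongly the overall figure is driven by the majority class.

```python
def overall_accuracy(n_pos, n_neg, acc_pos, acc_neg):
    """Combine per-class accuracies into the expected number of correct
    predictions and the overall accuracy."""
    correct = n_pos * acc_pos + n_neg * acc_neg
    return correct, correct / (n_pos + n_neg)


# Counts and per-class accuracies taken from the comment above.
correct, acc = overall_accuracy(500, 49_500, 0.7, 0.5)

# Unweighted mean of the per-class accuracies (balanced accuracy).
balanced = (0.7 + 0.5) / 2

print(f"expected correct: {correct:.0f}")  # 25100
print(f"overall accuracy: {acc:.3f}")      # 0.502
print(f"balanced accuracy: {balanced:.2f}")  # 0.60
```

The overall accuracy sits at about 0.502 because the majority class contributes 99% of the predictions, so the 70% figure on the minority class barely moves it.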

0 Answers