
I have been trying to classify a set of data into one of four classes. The data has already been generated: I have set aside 10,000 examples for training and 2,000 for testing, and I have generated a label for each example. Let's call the classes 0, 1, 2, and 3.

Now when I observe the classification, I notice that there are a lot of 0s in the training data, so in most cases the classifier just learns to predict 0 no matter what the features are. (I am using random forests for classification.)

Generating the data again to ensure uniformity takes a lot of time, and I would prefer to avoid that. Is there any way I can still use the data that I have?

kjetil b halvorsen
    It's not evidence of bias that a model or procedure often predicts values that often occur. On the contrary, isn't that what you should want and expect? – Nick Cox Jul 07 '14 at 18:02
  • How is your future data (other than the test data) distributed, and how does that distribution compare with the test and train data? – rapaio Jul 07 '14 at 18:36
  • The test data and train data are randomly generated, so they follow the same distribution and I expect the future data to also follow the same distribution. – Anirudh Vemula Jul 09 '14 at 18:33
    "Lots of zeros" is not evidence for bias. If you'd learn your classifier to predict lions on the North Poole you wouldn't expect it to predict many lions... That is your data. Maybe you should look for an algorithm that behaves better with this kind of data but this is a different problem. Write more on your data, so it is more clear *why* the zeros are a problem in this case. – Tim Jan 05 '15 at 15:01
  • Stratified sampling is one option, but then you don't end up using all your data. Alternatively, you could update your loss function to add weight to the under-represented classes in proportion to their rarity (see the sketch below). – Alireza Jul 07 '14 at 16:13
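
For concreteness, here is a minimal sketch of the class-weighting idea from that comment, assuming scikit-learn's `RandomForestClassifier` (the synthetic data is purely illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Illustrative imbalanced data in the shape of the question:
# class 0 dominates, classes 1-3 are rare.
rng = np.random.default_rng(0)
y_train = rng.choice([0, 1, 2, 3], size=10_000, p=[0.85, 0.05, 0.05, 0.05])
X_train = rng.normal(loc=y_train[:, None], scale=1.0, size=(10_000, 5))

# class_weight='balanced' reweights each class inversely to its frequency,
# so mistakes on the rare classes cost more during tree growth.
clf = RandomForestClassifier(n_estimators=100, class_weight='balanced',
                             random_state=0)
clf.fit(X_train, y_train)
```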

2 Answers


Another way is to oversample: "Oversampling: you duplicate the observations of the minority class to obtain a balanced dataset." [1]
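
A minimal sketch of this, assuming scikit-learn (`sklearn.utils.resample` does the duplication; the helper name `oversample` is my own):

```python
import numpy as np
from sklearn.utils import resample

def oversample(X, y, random_state=0):
    """Duplicate minority-class rows until every class matches the majority."""
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    X_parts, y_parts = [], []
    for c in classes:
        mask = (y == c)
        X_c, y_c = resample(X[mask], y[mask],
                            replace=True, n_samples=n_max,
                            random_state=random_state)
        X_parts.append(X_c)
        y_parts.append(y_c)
    return np.vstack(X_parts), np.concatenate(y_parts)
```

Only the training set should be oversampled; the test set must keep the original class distribution so that the evaluation stays honest.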

But note that oversampling the minority class may lead to overfitting, so be sure to check for that on held-out data.

You may also want to check this paper: Yap, Bee Wah, et al. "An Application of Oversampling, Undersampling, Bagging and Boosting in Handling Imbalanced Datasets." [2]

Alexey Grigorev

This is usually referred to as class imbalance or skewed data, not bias.

For a random forest, you can use roughly balanced bagging to resample the data used to grow each tree during the bagging process.
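
As a sketch of the per-tree resampling idea: roughly balanced bagging proper draws the majority-class sample size from a negative binomial distribution, but the simpler balanced bootstrap below (equal-size bootstrap per class for each tree; all names are my own) illustrates the mechanism:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def balanced_bootstrap_forest(X, y, n_trees=100, seed=0):
    """Each tree sees an equal-size bootstrap sample from every class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_per_class = counts.min()  # size of the smallest class
    trees = []
    for _ in range(n_trees):
        idx = np.concatenate([
            rng.choice(np.flatnonzero(y == c), size=n_per_class, replace=True)
            for c in classes
        ])
        tree = DecisionTreeClassifier(max_features='sqrt')
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def forest_predict(trees, X):
    """Majority vote over the trees' predictions."""
    votes = np.stack([t.predict(X) for t in trees]).astype(int)
    n_classes = votes.max() + 1
    counts = np.apply_along_axis(np.bincount, 0, votes, minlength=n_classes)
    return counts.argmax(axis=0)
```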

You can also look into using a weighted or cost-sensitive criterion for tree growth, such as weighted Gini or entropy. Note that the weights should be tuned with a grid search or other hyperparameter optimization, as it is difficult to guess good ones; somewhat counterintuitively, weighting the majority class more than the minority class may produce the best balanced error.
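
A minimal sketch of such a weight search, assuming scikit-learn (the weight grid here is arbitrary and should be adapted):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Candidate weightings for the majority class 0, including weights above 1,
# since upweighting the majority can sometimes win, as noted above.
param_grid = {
    'class_weight': [
        {0: w, 1: 1.0, 2: 1.0, 3: 1.0} for w in (0.1, 0.25, 0.5, 1.0, 2.0)
    ]
}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    param_grid,
    scoring='balanced_accuracy',  # average recall over the classes
    cv=5,
)
# search.fit(X_train, y_train)  # assumes X_train, y_train are defined
```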

Finally, Hellinger distance decision trees have recently been proposed as less sensitive to this sort of thing.
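
For reference, the Hellinger distance used as a split criterion there measures the distance between the feature-value distributions conditioned on each class, e.g. for two classes $d_H = \sqrt{\sum_{v}\left(\sqrt{P(v\mid+)} - \sqrt{P(v\mid-)}\right)^2}$; because it conditions on the class rather than using the class priors, it is largely unaffected by imbalance.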

I wrote a random forest implementation that includes a bunch of different methods for imbalanced data.

Ryan Bressler