I am working on a predictive model with imbalanced data and am trying to undersample the majority class. I want a representative sample of the majority class, and I came across R's RandomForest, which has a parameter `sampsize`.

Can someone help me know how R's RandomForest subsamples the data? Maybe this can help solve my problem or maybe suggest me some other method?

I've tried computing the centroid of the majority class and undersampling by eliminating all the samples that are far from that centroid, but I didn't get satisfactory results. I have around 50 features and working in python.

Amarpreet Singh
  • Why is the class imbalance an issue for you? [Statistics sees little issue with class imbalance.](https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he) – Dave Dec 02 '21 at 22:56

1 Answer

`sampsize` is one way to deal with imbalanced data in a random forest: each tree is grown on a balanced sample even though the full data set is not balanced. This is possible because every tree is trained on its own bootstrap sample.

With `sampsize` you tell the model how many units to sample from each class for each tree. For example, `sampsize=c('0'=10,'1'=20)` draws 10 units from class '0' and 20 units from class '1' (if your classes use different labels, change the names accordingly). With `replace=T` the sampling is done with replacement, so in this example each tree gets 10 units drawn with replacement from class '0' and 20 from class '1'.
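Since the question mentions Python, here is a minimal sketch of the per-class sampling step described above, using only the standard library. It is an illustration of the idea, not R's actual implementation; the function name `stratified_sample` and the toy labels are made up for this example.

```python
import random
from collections import defaultdict

def stratified_sample(labels, sampsize, replace=True, seed=None):
    """Return row indices for one tree: sampsize[c] draws from each class c.

    Hypothetical helper that mimics the idea behind sampsize; not a library API.
    """
    rng = random.Random(seed)
    # Group row indices by class label.
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    chosen = []
    for cls, n in sampsize.items():
        pool = by_class[cls]
        if replace:
            chosen.extend(rng.choices(pool, k=n))  # with replacement
        else:
            chosen.extend(rng.sample(pool, n))     # without replacement
    return chosen

# Toy imbalanced data: 1000 rows of class "0", 50 rows of class "1".
labels = ["0"] * 1000 + ["1"] * 50
idx = stratified_sample(labels, {"0": 10, "1": 20}, replace=True, seed=0)
counts = {c: sum(labels[i] == c for i in idx) for c in ("0", "1")}
print(counts)  # {'0': 10, '1': 20}
```

Repeating this draw once per tree and fitting each tree on its own sample gives the kind of per-tree balancing that `sampsize` provides. In Python, imbalanced-learn's `BalancedRandomForestClassifier` takes a similar per-tree approach, though its defaults may differ from the R behaviour described here.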

user2974951
  • Thanks for the answer. But that's not my question. I'm trying to ask how I can manually make my data unbiased without information loss so that it improves the model accuracy. As I'm working in python, the python RF package doesn't have "sampsize" parameter in it. – Amarpreet Singh Mar 30 '20 at 10:01
  • @AmarpreetSingh `How R randomforest sampsize works?` That's the title of your question and that is what I answered. Nowhere in your question did you mention `how I can manually make my data unbiased without information loss so that it improves the model accuracy`. Also nowhere did you mention that you are using python. Your question is either misleading or you are wrong. – user2974951 Mar 30 '20 at 10:31
  • I have already mentioned: "Can someone help me know how R's RandomForest subsamples the data? Maybe this can help solve my problem or maybe suggest me some other method?" and in the last line: "I have around 50 features and working in python." Please don't answer by reading only the title. Kindly go through the whole description. – Amarpreet Singh Mar 30 '20 at 14:19