
I am currently working on a data mining project and I am using a random forest classification model. I have a few questions about it.

  1. Will the random forest handle correlation between the predictor attributes given while building the model, or should we give only independent attributes to the model?

  2. Will it affect the performance if we give many predictors to the model?

Thanks in advance.

Prabhuraj
    This is not a programming question. – Rich Scriven Oct 17 '14 at 03:35
  • I agree, it should be moved to Cross Validated... – Alex Oct 17 '14 at 03:40
  • Please review what's considered [on-topic](http://stackoverflow.com/help/on-topic) for Stack Overflow. This generally does not include discussing statistical methods. The correct place for that would be [stats.se]. –  Oct 17 '14 at 03:59

2 Answers


To help answer your questions, let me quote a nice explanation of the training process (taken from here):

  1. Sample N cases at random with replacement to create a subset of the data. The subset should be about 66% of the total set.

  2. At each node:

  • For some number m (see below), m predictor variables are selected at random from all the predictor variables

  • The predictor variable that provides the best split, according to some objective function, is used to do a binary split on that node.

  • At the next node, choose another m variables at random from all predictor variables and do the same.

Depending upon the value of m, there are three slightly different systems:

  • Random splitter selection: m = 1
  • Breiman's bagger: m = total number of predictor variables
  • Random forest: m << number of predictor variables. Breiman suggests three possible values for m: ½√p, √p, and 2√p, where p is the total number of predictor variables
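
These three choices of m correspond to the mtry argument of R's randomForest package. Here is a minimal sketch of that mapping (my own illustration, not from the quoted source), using the built-in iris data purely as an example:

```r
# Minimal sketch: the three choices of m map onto randomForest's mtry argument.
library(randomForest)

set.seed(42)
p <- ncol(iris) - 1  # total number of predictor variables (4 here)

rf_splitter <- randomForest(Species ~ ., data = iris, mtry = 1)               # random splitter selection
rf_bagger   <- randomForest(Species ~ ., data = iris, mtry = p)               # Breiman's bagger
rf_forest   <- randomForest(Species ~ ., data = iris, mtry = floor(sqrt(p)))  # classic random forest (~ sqrt(p))
```

Note that for classification randomForest already defaults to mtry = floor(sqrt(p)), i.e. the classic random forest choice.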

So going back to your questions:

1. The correlation won't really matter because, depending on the chosen system, the algorithm looks at either one variable at a time or picks the best 'splitter' from a random subset, so you don't have to restrict yourself to independent predictors (see the sketch after point 2).

2. I'm not sure what you mean by performance here. If you mean the speed of the algorithm, you can gauge the impact by looking at the process described above (how, and how many, variables are chosen at each node).

Now, if by performance you mean the accuracy of the model, then in general the more predictors there are, and the more independent they are, the better; 'artificially' deriving additional (correlated) features may also lead to better results if the initial results are not satisfactory.
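
As a rough illustration of both points (my own sketch, not part of the original answer), you can add a strongly correlated copy of a predictor plus a pure-noise column and compare training time and out-of-bag error:

```r
# Minimal sketch: effect of correlated / extra predictors on OOB error and speed.
library(randomForest)

set.seed(42)
iris2 <- iris
# Add a near-duplicate (highly correlated) predictor and a pure-noise predictor
iris2$Petal.Length.Copy <- iris2$Petal.Length + rnorm(nrow(iris2), sd = 0.05)
iris2$Noise <- rnorm(nrow(iris2))

system.time(rf_orig  <- randomForest(Species ~ ., data = iris))
system.time(rf_extra <- randomForest(Species ~ ., data = iris2))

# Final out-of-bag error rates; expect them to be very similar
tail(rf_orig$err.rate[, "OOB"], 1)
tail(rf_extra$err.rate[, "OOB"], 1)
```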

Icki

Random forests can handle noisy features as well as correlated ones. In general, a random forest automatically learns which features helped separate the prediction classes the most. Just train a forest in R and retrieve the variable importance measures, such as the mean decrease in Gini index; this can help you determine the strongest features. Of course, you can always do a feature selection step beforehand. There are suggestions here: Feature Selection Packages in R, which cover both regression and classification.
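
For example, a minimal sketch of retrieving those importance measures with the randomForest package (using the built-in iris data as a stand-in for your own):

```r
# Minimal sketch: variable importance from a random forest in R.
library(randomForest)

set.seed(42)
rf <- randomForest(Species ~ ., data = iris, importance = TRUE)

importance(rf)   # matrix with MeanDecreaseAccuracy and MeanDecreaseGini columns
varImpPlot(rf)   # visual ranking of the strongest features
```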

But if you have fewer than 100 features, I would guess that you do not need to put too much effort into it, as random forests are fairly robust to noisy or redundant features.