randomForest in R for classification

Question

I am trying to fit a random forest model in R:

RF <- randomForest(as.factor(y) ~ ., data = train, importance = TRUE,
                   proximity = FALSE, ntree = 1000, keep.forest = TRUE)

Y is binary. The model runs without any issues. However, I have some questions.

I have about 50,000 rows of data and 300 variables. According to other threads, using proximity=TRUE will likely run very slowly or require a lot of memory.

1. If memory were not an issue, would using proximity=TRUE give me different results?
2. I am interested in the probability of predicting 1 as my final outcome. When I run predicted=predict(RF), I get 1s and 0s as my predictions. I would rather have probabilities.
3. How do we implement a random forest model? I have 1000 trees. If I want to deploy this model in real time, how do we code it? Do we code 1000 decision trees to make a prediction for each new case?

Thank you.

You might find this answer helpful: http://stats.stackexchange.com/questions/21152/obtaining-knowledge-from-a-random-forest — user3490, Jul 14 '13 at 19:32
The selected answer at that link is somewhat misleading. While you do get many diagnostic tools with RF, you can only get a plot like the one shown if your RF is fitted using only stumps. This is because the plot assumes the model is additive in the predictors, which will not be true if the base trees are more complex. — Hong Ooi, Jul 15 '13 at 03:18

Hong Ooi · Answer 1 · 2013-07-14T16:23:29.087

6

1. No, changing the proximity argument won't change the result per se. However, since RF is a stochastic algorithm, you'll get a different model each time you run it, because its random draws will be different. If you don't want this behaviour, set the random seed with set.seed before each run of randomForest.
2. To get predicted probabilities, use type="prob" in your call to predict. You'll get a matrix with one column per class, giving the probability of each case being a 0 or a 1.
3. Yes, potentially you have to code 1000 individual trees if you're re-implementing the model logic elsewhere. This is where in-database implementations of R are very handy: you can run your RF model inside the database (or import it there) and keep the fitted model object around.
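
The first two points above can be sketched roughly as follows. This is a minimal illustration, not the asker's actual setup: the data frame `train`, its columns `x1` and `x2`, and the `newdata` rows are made-up stand-ins, and `ntree` is reduced to keep the example fast.

```r
library(randomForest)

# Hypothetical stand-in data: a binary outcome y and two predictors.
set.seed(42)
train <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
train$y <- as.factor(ifelse(train$x1 + rnorm(200, sd = 0.5) > 0, 1, 0))

# Fix the seed immediately before fitting so the forest is reproducible.
set.seed(123)
RF <- randomForest(y ~ ., data = train, ntree = 100, importance = TRUE,
                   proximity = FALSE, keep.forest = TRUE)

# Class probabilities instead of hard 0/1 labels.
# Without newdata, predict() returns out-of-bag estimates for the training rows.
oob_prob <- predict(RF, type = "prob")

# Probabilities for new cases; the column named "1" is the estimated P(y = 1).
newdata <- data.frame(x1 = c(-1, 1), x2 = c(0, 0))
p1 <- predict(RF, newdata = newdata, type = "prob")[, "1"]
```

The probabilities here are the fraction of trees voting for each class, so with few trees they take only a coarse set of values.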

edited Jul 14 '13 at 16:23

answered Jul 14 '13 at 16:13


Thank you very much for your help. For item 3, is there an example anywhere I can look at? How do we even save the 1000 trees, and how can we look at what is inside them? What is an in-database implementation? Any example would greatly help me implement this model. Thank you – user16789 Jul 14 '13 at 19:15
An example for point 3 is available here: http://stackoverflow.com/questions/7863942/save-a-random-forest-object – user3490 Jul 14 '13 at 19:37
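
As a rough sketch of point 3 when the scoring process can itself run R: rather than hand-coding each tree, the fitted object can be written to disk and reloaded later. This is one possible approach and not necessarily what the linked thread recommends. The data, the column names, and the file path `rf_model.rds` are hypothetical placeholders.

```r
library(randomForest)

# Hypothetical stand-in data and a small forest, just to have something to save.
set.seed(1)
train <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
train$y <- as.factor(ifelse(train$x1 > 0, 1, 0))
RF <- randomForest(y ~ ., data = train, ntree = 50, keep.forest = TRUE)

# Look inside one tree: one row per node, with its split variable and split point.
tree1 <- getTree(RF, k = 1, labelVar = TRUE)
head(tree1)

# Persist the whole fitted forest (all trees) to disk. The path is a placeholder.
model_path <- file.path(tempdir(), "rf_model.rds")
saveRDS(RF, model_path)

# Later, in a scoring process that has randomForest loaded: reload and score.
RF_loaded <- readRDS(model_path)
new_cases <- data.frame(x1 = c(-2, 2), x2 = c(0.1, -0.3))
scores <- predict(RF_loaded, newdata = new_cases, type = "prob")[, "1"]
```

The forest must have been fitted with keep.forest=TRUE for predict to work on new cases, and the new data needs the same predictor columns as the training data.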
