Because a random forest is a collection of independent CARTs, each trained on a random subset of features and records, it lends itself to parallelization. The combine()
function in the randomForest package will stitch together independently trained forests. Here is a toy example. As @mpq's answer states, you should not use the formula notation, but pass in a dataframe/matrix of variables and a vector of outcomes. I shamelessly lifted these from the docs.
library("doMC")
library("randomForest")
data(iris)
registerDoMC(4)  # number of cores on the machine
darkAndScaryForest <- foreach(y = seq(10), .combine = combine) %dopar% {
  set.seed(y)  # not strictly needed, but makes each worker reproducible
  # x/y interface rather than the formula interface, per @mpq's advice;
  # norm.votes = FALSE is required so combine() can merge the raw vote counts
  randomForest(iris[, -5], iris$Species, ntree = 50, norm.votes = FALSE)
}
I passed the randomForest combine function to the similarly named .combine parameter, which controls how the loop's outputs are merged. The downside is that you get no OOB error rate or, more tragically, no variable importance.
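You can see the same behavior without any parallel machinery; a minimal sketch, combining two small forests trained on iris:

```r
library("randomForest")
data(iris)

# two forests trained separately; norm.votes = FALSE so combine() can sum votes
f1 <- randomForest(iris[, -5], iris$Species, ntree = 50, norm.votes = FALSE)
f2 <- randomForest(iris[, -5], iris$Species, ntree = 50, norm.votes = FALSE)

both <- combine(f1, f2)
both$ntree              # 100: the trees are pooled
is.null(both$err.rate)  # TRUE: combine() discards the OOB-based components
```

As the combine() docs note, the err.rate, confusion, mse, and rsq components of the merged object are set to NULL, which is exactly why the parallel version above loses its OOB error rate.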
Edit:
After rereading the post I realize that I said nothing about the 34+ factor issue. A wholly un-thought-out answer could be to represent them as binary variables: each factor level becomes its own column, encoded 0/1 for its presence/absence. By doing some variable selection on the unimportant indicators and removing them, you could keep your feature space from growing too large.
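A sketch of that encoding using base R's model.matrix(); the data frame and column name here are made up for illustration:

```r
set.seed(1)

# hypothetical data frame with one high-cardinality factor column
df <- data.frame(
  state = factor(sample(state.name, 100, replace = TRUE)),
  y     = rnorm(100)
)

# model.matrix() expands each factor level into its own 0/1 indicator column;
# the "- 1" drops the intercept so every observed level gets a column
X <- model.matrix(~ state - 1, df)
dim(X)  # 100 rows, one column per observed level
```

You could then feed X (plus any numeric predictors) into randomForest's x/y interface and prune the indicator columns that come out with low importance.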