Complexity of a random forest with respect to maximum depth

Question

I expect the training time of a random forest to be : linear in the number of trees (obviously), linear (or square-root) in the number of columns (depending on the choice for the subsampling of the columns) and, when it comes to the maximum depth of the trees... I am lost.

I thought that finding a split (inside a leaf) was linear in the number of elements in the leaf.

Therefore, increasing the maximum depth of a tree is just finding $n$ more splits, where $n$ was the previous number of leaves. The additional operation is ($n_i$ is the size of the leaf $L_i$) : $$\sum_{i=1}^n \text{split}(L_i)=\sum_{i=1}^n\theta(n_i)=\theta(n)$$

Which is more or less equivalent to finding the first split.

Therefore, I thought that the training time of a random forest was linear in the maximum depth. However the numerical example shows (performed with sk-learn) that I am completely wrong (hopefully I will be able to post the depth 20 timings):

TIME 0.15m (1-fold)
{'n_estimators': 10, 'max_depth': 5}

TIME 0.60m (1-fold)
{'n_estimators': 50, 'max_depth': 5}

TIME 0.69m (1-fold)
{'n_estimators': 10,'max_depth': 10}

TIME 3.18m (1-fold)
{'n_estimators': 50, 'max_depth': 10}

TIME 3.85m (1-fold)
{'n_estimators': 10, 'max_depth': 15}

TIME 15.59m (1-fold)
{'n_estimators': 50, 'max_depth': 15}

Soren Havelund Welling · Answer 1 · 2016-10-11T13:12:22.423

For smaller data sets as simulated below the process should be linear. As pointed out by @EngrStudent, it may be an issue of L1, L2 and RAM clock speed. As model complexity increases the random forest algorithm probably cannot compute the entire tree(...or sub branch of tree) in L1 and/or L2 cache.

I tried to run a similar test with R randomForest, where in fact it seems to be linear. I cannot choose maxdepth in randomForest but only max terminal nodes (maxnodes), but that's effectively the same.

max terminal nodes = $2^{(maxdepth-1)}$.

Notice I plot maxnodes (1,2,4,8,16,32,64) by a log scale, and then depth (0,1,2,3,4,5,6) is plotted linearly by x axis. Time consumption appear to increase linearly with depth.

library(randomForest)
library(ggplot2)
set.seed(1)

#make some data
vars=10
obs = 4000
X = data.frame(replicate(vars,rnorm(obs)))
y = with(X, X1+sin(X2*2*pi)+X3*X4)

#wrapper function to time a model
time_model = function(model_function,...) {
  this_time = system.time({this_model_obj = do.call(model_function,list(...))})
  this_time['elapsed']
}

#generate jobs to simulate, jobs are sets of parameters (pars)
fixed_pars = alist(model_function=randomForest,x=X,y=y) #unevaluated to save memory
iter_pars = list(maxnodes=c(1,2,4,8,16,32,64),ntree = c(10,25,50),rep=c(1:5))
iter_pars_matrix = do.call(expand.grid,iter_pars)

#combine fixed and iterative pars and shape as list of jobs
job_list = apply(iter_pars_matrix,1,c,fixed_pars)

#do jobs and collect results in a data.frame
times = sapply(job_list,function(aJob) do.call(time_model,aJob))
r_df = data.frame(times,iter_pars_matrix)

#plot the results
ggplot(r_df, aes (x = maxnodes,y = times,colour = factor(ntree))) +
geom_point() + scale_x_log10()

Don't forget hardware. If you transition from L2 to RAM there is going to be a jump in latency. If you transition from RAM to disk-cache then you get another jump. — EngrStudent, Oct 11 '16 at 12:53
you might try running it using the 'h2o' library, with 'Flow' web-interface. They use java, but are decent about sharing across cores and reporting execution times. — EngrStudent, Oct 11 '16 at 17:14

Complexity of a random forest with respect to maximum depth

1 Answers1