
I am looking for a comparison of different node-splitting approaches for regression trees within the random forest framework. Specifically, I am interested in the trade-off between ensemble accuracy/reliability (holding forest size constant) and the computational complexity of the split, since I deal with large datasets.

The standard approach is to minimise $$\sum_{i \in R_1} (y_i - \bar{y}_{R_1})^2 + \sum_{i \in R_2} (y_i - \bar{y}_{R_2})^2,$$ where $R_1$ and $R_2$ are the two candidate regions and $\bar{y}_{R}$ is the mean response in region $R$. But another approach I have seen in a PhD thesis, which drastically reduces the computational cost, is to instead maximise $$\frac{\left(\sum_{i \in R_1} y_i\right)^2}{|R_1|} + \frac{\left(\sum_{i \in R_2} y_i\right)^2}{|R_2|};$$ this allows me to use a running sum of $y_i$.
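
To make the running-sum idea concrete, here is a minimal sketch (my own illustration, not taken from the thesis) that scores every split point of a single pre-sorted feature using only a running sum of $y_i$. Maximising this score is equivalent to minimising the SSE criterion above, because $\sum_i y_i^2$ over the parent node does not depend on where we split:

```python
import numpy as np

def best_split_running_sum(x, y):
    """Scan every split of one numeric feature using only a running sum of y.
    Maximising (sum_L y)^2/|L| + (sum_R y)^2/|R| is equivalent to minimising
    the within-region SSE, since sum(y^2) over the node is constant."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    n, total = len(ys), ys.sum()

    left_sum, best_score, best_thr = 0.0, -np.inf, None
    for i in range(n - 1):                      # candidate split after position i
        left_sum += ys[i]
        right_sum = total - left_sum
        score = left_sum**2 / (i + 1) + right_sum**2 / (n - i - 1)
        if score > best_score:
            best_score = score
            best_thr = (xs[i] + xs[i + 1]) / 2  # midpoint threshold
    return best_thr, best_score

# e.g. best_split_running_sum(np.array([3., 1., 2., 5.]), np.array([1., 0., 1., 3.]))
```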

Jase

1 Answer


You can compute the variance at the same cost as the averages. Simply put, you have to use an on-line algorithm for computing variance. It is explained very clearly in an article by John D. Cook. I used this online computation of variance myself in the split-numeric method of a regression tree. I computed two running statistics, one starting from the left and one starting from the right. It is possible to do that in a single pass; however, multiplication by a constant still gives linear time, so that does not bother me.
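
For illustration only (this is my own sketch, not the actual code from my regression tree), here is Welford's online variance update combined with one statistic accumulated from the left and one precomputed from the right, so that every candidate split of a numeric feature is scored in O(1) after sorting:

```python
import numpy as np

class OnlineVariance:
    """Welford's online algorithm: update the mean and the sum of squared
    deviations (m2) one observation at a time, in O(1) per update."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, y):
        self.n += 1
        delta = y - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (y - self.mean)  # m2 = sum of squared deviations

    @property
    def sse(self):
        return self.m2                      # within-node sum of squared errors

def best_split_by_variance(x, y):
    """Score all splits of one feature: the left statistic grows from the left,
    the right statistics are precomputed from the right, so each candidate
    split costs O(1)."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    n = len(ys)

    # pass from the right: sse_right[i] = SSE of ys[i:]
    right = OnlineVariance()
    sse_right = np.zeros(n + 1)
    for i in range(n - 1, -1, -1):
        right.update(ys[i])
        sse_right[i] = right.sse

    # pass from the left, scoring each split as SSE(left) + SSE(right)
    left = OnlineVariance()
    best_score, best_thr = np.inf, None
    for i in range(n - 1):
        left.update(ys[i])
        score = left.sse + sse_right[i + 1]
        if score < best_score:
            best_score = score
            best_thr = (xs[i] + xs[i + 1]) / 2
    return best_thr, best_score
```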

For more information, you can also take a look at Wikipedia.

rapaio
  • Very nice. What do you mean by 'constant time'? Wouldn't it be linear time? – Jase Feb 05 '14 at 07:34
  • Sorry, I meant multiplication by a constant, since I compute 2 running statistics. I will correct that in the answer. – rapaio Feb 05 '14 at 07:39
  • How do you handle the large amounts of noise in the variance estimate in the first few iterations (whether starting from front or back of the vector)? – Jase Feb 06 '14 at 15:12
  • @Jase I do not have an answer. What I usually do in practice is to adjust the parameter which decides the minimum number of instances in a node. If there is plenty of data, then there is no problem with having a large value for this minimum. If not, then I try to take a look at my data and understand from the underlying distribution what a proper value for this parameter would be. I imagine one could automate that by estimating a confidence interval for the variance estimate, but I have never tried to work on that. – rapaio Feb 06 '14 at 16:07
  • The 'regression tree.' link is invalid. – tidy Jan 29 '15 at 03:37
  • @honeytidy problem solved – rapaio Jan 29 '15 at 07:40