Random Forest Regression - Coping with extreme values

Question

I'm not sure whether I'm using the term "extreme values" correctly. Anyhow, I'm trying to build a model that estimates maximum tree height per $\text{km}^2$. I have a database of around 24000 points (each covering one $\text{km}^2$); each point has the maximum tree height and 33 predictors. After playing around with random forest, I managed to achieve a correlation of 0.67 between the real height and the estimated height on the test sample (20% of the data). The MSE is around 1.6 meters, but the maximum errors reach 33 meters. What I can see is that patches with very tall or very short trees (around 50 meters or around 1 meter) are outside the scope of the model. Thinking in terms of linear regression, this seems analogous to losing predictive power as you move away from the center of gravity of the observations. Is that right? How can I cope with this, if at all?

p.s. this was implemented in R
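
One way to make the linear-regression analogy concrete, assuming a standard regression forest whose leaves predict the mean of their training responses: every prediction is a weighted average of the training heights. The weights $w_i(x)$ are notation introduced here for illustration; they are not taken from the model described above.

$$\hat{y}(x) = \sum_{i=1}^{n} w_i(x)\, y_i, \qquad w_i(x) \ge 0, \qquad \sum_{i=1}^{n} w_i(x) = 1$$

It follows that $\min_i y_i \le \hat{y}(x) \le \max_i y_i$. Under this assumption, a test patch near 50 meters or 1 meter that shares its leaves with less extreme training patches will tend to be predicted closer to the bulk of the data.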

Might it be a case of overfitting? Try reducing the tree depth. Have a look at Criminisi et al., "Decision Forests: A Unified Framework for Classification, Regression, Density Estimation, Manifold Learning and Semi-Supervised Learning". — Simone, Dec 12 '12 at 02:32
Does the training set have patches with very tall or very short trees? What does the model predict for those patches? — David J. Harris, Dec 12 '12 at 02:35
Yes, it does. On the training set the model predicts those patches very well (no surprise). But those types of patches in the test set are exactly the ones that have "big" errors. I don't think it is a problem of overfitting, but I will try playing with the node parameter and get back to you. — JEquihua, Dec 12 '12 at 02:58
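
The question of what the model predicts for the extreme patches can be checked by summarising test-set errors within bins of the observed height. The following is a minimal sketch, written in Python with the standard library only even though the model itself was fitted in R. The function name `binned_errors`, the bin edges, and the numbers are all hypothetical and are not the poster's data.

```python
import math

def binned_errors(observed, predicted, edges):
    """Summarise prediction error within bins of the observed value.

    edges: ascending bin boundaries; bin k covers [edges[k], edges[k+1]),
    and the last bin also includes its upper edge.
    Returns one dict per bin with the count, the bias
    (mean of predicted - observed) and the RMSE.
    """
    rows = []
    n_bins = len(edges) - 1
    for k in range(n_bins):
        low, high = edges[k], edges[k + 1]
        last = (k == n_bins - 1)
        errs = [p - o for o, p in zip(observed, predicted)
                if low <= o < high or (last and o == high)]
        if errs:
            bias = sum(errs) / len(errs)
            rmse = math.sqrt(sum(e * e for e in errs) / len(errs))
        else:
            # Empty bin: no errors to summarise.
            bias = rmse = float("nan")
        rows.append({"low": low, "high": high, "n": len(errs),
                     "bias": bias, "rmse": rmse})
    return rows

# Made-up heights in meters, for illustration only.
observed = [1.0, 3.0, 18.0, 20.0, 22.0, 45.0, 50.0]
predicted = [8.0, 9.0, 19.0, 20.0, 21.0, 30.0, 32.0]
for row in binned_errors(observed, predicted, edges=[0, 10, 30, 50]):
    print(row)
```

A positive bias in the lowest bin together with a negative bias in the highest bin would indicate that predictions are being pulled toward the middle of the height range, which is a different pattern from errors that are large but unsystematic.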
