
As I am only familiar with the basics of decision trees, I would like to ask, at the risk of asking a silly question: is it possible to perform recursive partitioning with the group median as the response/objective?

For example, instead of R's rpart() using means, could a similar tree be built using medians?

I want to do this because the continuous dependent variable I want to examine has a number of outliers that clearly affect the mean values (especially as the number of observations in each node gets smaller). Am I on the right track, or should I be using other kinds of methods? Would preprocessing the data be another alternative (perhaps "capping" the values at an upper limit)?

Figaro
    [This](http://stats.stackexchange.com/questions/2410/can-cart-models-be-made-robust) question is probably relevant, as is the fact that you can specify your own splitting criteria in `rpart`. See the discussion of the `method` argument in `?rpart`. – joran Aug 25 '11 at 20:23
  • @joran: interesting, does anybody have experience with editing this `method`? – Figaro Aug 26 '11 at 06:46
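
Following up on the comment above about user-written splitting criteria: below is a minimal, illustrative sketch of a median/least-absolute-deviation method for `rpart`, assuming the list-of-functions interface (`init`/`eval`/`split`) described in rpart's "User Written Split Functions" vignette. The data frame `dat` and its variables are hypothetical, case weights are ignored, and the split search is a naive loop.

```r
library(rpart)

## init: describe the response; the node label will be the median and the
## node deviance the sum of absolute deviations (SAD)
med.init <- function(y, offset, parms, wt) {
  if (length(offset)) y <- y - offset
  sfun <- function(yval, dev, wt, ylevel, digits) {
    paste("  median=", format(signif(yval, digits)),
          ", SAD=", format(signif(dev, digits)), sep = "")
  }
  environment(sfun) <- .GlobalEnv
  list(y = c(y), parms = NULL, numresp = 1, numy = 1, summary = sfun)
}

## eval: node label and deviance
med.eval <- function(y, wt, parms) {
  m <- median(y)
  list(label = m, deviance = sum(abs(y - m)))
}

## split: goodness of a split = reduction in SAD (always >= 0, because the
## child medians minimise the children's SAD); weights are ignored here
med.split <- function(y, wt, x, parms, continuous) {
  sad <- function(v) sum(abs(v - median(v)))
  parent <- sad(y)
  if (continuous) {
    # y arrives sorted by x; evaluate each of the n - 1 cutpoints
    n <- length(y)
    goodness <- direction <- double(n - 1)
    for (i in 1:(n - 1)) {
      l <- y[1:i]; r <- y[(i + 1):n]
      goodness[i]  <- parent - sad(l) - sad(r)
      direction[i] <- ifelse(median(l) < median(r), -1, 1)
    }
    list(goodness = goodness, direction = direction)
  } else {
    # order the categories by their median response, then split as if ordered
    ux  <- sort(unique(x))
    ord <- order(tapply(y, x, median))
    k <- length(ux)
    goodness <- double(k - 1)
    for (i in 1:(k - 1)) {
      left <- x %in% ux[ord[1:i]]
      goodness[i] <- parent - sad(y[left]) - sad(y[!left])
    }
    list(goodness = goodness, direction = ux[ord])
  }
}

## hypothetical data frame `dat` with response y and predictors x1, x2
fit <- rpart(y ~ x1 + x2, data = dat,
             method = list(init = med.init, eval = med.eval, split = med.split))
```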

3 Answers


You could also pre-process your data, using a transformation such as the spatial sign transformation or the rank-order transformation, to minimize the impact of the outliers.
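
As a minimal sketch of the second option (the rank-order transformation), assuming a hypothetical data frame `dat` with response `y` and predictors `x1`, `x2`: rank the response first and grow an ordinary rpart tree on the ranks, so the splits are no longer dragged around by the outlying values.

```r
library(rpart)

# Rank-order transformation of the (outlier-prone) response
dat$y_rank <- rank(dat$y)

# Ordinary regression tree, but fitted on the ranks instead of the raw values
fit_rank <- rpart(y_rank ~ x1 + x2, data = dat)

# Fitted values are on the rank scale; one way to report them on the
# original scale is to map each fitted rank back to the corresponding
# sample quantile of y
fitted_y <- quantile(dat$y, probs = predict(fit_rank) / nrow(dat))
```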

Zach

Although I have never used it, the quantregForest package seems to do what you want.

Here is the description:

Quantile Regression Forests is a tree-based ensemble method for estimation of conditional quantiles. It is particularly well suited for high-dimensional data. Predictor variables of mixed classes can be handled. The package is dependent on the package randomForest, written by Andy Liaw.

There is also an article accompanying the quantregForest package:

Meinshausen N (2006). “Quantile Regression Forests.” Journal of Machine Learning Research, 7, 983–999.
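
For what it's worth, here is a minimal usage sketch with a hypothetical data frame `dat` split into predictors and response; note that the name of the quantile argument to `predict()` has varied between package versions (`what` in recent ones).

```r
library(quantregForest)

# Hypothetical predictors and response
X <- dat[, c("x1", "x2")]
y <- dat$y

set.seed(1)
qrf <- quantregForest(x = X, y = y, ntree = 1000)

# Conditional median (and an 80% interval) for new observations newX
pred <- predict(qrf, newdata = newX, what = c(0.1, 0.5, 0.9))
```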

Johannes
  • What in the description makes you say that? Is it that random forest methods are generally more stable or is there something specific about this approach that makes it more suitable? – Figaro Aug 26 '11 at 21:17
  • @Figaro Typically, we model the mean of the response; quantile regression methods allow you to model, well, quantiles (e.g. the median), which generally entails a somewhat more robust loss function than squared error. – joran Aug 27 '11 at 20:06

In addition to Johannes's suggestion of quantregForest, there is also an R package called gbm (generalized boosted models), whose quantile loss lets you use boosted trees to estimate conditional quantiles such as the median.
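
A minimal sketch: with gbm's quantile (check) loss, `alpha = 0.5` targets the conditional median. The data frame `dat`, its variables, and the tuning values below are hypothetical placeholders.

```r
library(gbm)

# Boosted regression trees under the quantile loss at alpha = 0.5,
# i.e. modelling the conditional median
fit_gbm <- gbm(y ~ x1 + x2, data = dat,
               distribution = list(name = "quantile", alpha = 0.5),
               n.trees = 1000, interaction.depth = 3, shrinkage = 0.01)

# Predicted conditional medians for new data
pred <- predict(fit_gbm, newdata = newdat, n.trees = 1000)
```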

Andrew