
As I am only familiar with the basics of decision trees, I would like to ask, at the risk of asking a silly question: is it possible to perform recursive partitioning with the group median as the response/objective?

For example, instead of R's rpart() using means, could a similar tree be built using medians?

I want to do this because the continuous dependent variable I want to examine has a number of outliers that clearly affect the mean values (especially as the number of observations in each node gets smaller). Am I on the right track, or should I be using other kinds of methods? Would preprocessing the data be another alternative (perhaps "capping" the values at an upper limit)?

Figaro
    [This](http://stats.stackexchange.com/questions/2410/can-cart-models-be-made-robust) question is probably relevant, as is the fact that you can specify your own splitting criteria in `rpart`. See the discussion of the `method` argument in `?rpart`. – joran Aug 25 '11 at 20:23
  • @joran: interesting, does anybody have experience with editing this `method`? – Figaro Aug 26 '11 at 06:46
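
Following up on the comment above about user-written splitting criteria: below is a minimal, illustrative sketch of a median/least-absolute-deviation method for `rpart`, assuming the list-of-functions interface (`init`/`eval`/`split`) described in rpart's "User Written Split Functions" vignette. The data frame `dat` and its variables are hypothetical, case weights are ignored, and the split search is a naive loop.

```r
library(rpart)

## init: describe the response; the node label will be the median and the
## node deviance the sum of absolute deviations (SAD)
med.init <- function(y, offset, parms, wt) {
  if (length(offset)) y <- y - offset
  sfun <- function(yval, dev, wt, ylevel, digits) {
    paste("  median=", format(signif(yval, digits)),
          ", SAD=", format(signif(dev, digits)), sep = "")
  }
  environment(sfun) <- .GlobalEnv
  list(y = c(y), parms = NULL, numresp = 1, numy = 1, summary = sfun)
}

## eval: node label and deviance
med.eval <- function(y, wt, parms) {
  m <- median(y)
  list(label = m, deviance = sum(abs(y - m)))
}

## split: goodness of a split = reduction in SAD (always >= 0, because the
## child medians minimise the children's SAD); weights are ignored here
med.split <- function(y, wt, x, parms, continuous) {
  sad <- function(v) sum(abs(v - median(v)))
  parent <- sad(y)
  if (continuous) {
    # y arrives sorted by x; evaluate each of the n - 1 cutpoints
    n <- length(y)
    goodness <- direction <- double(n - 1)
    for (i in 1:(n - 1)) {
      l <- y[1:i]; r <- y[(i + 1):n]
      goodness[i]  <- parent - sad(l) - sad(r)
      direction[i] <- ifelse(median(l) < median(r), -1, 1)
    }
    list(goodness = goodness, direction = direction)
  } else {
    # order the categories by their median response, then split as if ordered
    ux  <- sort(unique(x))
    ord <- order(tapply(y, x, median))
    k <- length(ux)
    goodness <- double(k - 1)
    for (i in 1:(k - 1)) {
      left <- x %in% ux[ord[1:i]]
      goodness[i] <- parent - sad(y[left]) - sad(y[!left])
    }
    list(goodness = goodness, direction = ux[ord])
  }
}

## hypothetical data frame `dat` with response y and predictors x1, x2
fit <- rpart(y ~ x1 + x2, data = dat,
             method = list(init = med.init, eval = med.eval, split = med.split))
```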

3 Answers


You could also pre-process your data, using a transformation such as the spatial sign transformation or the rank-order transformation, to minimize the impact of the outliers.
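
As a minimal sketch of the second option (the rank-order transformation), assuming a hypothetical data frame `dat` with response `y` and predictors `x1`, `x2`: rank the response first and grow an ordinary rpart tree on the ranks, so the splits are no longer dragged around by the outlying values.

```r
library(rpart)

# Rank-order transformation of the (outlier-prone) response
dat$y_rank <- rank(dat$y)

# Ordinary regression tree, but fitted on the ranks instead of the raw values
fit_rank <- rpart(y_rank ~ x1 + x2, data = dat)

# Fitted values are on the rank scale; one way to report them on the
# original scale is to map each fitted rank back to the corresponding
# sample quantile of y
fitted_y <- quantile(dat$y, probs = predict(fit_rank) / nrow(dat))
```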

Zach

Although I have never used it, the quantregForest package seems to do what you want.

Here is the description:

Quantile Regression Forests is a tree-based ensemble method for estimation of conditional quantiles. It is particularly well suited for high-dimensional data. Predictor variables of mixed classes can be handled. The package is dependent on the package randomForest, written by Andy Liaw.

There is also an article accompanying the quantregForest package:

Meinshausen N (2006). “Quantile Regression Forests.” Journal of Machine Learning Research, 7, 983–999.
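
For what it's worth, here is a minimal usage sketch with a hypothetical data frame `dat` split into predictors and response; note that the name of the quantile argument to `predict()` has varied between package versions (`what` in recent ones).

```r
library(quantregForest)

# Hypothetical predictors and response
X <- dat[, c("x1", "x2")]
y <- dat$y

set.seed(1)
qrf <- quantregForest(x = X, y = y, ntree = 1000)

# Conditional median (and an 80% interval) for new observations newX
pred <- predict(qrf, newdata = newX, what = c(0.1, 0.5, 0.9))
```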

Johannes
  • What in the description makes you say that? Is it that random forest methods are generally more stable or is there something specific about this approach that makes it more suitable? – Figaro Aug 26 '11 at 21:17
  • @Figaro Typically, we model the mean of the response; quantile regression methods allow you to model, well, quantiles (e.g. the median), which generally entails a somewhat more robust loss function than squared error. – joran Aug 27 '11 at 20:06

In addition to Johannes's suggestion of quantregForest, there is also an R package called gbm (generalized boosted models), whose quantile loss lets you use boosted trees to estimate conditional quantiles such as the median.
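
A minimal sketch: with gbm's quantile (check) loss, `alpha = 0.5` targets the conditional median. The data frame `dat`, its variables, and the tuning values below are hypothetical placeholders.

```r
library(gbm)

# Boosted regression trees under the quantile loss at alpha = 0.5,
# i.e. modelling the conditional median
fit_gbm <- gbm(y ~ x1 + x2, data = dat,
               distribution = list(name = "quantile", alpha = 0.5),
               n.trees = 1000, interaction.depth = 3, shrinkage = 0.01)

# Predicted conditional medians for new data
pred <- predict(fit_gbm, newdata = newdat, n.trees = 1000)
```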

Andrew