
I'm trying to wrap my head around the concept of variable importance (for regression) in the randomForest package in R. Specifically, I'm looking for a mathematical definition of how the importance measures are calculated, in particular the IncNodePurity measure.

When I run `?importance`, the randomForest documentation states:

The second measure (i.e., IncNodePurity) is the total decrease in node impurities from splitting on the variable, averaged over all trees. For classification, the node impurity is measured by the Gini index. For regression, it is measured by residual sum of squares.

So, if I am interpreting it correctly, for regression, the measure is the total decrease in the residual sum of squares (RSS) after splitting on the variable.
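To make that interpretation concrete (this is my own notation, not an official definition), the decrease for a single split $s$ of a parent node into left and right children would be:

$\Delta RSS(s) = RSS_{parent} - (RSS_{left} + RSS_{right})$

where $RSS_{node} = \sum_{i \in node}(y_{i}-\bar{y}_{node})^2$ and $\bar{y}_{node}$ is the mean response in that node.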

Can anyone help me find a mathematical definition of this method, so I can help clarify this concept in my mind? I have searched quite a bit and although there are a lot of explanations on the internet, no one seems to define this method mathematically.

Would I be correct in saying that it is the difference in MSE measured before and after a split? If the MSE is given by:

$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_{i}-y_{i}^p)^2$

and $ \Delta i$ is the decrease from splitting:

$\Delta i = MSE_{before} - MSE_{after}$

The impurity decrease is recorded for all nodes $n$ and all trees $T$, so the total importance would be given by something like:

$IMP = \sum_{T} \sum_{n} \Delta i(n,T)$
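To check my understanding of the arithmetic, here is a small sketch in Python. This is purely illustrative of the formula above, not the randomForest source code, and all names and data are my own:

```python
# Illustrative sketch of a per-split impurity decrease for regression,
# using RSS as the node impurity (as the documentation describes).

def rss(y):
    """Residual sum of squares around the node mean."""
    mean = sum(y) / len(y)
    return sum((v - mean) ** 2 for v in y)

def rss_decrease(y_parent, y_left, y_right):
    """Impurity decrease for one split: parent RSS minus summed child RSS."""
    return rss(y_parent) - (rss(y_left) + rss(y_right))

# Toy example: a node split cleanly into low and high responses.
y = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
left, right = y[:3], y[3:]
delta = rss_decrease(y, left, right)  # 24.16 - (0.08 + 0.08) = 24.0

# My reading: the importance of a variable would then accumulate such
# deltas over every split made on that variable, across all trees.
```

The point of the sketch is that each split's contribution is the parent's impurity minus the combined impurity of its children, which is always non-negative for an RSS-based split.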

I'm basing this on information I found stating that this importance measure is analogous to the Gini index.

Some discussion relating this importance measure to MSE can be found here: In a random forest, is larger %IncMSE better or worse?

and here: Measures of variable importance in random forests

Electrino
  • I wouldn't agree that my question is a duplicate because in the examples you provide, none of them give a mathematical definition of the `IncNodePurity` measure for regression. – Electrino Aug 10 '19 at 12:18
  • Soren's answer in the link question is perfectly fine I think. `mse0` is the usual [MSE](https://en.wikipedia.org/wiki/Mean_squared_error) in the context of regression. It is just the case that the measure is not renamed as `IncNodeMSE` in the case of regression but aside that everything follows in the same way. For example see: [Breiman (2001)](https://link.springer.com/article/10.1023/A:1010933404324) in section 11 where it directly deals with MSE as the metric used for the generalisation error. – usεr11852 Aug 16 '19 at 23:11
  • I'm slightly confused... are you saying that MSE is calculated before a split and again after a split... and the difference between them is summed over all splits for that variable, over all trees? – Electrino Aug 17 '19 at 15:55
  • Yes, of course. How else would we know how much better we do after a particular split? To that extent, certain boosting algorithms that employ regression trees as their base learners (e.g. LightGBM) have a "minimum gain to split" attribute when training, exactly so they regularise splits that might overfit. – usεr11852 Aug 17 '19 at 19:19
  • Sorry for being explicit but would I be right in saying that the formula I give in the question is how `IncNodePurity` is calculated, except using MSE instead of RSS... and then summed over all splits and trees? – Electrino Aug 17 '19 at 20:09
  • The formula provided is problematic because 1. it assumes a linear combination of features and 2. it does not distinguish between pre- and post-split. As it stands it should always evaluate to 0. But yes, we need to account for all splits across all trees. – usεr11852 Aug 17 '19 at 20:17
  • I've edited the question, would the method I've outlined now be more in line with what is happening? – Electrino Aug 18 '19 at 18:06
  • Yes, it would be. – usεr11852 Aug 18 '19 at 18:10
