I'm trying to wrap my head around the concept of variable importance (for regression) from the randomForest package in R. I'm looking for a mathematical definition of how the importance measures are calculated, specifically the IncNodePurity measure.
When I run ?importance, the randomForest package documentation states:
The second measure (i.e., IncNodePurity) is the total decrease in node impurities from splitting on the variable, averaged over all trees. For classification, the node impurity is measured by the Gini index. For regression, it is measured by residual sum of squares.
So, if I am interpreting it correctly, for regression, the measure is the total decrease in the residual sum of squares (RSS) after splitting on the variable.
Can anyone help me find a mathematical definition of this method, so I can help clarify this concept in my mind? I have searched quite a bit and although there are a lot of explanations on the internet, no one seems to define this method mathematically.
Would I be correct in saying that it is the difference in MSE measured both before and after a split? If the MSE is given by:
$MSE = \frac{1}{n}\sum_{i=1}^n(y_{i}-y_{i}^p)^2$
and $ \Delta i$ is the decrease from splitting:
$\Delta i = MSE_{before} - MSE_{after}$
The decrease in impurity from each split, recorded over all nodes (n) and all trees (T), would then be given by something like:
$IMP = \sum_{T} \sum_{n} \Delta i(n,T)$
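To make the formula above concrete, here is a minimal sketch of the quantity I have in mind. It is written in Python only for illustration and is my own reading, not the randomForest package's actual code. It assumes the impurity of a node is the residual sum of squares around the node mean (as in the quoted documentation, rather than MSE), and the helper names `rss`, `split_decrease` and `total_decrease` are hypothetical.

```python
import numpy as np

def rss(y):
    """Residual sum of squares of y around its own mean (the node's prediction)."""
    y = np.asarray(y, dtype=float)
    return float(np.sum((y - y.mean()) ** 2)) if y.size else 0.0

def split_decrease(y, go_left):
    """Decrease in RSS when a node holding y is split by the boolean mask go_left."""
    y = np.asarray(y, dtype=float)
    go_left = np.asarray(go_left, dtype=bool)
    return rss(y) - (rss(y[go_left]) + rss(y[~go_left]))

def total_decrease(trees):
    """Sum the split decreases over all nodes in all trees.

    Assumption: each tree is a list of (y, go_left) pairs, one pair per
    node that splits on the variable of interest.
    """
    return sum(split_decrease(y, mask) for tree in trees for (y, mask) in tree)

# Toy node: four responses, split into two well-separated halves.
y = [1.0, 2.0, 9.0, 10.0]
mask = [True, True, False, False]

delta = split_decrease(y, mask)        # RSS before = 65, after = 0.5 + 0.5
trees = [[(y, mask)], [(y, mask)]]     # two toy "trees" with one split each
imp_sum = total_decrease(trees)        # double sum over trees and nodes
imp_avg = imp_sum / len(trees)         # "averaged over all trees"
```

In this toy case the single split reduces the RSS from 65 to 1, so the decrease is 64. Summing over the two identical trees gives 128, and averaging over trees brings it back to 64.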
I'm basing this on information I found stating that this importance measure is analogous to the Gini index.
Some discussion relating this importance measure to MSE can be found here: "In a random forest, is larger %IncMSE better or worse?"