A way to gauge how useful a predictor $x_j$ is within a given model $M$ is to compare the performance of $M$ with and without $x_j$ included (call the latter model $M^{-x_j}$). With multiple predictors, though, we would have to fit $p$ different $M^{-x_j}$ models, one per predictor. The cost of this re-training procedure quickly becomes prohibitively high.
The point of permuting a predictor is to approximate the situation where we use the model $M$ to make a prediction but do not have the information in $x_j$. Scrambling should destroy all (ordering) information in $x_j$, so we land in a situation where $x_j$ is artificially corrupted. We can then compare the performance of $M$ when using the pristine predictor $x_j$ with its performance when using the scrambled version; this approximates what would happen if we had little to no information about $x_j$, without having to retrain a model $M^{-x_j}$.
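As a concrete illustration, here is a minimal sketch of this permute-and-compare procedure using a scikit-learn random forest; the synthetic data, model settings, and $R^2$ metric are my own assumptions purely for demonstration:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Toy setup (assumed for illustration): a fitted forest M and a held-out validation set.
X, y = make_regression(n_samples=500, n_features=5, n_informative=3, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
M = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

baseline = r2_score(y_val, M.predict(X_val))  # performance with the pristine predictors
rng = np.random.default_rng(0)

for j in range(X_val.shape[1]):
    X_perm = X_val.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])   # scramble only column j
    permuted = r2_score(y_val, M.predict(X_perm))  # performance with x_j corrupted
    print(f"x_{j}: importance estimate = {baseline - permuted:.3f}")
```

The drop `baseline - permuted` is the permutation importance estimate for $x_j$: no retraining happens, only repeated prediction with one column shuffled at a time.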
So to recap and answer your questions above:
- Scrambling corrupts the information in a predictor $x_j$ and thus lets us treat $x_j$ as if its information were missing.
- Trees (the archetypal base learners for random forests) rely strongly on the ordering induced by an explanatory variable $x_j$ when making a prediction. By permuting $x_j$ we feed no (or outright wrong) information about $x_j$ into our random forest model $M$ when making predictions, so we should see a knock-on effect on performance. If we saw no performance difference, that would strongly indicate that $x_j$ is not really used.
- It is an approximation of variable importance. The mental rule-of-thumb is that "the more important a variable is, the more impactful it should be on model performance". Of course this is a working assumption; there are a number of things that can go wrong (see the last discussion below), but it is not unfounded.
Notice that permutation importance does break down in situations where we have correlated predictors and can give spurious results (e.g. see Nicodemus et al. (2010), The behaviour of random forest permutation-based variable importance measures under predictor correlation, for a more in-depth discussion). I would suggest not relying on a single variable importance metric. For example, we can easily compute importance based on relative gains and on the number of times a variable is used for splits, as well as look at SHAP-based variable importances; a sketch of such a comparison follows below. This gives us a more holistic view. To paraphrase a great one: "all importance metrics are wrong but some are useful". A more recent exposition can be found in Please Stop Permuting Features: An Explanation and Alternatives (2019) by Hooker and Mentch (though it is not yet formally peer-reviewed).
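For instance, continuing the toy sketch above (reusing the assumed `M`, `X_val`, `y_val`), one could place the impurity/gain-based, permutation-based, and SHAP-based importances side by side; the optional `shap` package is an extra dependency here:

```python
import numpy as np
import shap
from sklearn.inspection import permutation_importance

# Gain/impurity-based importance built into the fitted forest.
gain_imp = M.feature_importances_

# Permutation importance, averaged over several shuffles to reduce noise.
perm_imp = permutation_importance(M, X_val, y_val, n_repeats=10,
                                  random_state=0).importances_mean

# SHAP-based importance: mean absolute SHAP value per feature.
shap_values = shap.TreeExplainer(M).shap_values(X_val)
shap_imp = np.abs(shap_values).mean(axis=0)

for j in range(X_val.shape[1]):
    print(f"x_{j}: gain={gain_imp[j]:.3f}  perm={perm_imp[j]:.3f}  shap={shap_imp[j]:.3f}")
```

If the three metrics broadly agree on which variables matter, that is reassuring; if they disagree sharply (as can happen with correlated predictors), it is a sign to dig deeper rather than trust any single number.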