Feature Importance is, generally speaking, not uniquely defined, even though we all have some sort of intuitive understanding of what it means. The original poster talks about it in terms of improving the precision/accuracy of predicting Y.
Shapley Values
Let us assume (without any loss of generality, but to help us think about the problem clearly) that all features contribute to accuracy in a measurably positive way. Often this is not the case: if a feature contains zero or very little information, it may increase the model's capacity to overfit more than it adds value, and you're better off removing it.
Then, intuitively, you might ask "how much accuracy do I lose if I remove any one of the features?", and surely the feature whose removal makes your accuracy drop the most is the most important one.
If you do this, however, you run into some slightly unintuitive issues. If features A and B are highly correlated, removing either one might have very little effect on the classifier's accuracy, but removing both might be catastrophic. This is where game theorists come in and say that you should use Shapley values. The analogy is as follows:
Let's say you have balanced binary classes, so a trivial baseline gets 50% accuracy, but your ML algorithm gets a test accuracy of 85%. It is thus adding 35 percentage points over the baseline, and those 35 points of reward are to be shared amongst the features you have. Shapley Values tell you how to fairly share the profits of a company amongst your employees. They require you to calculate how much money the company would have made had every possible "coalition" of the workforce been present (basically every combination of employees, from one up to all of them; there are $2^N$ such coalitions). In the ML case, you look at removing every combination of features and see where you land in test accuracy (presumably somewhere between 50% and 85%).
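In symbols, if $N$ is the full set of players (features) and $v(S)$ is the reward earned by coalition $S$ (here, the test accuracy above baseline using only the features in $S$), the Shapley value of feature $i$ is its average marginal contribution over all coalitions that exclude it:

$$\phi_i \;=\; \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,\bigl(|N|-|S|-1\bigr)!}{|N|!}\,\bigl(v(S \cup \{i\}) - v(S)\bigr)$$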
So Shapley values will give you a "fair" view of how much each feature is contributing.
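As a concrete illustration, here is a minimal brute-force sketch of that procedure: retrain a model on every feature subset and average the marginal contributions. The dataset, model and metric (scikit-learn's breast-cancer data, a logistic regression, accuracy over a majority-class baseline) are my own illustrative assumptions, and the exhaustive loop is only feasible for a handful of features.

```python
from itertools import combinations
from math import factorial

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X = X[:, :6]                      # keep K small: the loop below trains 2^K models
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

K = X.shape[1]
baseline = max(y_te.mean(), 1 - y_te.mean())   # majority-class accuracy

def value(subset):
    """v(S): test-accuracy gain over the baseline when training only on `subset`."""
    if len(subset) == 0:
        return 0.0
    cols = list(subset)
    model = LogisticRegression(max_iter=5000).fit(X_tr[:, cols], y_tr)
    return accuracy_score(y_te, model.predict(X_te[:, cols])) - baseline

# Cache v(S) for all 2^K coalitions of features.
v = {frozenset(S): value(S)
     for r in range(K + 1)
     for S in combinations(range(K), r)}

def shapley(i):
    """Weighted average marginal contribution of feature i over all coalitions."""
    total = 0.0
    for S, val in v.items():
        if i in S:
            continue
        weight = factorial(len(S)) * factorial(K - len(S) - 1) / factorial(K)
        total += weight * (v[S | {i}] - val)
    return total

for i in range(K):
    print(f"feature {i}: {shapley(i):+.4f}")
```

The Shapley values of all features sum to the full model's accuracy gain over the baseline, which is exactly the "fair division of the reward" interpretation above.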
Comparing Shapley Values or Feature Importances of Different Models
Now, to your second question, which I will paraphrase here: I've trained two models and one performs better than the other. Does that mean the better-performing model is the better one to use for measuring feature importance?
An interesting question, for sure. Shapley Values are model-agnostic: you can apply the procedure (in theory, anyway) to any model. Shapley Values do not, however, tell you deep truths about the data; they tell you about the model. If feature X has a high Shapley Value according to model A and a lower one according to model B, that simply means that "model A is making more use of feature X to improve its accuracy than model B is".
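To illustrate, the sketch below applies the same model-agnostic procedure to two different models. I've swapped in scikit-learn's permutation importance as a cheaper stand-in for exact Shapley values, and the dataset and models are again just illustrative assumptions; the point is only that the rankings describe each model, not the data.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model_a = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
model_b = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

for name, model in [("model A", model_a), ("model B", model_b)]:
    # Permutation importance: accuracy drop when each feature is shuffled.
    result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
    top = np.argsort(result.importances_mean)[::-1][:3]
    print(name, "top features:", top, result.importances_mean[top].round(3))
```

Expect the two models to rank features differently even at similar accuracy; neither ranking is "the truth about the data".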
Feature Importance Vs Causation
This brings us to a general point about feature importance. People frequently ask about it and have a general intuitive view of what they mean by it, but when you drill into it, they are often looking for an answer that standard supervised learning cannot provide.
Your original question implies there is some "true" feature importance out there, which is model independent. Consider the following:
Model 1 and Model 2 predict the weather tomorrow. They both have access to a tonne of data. Model 1 decides that the most important features are today's weather, and some long term seasonal average data. Model 2 decides that the most important features are the wind direction, and the current weather conditions in a few places a few hundred miles away. They get similar accuracy, so which do you trust?
Model 1 is making no attempt to get at the causal mechanism; it's saying something like "I'm going to predict the seasonal average, making adjustments for today's weather", whereas Model 2 is saying "check the wind direction and see what's coming towards us from upwind". Clearly the latter is much closer to getting at the true cause, so would you trust its feature importance more? What if Model 1 is more accurate? Do you change your mind and say its feature importances are the true causal mechanisms?
Generally speaking, in ML prediction problems where you have no intuition about the causal mechanisms, it's not correct (although it's tempting, especially when under pressure from non-domain experts) to interpret feature importances in a causal way. Don't do it; it's almost invariably incorrect.
Is Feature Importance a Useful Concept at all?
Feature Importances are specific to a model; they don't necessarily give you general properties of the mechanisms which generate the data. They tell you how a model does what it does, not how nature does what it does.
So, with causality out of the window, what else can they be good for? Well, my personal opinion is that they're not that useful at all. The main use case is to show them to end users of ML systems, who are often not domain experts, in order to make your work slightly less of a black box and to build trust. This can sometimes backfire: they will interpret them through a causal lens, yet clearly non-causal features can have high feature importances.
Low feature importance can be a useful indicator: it could tell you, before deploying your POC model, not to worry about certain features, saving you from building production data pipelines for them. Still, low feature importance is sufficient but not necessary for a feature to be safely dropped. With most feature importance methods, you can have a feature with a high feature importance which, if removed, would still not make much of a difference, as other features would take its place. If you want to use feature importance to learn which features you could do without, you probably can't do much better in practice than removing features one by one and seeing which makes the most difference (this scales like $K^2$, where $K$ is the number of features)... unless you have enough compute power to investigate all possible coalitions of features, see which ones perform well, and then decide on the trade-off between performance and difficulty of deployment (this scales like $2^K$).
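Here is a minimal sketch of the remove-one-at-a-time idea as greedy backward elimination, under the same illustrative dataset/model assumptions as before; the 0.01 accuracy tolerance is an arbitrary placeholder, not a recommendation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def score(cols):
    """Test accuracy of a model trained only on the columns in `cols`."""
    model = LogisticRegression(max_iter=5000).fit(X_tr[:, cols], y_tr)
    return accuracy_score(y_te, model.predict(X_te[:, cols]))

remaining = list(range(X.shape[1]))
full_score = score(remaining)

while len(remaining) > 1:
    # Score every candidate removal, then drop the least damaging feature.
    candidates = [(score([c for c in remaining if c != f]), f) for f in remaining]
    best_score, dropped = max(candidates)
    if best_score < full_score - 0.01:   # stop once removals start to hurt
        break
    remaining.remove(dropped)
    print(f"dropped feature {dropped}, accuracy {best_score:.3f}")

print("kept features:", remaining)
```

Each pass retrains one model per remaining feature, which is where the roughly $K^2$ retrainings come from.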
One final use for feature importance is in debugging. If you're getting weird results, it's good to know which features are contributing the most; these are the ones you should investigate first for possibly corrupted data or buggy input pipelines.