
There are several algorithms that give the relative importance of variables at the OVERALL model level. But the most influential variable overall might not be the reason why a particular row gets a higher or lower score. E.g., different applications get different reasons for credit decline even though they are scored by the same credit risk model. This happens because one variable's value drives the overall score for that particular application down, although this variable might not be the most significant one across all rows at the model level.

I know how to calculate the top influencing variables for each ROW in a logistic/linear regression model (by rank-ordering the product of coefficient and variable value for each variable).
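For reference, a rough sketch of what I mean, assuming a fitted scikit-learn LogisticRegression on toy data (the column names are just placeholders):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy stand-in for a credit-risk feature matrix (names are illustrative)
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X = pd.DataFrame(X, columns=[f"var_{i}" for i in range(6)])
model = LogisticRegression().fit(X, y)

def top_drivers(model, row):
    """Rank features for one row by coefficient * value
    (each feature's contribution to the linear predictor)."""
    contrib = pd.Series(model.coef_[0] * row.values, index=row.index)
    return contrib.reindex(contrib.abs().sort_values(ascending=False).index)

# Largest (absolute) contributions first for the first "application"
print(top_drivers(model, X.iloc[0]))
```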

But how do we calculate the relative importance of variables at each ROW level for other algorithms like neural networks, random forests, etc.?

  • It's the stuff of sports or financial journalism to speculate why a team or firm is successful (or not): sometimes there's a plausible reason (often it's just speculation, or empty: the team did badly because it played badly). Otherwise the statistical focus is on identifying that a particular value is higher or lower than expected from a model for the data. You have to go beyond that model to think about why. There isn't a statistical sense in which you can say that particular variables already in your model are more or less important for particular observations, because the fit is collective. – Nick Cox Jan 13 '16 at 11:02
  • From your comment answering @Nick Cox, I think you don't so much want the importance of each variable at the row level, in the abstract. What you want is to explain in words a decision taken based on the model. Say you reject the loan application if the probability of default, from the model, is larger than a certain cutoff, like 0.01. Then you want an explanation of that decision in legal terms? – kjetil b halvorsen Jan 13 '16 at 12:59
  • Thanks for the reply. Agreed that variable importance is only valid at an overall model level from a statistical point of view. But in practical situations one might need to explain/justify the score. E.g., why a loan application was declined (this is even a legal requirement), or why the sales team should pursue opportunity A rather than opportunity B (whose deal conversion probability score is higher), etc. So we need a good algorithm which, while giving a reasonable justification for the break-down of the score components, is still not contradictory to the assumptions used in statistical model development. – plain_vanilla Jan 13 '16 at 13:04
  • @kjetil: Let's say we have a neural net model to detect fraudulent applications. When I decline an application suspecting fraud, I would also need to explain the #1 reason (variable) for the decline, and the #2 and #3 reasons for the decline. It would be even better if I could say how much more impactful the #1 variable was on the score than the #2 variable. I can do it with logistic/linear regression (as mentioned in my original question), but not with random forest, neural network, etc., which is where I need help. – plain_vanilla Jan 13 '16 at 13:16

2 Answers


What you are looking for is called Local Interpretation: why is the model making one (or some) particular prediction?

One way of doing this is to build a surrogate model which is more interpretable.

Take a look at Local Interpretable Model-agnostic Explanations (LIME), for example, which is implemented in Python by the Skater package.
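As a rough illustration, here is a minimal sketch using the standalone lime package (Skater wraps the same idea); the model, data, and class names are toy stand-ins, not part of any particular workflow:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from lime.lime_tabular import LimeTabularExplainer

# Toy stand-in for the data and the black-box model being explained
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
feature_names = [f"var_{i}" for i in range(8)]
model = RandomForestClassifier(random_state=0).fit(X, y)

explainer = LimeTabularExplainer(
    X,
    feature_names=feature_names,
    class_names=["good", "bad"],
    mode="classification",
)

# LIME fits an interpretable surrogate (a sparse linear model) locally
# around this one row and reports the features driving its prediction
exp = explainer.explain_instance(X[0], model.predict_proba, num_features=3)
print(exp.as_list())
```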

SlyFox

An answer from game theory would be that you should use the Shapley value. In a nutshell, the Shapley value tells you, if you have a company which creates a certain profit, how to share that profit fairly amongst its M employees, where fairness is defined relative to each employee's productivity/value added.

The analogue in machine learning (sticking with binary classification for the sake of explanation, but similar analogues can be drawn for regression) goes as follows. Say that your classifier predicts $P(Y=1|\underline{x})=0.83$, whereas the base rate, i.e. $P(Y=1)$ in your data, is perhaps 0.53. Thus for this example, the probability of Y being equal to 1 is 0.3 higher than the base rate. This is like a company which made 0.3 units of profit and wants to share that profit fairly amongst its M employees (where M is the dimensionality of $\underline{x}$, i.e. the number of features).
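Concretely, in the standard game-theoretic definition (with $v(S)$ denoting what the model predicts when only the subset $S$ of the $M$ features is available), feature $i$ receives the payout

$$\phi_i \;=\; \sum_{S \subseteq \{1,\dots,M\}\setminus\{i\}} \frac{|S|!\,(M-|S|-1)!}{M!}\,\bigl[v(S\cup\{i\}) - v(S)\bigr],$$

and by construction the $\phi_i$ sum to the prediction minus the base rate ($0.83 - 0.53 = 0.3$ in the example above).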

In order to calculate these Shapley values, you must figure out what your classifier would predict if it only had access to each subset of the full set of features (these subsets are known as coalitions), of which there are $2^{M}$, and thus calculating these in practice can be difficult and certainly requires numerical trickery for all but the simplest cases. For the state of the art in this field, see here (disclaimer: I am not an author of this paper, but I do work with some of the authors). For an open source package which takes some shortcuts but is easy to use, try TreeShap.
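As a rough sketch of what this looks like in practice, here is a minimal example with the shap Python package (which implements the TreeSHAP algorithm); the model, data, and column names are toy stand-ins, and the exact shape of the returned array can differ with the model type and shap version:

```python
import pandas as pd
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Toy stand-in for the data and model being explained (names are illustrative)
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X = pd.DataFrame(X, columns=[f"var_{i}" for i in range(8)])
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# TreeSHAP computes exact Shapley values for tree ensembles without
# explicitly enumerating all 2^M coalitions
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # one value per row and per feature

# Contributions for one row; they sum to (model output - explainer.expected_value),
# in the model's margin (log-odds) space for this classifier
row = pd.Series(shap_values[0], index=X.columns)
print(row.reindex(row.abs().sort_values(ascending=False).index))
```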

Note, it is very tempting to interpret Shapley values causally, and this is plainly incorrect. For example, if your feature vector $\underline{x}$ contains two features which are highly correlated, where one causally affects the target and the other does not, they will likely have similar Shapley values (there are ways around this using asymmetric Shapley values, when you know the causal relationships, but Shapley values can't help you determine what's causative and what's correlative if you don't already know). The Shapley value of a feature must be interpreted strictly as "the amount of this example's predicted deviation from the baseline, towards the positive or negative class, that is attributable to this feature".

Even then, this can be somewhat disappointing/counterintuitive. Using the above example again, when you have two features which are highly correlated (perhaps one equals the other plus white noise), they will likely have very similar Shapley values. If, however, you retrained your model with one of these features removed, the remaining feature would simply double its Shapley value.

The main utility, in my opinion, is exactly for the kind of use case you are talking about. Broadly speaking, you know that low-income, low-credit-score individuals are more likely to be rejected for a loan, but it's nice to know, at the individual level, which feature is being used more. This tells you nothing causative; it doesn't tell you how to improve your chances of not defaulting on a loan, but it can tell you how to improve your chances of getting a loan (i.e. should I work on my credit score, or do I need to get a pay rise, before the algorithm changes its mind about me?).

gazza89