Understanding which features were most important for logistic regression

Question

I've built a logistic regression classifier that is very accurate on my data. Now I want to understand better why it is working so well. Specifically, I'd like to rank which features are making the biggest contribution (which features are most important) and, ideally, quantify how much each feature is contributing to the accuracy of the overall model (or something in this vein). How do I do this?

My first thought was to rank them based on their coefficient, but I suspect this can't be right. If I have two features that are equally useful, but the spread of the first is ten times as large as the second, then I'd expect the first to receive a lower coefficient than the second. Is there a more reasonable way to evaluate feature importance?

Note that I'm not trying to understand how much a small change in the feature affects the probability of the outcome. Rather, I'm trying to understand how valuable each feature is, in terms of making the classifier accurate. Also, my goal is not so much to perform feature selection or construct a model with fewer features, but to try to provide some "explainability" for the learned model, so the classifier isn't just an opaque black-box.

I'd throw in that Random forests is also a good technique here. You can examine the top splits over the forest to gain intuition on which features contribute the most to the prediction. — , Feb 29 '16 at 02:44

Frank Harrell · Answer 1 · 2021-11-25T21:59:44.047

The first thing to note is that you don't use logistic regression as a classifier. The fact that $Y$ is binary has absolutely nothing to do with using this maximum likelihood method to actually classify observations. Once you get past that, concentrate on the gold standard information measure which is a by-product of maximum likelihood: the likelihood ratio $\chi^2$ statistic. You can produce a chart showing the partial contribution of each predictor in terms of its partial $\chi^2$ statistic. These statistics have maximum information/power. You can use the bootstrap to show how hard it is to pick "winners" and "losers" by getting confidence intervals on the ranks of the predictive information provided by each predictor once the other predictors are accounted for. An example is in Section 5.4 of my course notes - click on Handouts.

If you have highly correlated features you can do a "chunk test" to combine their influence. A chart that does this is given in Figure 15.11 where size represents the combined contribution of 4 separate predictors.
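
For reference, one common way to write the partial likelihood ratio statistic is sketched below. The notation is an assumption introduced here and is not taken from the course notes: $\ell(\cdot)$ is the maximised log-likelihood and $S$ is the set of predictors being tested.

$$
\chi^2_{S} = 2\left[\,\ell(\text{full model}) - \ell(\text{model without the predictors in } S)\,\right]
$$

With $S = \{j\}$ this is the partial $\chi^2$ for a single predictor. With $S$ a group of correlated predictors it is the chunk test, and its degrees of freedom equal the number of parameters removed.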

Thanks for the link to the course notes! The current link to your course notes seems to be broken (http://biostat.mc.vanderbilt.edu/RmS#Materials) and the correct one seems to be: https://hbiostat.org/doc/rms.pdf. — iamyojimbo, Nov 25 '21 at 17:07

score 6 · Answer 2 · edited Feb 25 '19 at 02:49


The short answer is that there isn't a single, "right" way to answer this question.

For the best review of the issues see Ulrike Groemping's papers, e.g., Estimators of Relative Importance in Linear Regression Based on Variance Decomposition. The options she discusses range from simple heuristics to sophisticated, CPU intensive, multivariate solutions.

http://prof.beuth-hochschule.de/fileadmin/prof/groemp/downloads/amstat07mayp139.pdf

Groemping implements her own approach in an R package called relaimpo, whose documentation is also worth reading.

https://cran.r-project.org/web/packages/relaimpo/relaimpo.pdf

One quick and dirty heuristic that I've used is to sum up the chi-squares (F values, t-statistics) associated with each parameter, then re-express each individual value as a percentage of that sum. The result is a rankable metric of relative importance.
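
The percentage heuristic can be sketched in a few lines. The predictor names and chi-square values below are made up for illustration. Signed statistics such as t-values would presumably need to be squared (or taken in absolute value) first so that the shares stay non-negative.

```python
# Hypothetical chi-square statistics, one per predictor (illustrative values only).
chi_squares = {"age": 24.0, "dose": 12.0, "weight": 4.0}

# Express each statistic as a percentage of the total.
total = sum(chi_squares.values())
shares = {name: 100.0 * value / total for name, value in chi_squares.items()}

# Print predictors from largest to smallest share.
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {share:.1f}% of the summed chi-square")
```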

That said, I've never been a fan of "standardized beta coefficients" although they are frequently recommended by the profession and widely used. Here's the problem with them: the standardization is univariate and external to the model solution. In other words, this approach does not reflect the conditional nature of the model's results.

edited Feb 25 '19 at 02:49 by steadyfish · answered Feb 28 '16 at 13:46 by Mike Hunter

Thanks for the answer and the links! Can you elaborate on or help me understand what "external to the model solution" and "the conditional nature of the model's results" means? (I'm not an expert in statistics, alas.) – D.W. Feb 29 '16 at 00:15

No worries. The notion of how models "control" or condition for the other factors in a model may be one of those things on which many statisticians can actually agree. It's also a topic that's seen a lot of commentary on this site. Here's a link to one such thread: http://stats.stackexchange.com/questions/17336/how-exactly-does-one-control-for-other-variables One of the best comments in it was by @whuber who said, 'You may think of "controlling" as "accounting (in the least square sense) for the contribution/influence/effect/association of a variable on all the other variables."' – Mike Hunter Feb 29 '16 at 11:56
Thanks! I'm familiar with the notion of "controlling for" some factor. How does that relate to or help understand the meaning of "external to the model solution" or "the conditional nature of the model's results"? – D.W. Feb 29 '16 at 16:01
Standardizing predictors to create a "standardized beta" is typically done before a model is built, correct? Therefore, that transform is "external" to the model's solution. With me so far? – Mike Hunter Feb 29 '16 at 16:09
OK. I can understand what you mean by "external" now -- thanks for the explanation. Can you explain why this is a problem, and what's meant by "the conditional nature..."? (Maybe those two questions are the same question with the same answer...) Sorry to pepper you with questions! I am eager to understand what you wrote. – D.W. Feb 29 '16 at 16:12
These are good questions. You're making me rethink this objection to standardized betas. If the parameters are conditional, then shouldn't the standardizing also be conditional? In other words, if the betas are to accurately reflect relative importance, then something like conditional, "least squares" means as defined by the model should be used in the transformation. – Mike Hunter Feb 29 '16 at 16:23

score 3 · Answer 3 · answered Feb 28 '16 at 06:31


A fairly robust way of doing this would be to try fitting the model N times where N is the number of features. Each time use N-1 of the features and leave one feature out. Then you can use your favourite validation metric to measure how much the inclusion or exclusion of each feature affects the performance of the model. Depending on the number of features you have this may be computationally expensive.
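
The leave-one-feature-out procedure can be sketched as follows. This is a minimal illustration under several assumptions that are not in the answer: the data are synthetic, the fitter is a hand-rolled gradient-ascent logistic regression (in practice you would use your usual library), and the validation metric is held-out log loss. The names `fit_logistic`, `log_loss`, `drop_column` and `make_data` are hypothetical helpers.

```python
import math
import random


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp so exp() cannot overflow
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, steps=300, lr=0.5):
    """Fit logistic regression by plain batch gradient ascent (illustrative)."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            err = yi - sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi)))
            grad_b += err
            for j in range(p):
                grad_w[j] += err * xi[j]
        b += lr * grad_b / n
        w = [wj + lr * gj / n for wj, gj in zip(w, grad_w)]
    return w, b


def log_loss(X, y, w, b):
    """Mean negative log-likelihood on a data set (lower is better)."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi)))
        total -= math.log(p) if yi == 1 else math.log(1.0 - p)
    return total / len(X)


def drop_column(X, j):
    """Return a copy of X without feature j."""
    return [row[:j] + row[j + 1:] for row in X]


random.seed(0)
names = ["x0", "x1", "x2"]  # assumed: x0 strong signal, x1 weak, x2 pure noise


def make_data(n):
    X, y = [], []
    for _ in range(n):
        row = [random.gauss(0, 1) for _ in names]
        X.append(row)
        y.append(1 if random.random() < sigmoid(2.0 * row[0] + 0.5 * row[1]) else 0)
    return X, y


X_train, y_train = make_data(400)
X_valid, y_valid = make_data(400)

# Baseline: all N features.
w, b = fit_logistic(X_train, y_train)
baseline = log_loss(X_valid, y_valid, w, b)

# Refit N times, each time with one feature left out.
importance = {}
for j, name in enumerate(names):
    w_j, b_j = fit_logistic(drop_column(X_train, j), y_train)
    importance[name] = log_loss(drop_column(X_valid, j), y_valid, w_j, b_j) - baseline

for name, delta in sorted(importance.items(), key=lambda kv: -kv[1]):
    print(f"{name}: change in validation log loss when dropped = {delta:+.4f}")
```

A larger increase in validation loss means the model relied more on that feature. Any other validation metric could be substituted for log loss.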

answered Feb 28 '16 at 06:31 by Daniel Johnson


This does not handle correlated features well. It is easy to engineer a situation where two features are highly correlated, so that removing either one of them impacts the predictive power minimally, but removing *both* impacts it severely. Essentially, one in which the two predictors carry almost identical, but important, information. – Matthew Drury Feb 28 '16 at 06:37

I agree. This is also a danger when examining coefficients. – Daniel Johnson Feb 28 '16 at 06:39

Quite true. Quite true. – Matthew Drury Feb 28 '16 at 06:39

score 2 · Answer 4 · answered Feb 28 '16 at 07:03

You are correct in your observation that merely looking at the size of the estimated coefficient $|\hat{\beta}_j|$ is not very meaningful for the reason mentioned. But a simple adjustment is to multiply the coefficient estimate by the estimated standard deviation of the predictor, $|\hat{\beta}_j| \hat{\sigma}_j$, and use this as a measure of importance. This is sometimes called a standardized beta coefficient, and in logistic regression it represents the change in the estimated log odds of success caused by a one standard deviation change in $x_j$. One issue with this is that it breaks down when you're no longer dealing with numeric predictors.
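
As a small illustration of this adjustment, the sketch below multiplies each coefficient's absolute value by the sample standard deviation of its predictor. The coefficients, predictor names and data are all hypothetical. They are chosen so that the predictor with the wider spread has the smaller raw coefficient, as in the scenario described in the question.

```python
import statistics

# Hypothetical fitted coefficients (change in log odds per unit of each predictor).
coefficients = {"income_k": 0.03, "age_years": 0.10}

# Hypothetical observed values of each predictor.
data = {
    "income_k": [20, 35, 50, 80, 120, 200],
    "age_years": [25, 31, 38, 44, 52, 60],
}

# |beta_j| * sd_j: change in log odds for a one standard deviation change in x_j.
standardized = {
    name: abs(beta) * statistics.stdev(data[name])
    for name, beta in coefficients.items()
}

for name, value in sorted(standardized.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: raw = {coefficients[name]:.2f}, per-SD = {value:.2f}")
```

Here the raw coefficient for `income_k` is smaller, but its per-standard-deviation effect is larger because the predictor varies over a much wider range.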

Regarding your last point, of course it's possible that a variable might contribute a lot to the estimated log odds while not actually affecting the "true" log odds much, but I don't think this needs to be too much of a concern if we have any confidence in the procedure that produced the estimates.

Do you have any sources for this? It would be good to read the mathematical rigour behind it. If one had a mix of categorical and numerical variables going into a model, how would the standard deviation be assessed? — Chuck, Apr 21 '20 at 13:50

score 0 · Answer 5 · answered Feb 25 '19 at 08:02

You are right about why you should not use the coefficients as a measure of relevance, but you absolutely can if you divide them by their standard error! If you have estimated the model with R, then it is already done for you! You can even remove the least important features from the model and see how it works.
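
The ratio described here is the Wald z-statistic, which R reports in the "z value" column of a fitted `glm` summary. A minimal sketch of the calculation, using made-up estimates and standard errors:

```python
# Hypothetical (estimate, standard error) pairs for three predictors.
estimates = {"x1": (0.80, 0.40), "x2": (0.15, 0.03), "x3": (-1.20, 1.00)}

# Wald z-statistic: coefficient divided by its standard error.
z_scores = {name: beta / se for name, (beta, se) in estimates.items()}

# Rank predictors by the absolute value of z.
ranked = sorted(z_scores, key=lambda name: abs(z_scores[name]), reverse=True)

for name in ranked:
    print(f"{name}: z = {z_scores[name]:+.2f}")
```

Note that `x3` has the largest coefficient in absolute value but ranks last, because its estimate is the least precise.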

A more heuristic approach to studying how changes in the variables alter the outcome is to do exactly that: try different inputs and study their estimated probabilities. However, as your model is quite simple, I would suggest against that.
