
Assume I'm a doctor and I want to know which variables are most important to predict breast cancer (binary classification). Two different scientists each present me with a different feature importance figure...

Logistic Regression with an L2 penalty (absolute values of the model coefficients; 10 highest shown): [figure: bar chart of the 10 largest coefficient magnitudes]

And Random Forests (10 highest shown): [figure: bar chart of the 10 largest feature importances]

The results are very different. Which scientist should I trust? Is one of these figures meaningless, or are both?

Code below, using the Wisconsin Breast Cancer dataset from scikit-learn.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.ensemble import RandomForestClassifier
import numpy as np
import matplotlib.pyplot as plt

data = load_breast_cancer()
y = data.target
X = data.data

# Logistic regression; regularization strength chosen by
# cross-validation (L2 penalty by default)
clf = LogisticRegressionCV(max_iter=3000)
clf.fit(X, y)
coefs = np.abs(clf.coef_[0])       # absolute coefficient magnitudes
indices = np.argsort(coefs)[::-1]  # features sorted largest to smallest

plt.figure()
plt.title("Feature importances (Logistic Regression)")
plt.bar(range(10), coefs[indices[:10]], color="r", align="center")
plt.xticks(range(10), data.feature_names[indices[:10]], rotation=45, ha='right')
plt.subplots_adjust(bottom=0.3)

# Random forest; hyperparameters have already been tuned
clf = RandomForestClassifier(n_jobs=-1, random_state=42, n_estimators=400,
                             max_depth=6, max_features=6)
clf.fit(X, y)
coefs = clf.feature_importances_   # impurity-based (Gini) importances
indices = np.argsort(coefs)[::-1]

plt.figure()
plt.title("Feature importances (Random Forests)")
plt.bar(range(10), coefs[indices[:10]], color="r", align="center")
plt.xticks(range(10), data.feature_names[indices[:10]], rotation=45, ha='right')
plt.subplots_adjust(bottom=0.3)

plt.ion()  # interactive mode so both figures display without blocking
plt.show()
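
Note: the raw coefficient magnitudes depend on each feature's scale, so the logistic regression bars are only comparable if the features are on similar scales. A minimal sketch of a standardized variant, continuing from the code above (the pipe variable is introduced here purely for illustration; the figures were made with the unstandardized code):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Sketch: standardize features first so coefficient magnitudes are
# comparable across features; reuses X, y and np from the code above
pipe = make_pipeline(StandardScaler(), LogisticRegressionCV(max_iter=3000))
pipe.fit(X, y)
coefs = np.abs(pipe.named_steps['logisticregressioncv'].coef_[0])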
Oliver Angelil
  • I don't know Python that well, but are you using the coefficient values to assess importance for logistic regression? If so, you need to account for the standard errors. A meaningless variable may have a large coefficient but also a large standard error. Both models are also affected by multicollinearity. You should see how removing a few variables affects your final importance rankings. – Peter Calhoun Mar 24 '18 at 02:53
  • It's not that they are meaningless; it's that they measure something about the *model*, not something about the way the world works. There is no agreed-upon notion of the predictive power of a variable; it's a completely informal and nebulous concept. It's only ever defined in the context of a specific model, so it ends up telling you more about the model than about the world. – Matthew Drury Mar 24 '18 at 03:00
  • You may find this question useful: https://stats.stackexchange.com/questions/202277/what-are-variable-importance-rankings-useful-for – Matthew Drury Mar 24 '18 at 03:01
  • This is a good post about not lying with statistics generally: https://gking.harvard.edu/files/abs/mist-abs.shtml. I think a sufficiently motivated graduate student could make a name for him/herself by rewriting the same article and substituting "random forest" for "regression". – Sycorax Mar 24 '18 at 03:13
  • There is no single definition of "importance", and what is "important" for LR and for RF is not comparable or even remotely similar: one RF importance measure is mean information gain, while the LR coefficient size is the average effect of a 1-unit change in a linear model. Gary King describes in that article why even *standardized* units of a regression model are not so simply interpreted. Moreover, the LR and RF feature importance concepts are so far removed from each other that grouping them under "feature importance" is really, really deceptive. – Sycorax Mar 24 '18 at 03:13
  • The idea that one measure is "right" completely misses the point that LR and RF provide completely different answers to the same question **by their very nature**, and that is doubly true when seeking to know what is "important". – Sycorax Mar 24 '18 at 03:16
  • Thanks for the useful answers and links. Now I'm confused about why such feature importance figures pop up fairly often. Why should we care about them, as @Matthew asked in his linked question? And as he said above, these numbers tell us which variables are most useful to the model for predictive power, not what matters in the real world. Forgetting about RF and LR for now: is there any way we could advise a doctor on checkups for detecting cancer, e.g. "prioritise inspecting variable X_i over variable X_y", where X_y is a completely useless variable (e.g. nose size)? – Oliver Angelil Mar 24 '18 at 04:52
  • Could the focus rather be on which coefficients end up being ~0? Say, after using LASSO, can the variables whose coefficients are shrunk to 0 be used to advise a doctor not to prioritise checking those characteristics when trying to detect cancer? (An L1 sketch follows this thread.) – Oliver Angelil Mar 24 '18 at 04:56
  • @OliverAngelil Why would you want a doctor to make a decision that way? The model is built to use *all of that information in conjunction.* – Sycorax Mar 24 '18 at 05:54
  • I know that. Forget about RF/LR for now. Suppose doctors are performing certain tests that are very costly and not very useful for detecting cancer. What analyses could be performed to identify this? – Oliver Angelil Mar 24 '18 at 06:31
  • Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/75016/discussion-between-oliver-angelil-and-sycorax). – Oliver Angelil Mar 24 '18 at 06:51
  • I feel like we all had a productive discussion, but now this question really needs an answer! @OliverAngelil, would you like to answer your own question? – Matthew Drury Mar 28 '18 at 13:48
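
A minimal sketch of the LASSO idea raised in the comments, checking which coefficients an L1 penalty drives to exactly zero (the penalty strength C=0.1 is an arbitrary choice for illustration, not a tuned value):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Sketch: L1-penalized logistic regression on standardized features;
# C=0.1 is an arbitrary illustrative choice
data = load_breast_cancer()
pipe = make_pipeline(StandardScaler(),
                     LogisticRegression(penalty='l1', solver='liblinear', C=0.1))
pipe.fit(data.data, data.target)
coef = pipe.named_steps['logisticregression'].coef_[0]
print("Zeroed-out features:", data.feature_names[coef == 0])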

2 Answers


From your comments, it seems like what you are really after is feature selection: you want a sequence of models that use increasing numbers of features (1, 2, 3, ..., N), such that each newly added feature yields as great an increase in model performance as possible. The decision makers can then assess whether it is worth carrying out a costly procedure to obtain the data for an additional feature in exchange for a more complicated model with greater precision/recall. We assume here that obtaining the data costs the same for every feature.

In that case, I would separate your data into a training set and a test set, and use cross-validation on the training set to select the best incremental feature. (Strictly speaking, you need nested cross-validation here, but if that is computationally infeasible or you don't have enough data, you can at least verify at the end that you did not overfit by cross-referencing the CV results with the test-set results.) That is, you would start by trying each feature on its own and choose the one that gives the best CV performance; you would then repeat the process to iteratively add further features.

Watch whether different CV folds select different best incremental features; if the variability is too high, this approach may not be feasible. A sketch of the procedure follows.
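
A minimal sketch of this greedy forward selection, assuming scikit-learn >= 0.24 for SequentialFeatureSelector (the n_features_to_select=5 stopping point is arbitrary, for illustration only):

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Sketch: greedy forward selection, cross-validated on the training
# set only; the held-out test set is kept for the final overfit check
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0)

sfs = SequentialFeatureSelector(LogisticRegression(max_iter=3000),
                                n_features_to_select=5,
                                direction='forward', cv=5)
sfs.fit(X_train, y_train)
print("Selected features:", data.feature_names[sfs.get_support()])

Refitting on the selected features and scoring on X_test, y_test then provides the check against overfitting mentioned above.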

rinspy
  • Don't you think the features picked next to improve the model the most will depend on the ML method used, e.g. logistic regression vs. random forest? Or is the method irrelevant, and it is rather whichever feature leads to the biggest improvement in test error? – Oliver Angelil Apr 05 '18 at 16:36
  • @OliverAngelil Yes, it might depend on the model used. So it makes sense to perform such feature selection on the model that you will actually be using, e.g. the one with the best out-of-sample performance. You might even want to ensemble several models; it doesn't matter, since you perform this kind of feature selection using the model that you end up using. Different models giving you different important features is not necessarily a problem: it might indicate high variance, or multicollinearity, or perhaps your two models have low correlation, in which case you should ensemble them. – rinspy Apr 06 '18 at 08:22
  • @OliverAngelil Of those cases, I would say only high variance is a problem for a predictive model. If you aim to establish a causal relationship to infer knowledge from your model, it's a different story, of course. – rinspy Apr 06 '18 at 08:23

The question is ill-posed. We cannot advise the doctor that, for example, inspecting feature $X_a$ is more worthwhile than inspecting feature $X_b$, because how "important" a feature is only makes sense in the context of a specific model, not in the real world.

Logistic Regression and Random Forests are two completely different methods that make use of the features (in conjunction) in different ways to maximise predictive power. This is why a different set of features offers the most predictive power for each model.
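
As a quick illustration of this point, one can check how much the two models' top-10 lists actually overlap (a sketch refitting both models from the question; lr and rf are new names introduced here):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegressionCV
import numpy as np

# Sketch: intersect the top-10 feature sets of the two models
data = load_breast_cancer()
X, y = data.data, data.target

lr = LogisticRegressionCV(max_iter=3000).fit(X, y)
rf = RandomForestClassifier(n_estimators=400, random_state=42).fit(X, y)

lr_top = set(data.feature_names[np.argsort(np.abs(lr.coef_[0]))[-10:]])
rf_top = set(data.feature_names[np.argsort(rf.feature_importances_)[-10:]])
print("Top-10 overlap:", lr_top & rf_top)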

Such feature importance figures show up often, but the information they convey is commonly mistaken for something relevant to the real world. Why one would be interested in such a figure at all is unclear.

Oliver Angelil