It's clearly not possible "to know for each attribute the cutoff [emphasis added] that causes death" for any one predictor, independent of the values of all the other predictors, even for a simple decision tree. In a logistic-regression classification the cutoff is some value of the combined linear predictor rather than the value of any individual predictor.
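To make that concrete, here is a minimal Python sketch with made-up coefficients (`b0`, `b1`, `b2` are hypothetical fitted values, not from any real model): the 0.5-probability cutoff is the point where the combined linear predictor equals 0, so the apparent "cutoff" on one predictor shifts as the other predictors change.

```python
import math

# Toy illustration: the classifier's 0.5-probability cutoff is a value of
# the combined linear predictor (lp = 0), not a fixed value of any single
# predictor. Coefficients below are invented for illustration.
b0, b1, b2 = -3.0, 1.0, 2.0  # hypothetical fitted intercept and slopes

def predicts_death(x1, x2, threshold=0.5):
    lp = b0 + b1 * x1 + b2 * x2        # combined linear predictor
    p = 1 / (1 + math.exp(-lp))        # predicted probability
    return p >= threshold

# Solving lp = 0 for x1 shows the "cutoff" on x1 moving with x2:
cutoff_x1_at_x2_0 = (-b0 - b2 * 0.0) / b1   # x1 cutoff when x2 = 0 -> 3.0
cutoff_x1_at_x2_1 = (-b0 - b2 * 1.0) / b1   # x1 cutoff when x2 = 1 -> 1.0
```

With these invented coefficients the x1 "cutoff" is 3.0 when x2 = 0 but only 1.0 when x2 = 1; no single-predictor cutoff exists.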
There are ways to gauge the importance of an individual predictor in terms of its contribution to a classification model. An Introduction to Statistical Learning (ISLR) covers this at several points throughout the text, often with exercises.
Perhaps the best service you could provide to your colleagues about "feature importance," however, would be to document how variable the estimates of feature importance can be when predictors are correlated. Frank Harrell succinctly describes the difficulties on this Cross Validated page. Whatever measures of feature importance you use, try repeating the process on multiple bootstrap samples of the original data and see how the relative rankings of features differ among samples. The results can be quite distressing to those who place much importance on "feature importance."
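Here is a minimal sketch of that bootstrap check, on invented data. To stay self-contained it uses |correlation with the outcome| as a stand-in for whatever model-based importance measure you actually use; the point is only the mechanics of re-ranking on resamples.

```python
import random

random.seed(0)

# Invented data: x1 and x2 are strongly correlated, both related to y.
n = 200
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [a + random.gauss(0, 0.1) for a in x1]              # x2 nearly equals x1
y = [a + b + random.gauss(0, 1) for a, b in zip(x1, x2)]

def corr(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (su * sv)

# Rank the two predictors on each bootstrap resample and count how often
# x2 outranks x1 on the stand-in importance measure.
flips = 0
n_boot = 500
for _ in range(n_boot):
    idx = [random.randrange(n) for _ in range(n)]
    bx1 = [x1[i] for i in idx]
    bx2 = [x2[i] for i in idx]
    by = [y[i] for i in idx]
    if abs(corr(bx2, by)) > abs(corr(bx1, by)):
        flips += 1

flip_fraction = flips / n_boot
# With predictors this correlated, the "more important" predictor changes
# from resample to resample much of the time.
```

With real data you would substitute your actual importance measure for `corr` and rank all predictors, but the instability with correlated predictors shows up the same way.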
I suspect that you are already familiar with measures of variable importance in classification, but as this site aims to provide a repository of answers that can be useful to others later, I'll nevertheless provide a few examples.
Regression coefficients in logistic regression (ISLR, Table 4.3) provide at least two ways to proceed: use the absolute values of the coefficients for normalized predictors, or use the coefficient Wald statistics, as recommended by Harrell in the context of Cox survival models, for example.
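A sketch of the first idea on invented data (everything here is illustrative; the fitting loop is plain gradient ascent, not a recommended routine): x1 is generated to matter much more than x2, and after standardizing the predictors the fitted |coefficients| reflect that.

```python
import math
import random

random.seed(1)

# Invented data: generating coefficients 2.0 (x1) vs 0.3 (x2).
n = 400
X = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(n)]

def prob(lp):
    return 1 / (1 + math.exp(-lp))

y = [1 if random.random() < prob(2.0 * a + 0.3 * b) else 0 for a, b in X]

def standardize(col):
    m = sum(col) / len(col)
    s = (sum((v - m) ** 2 for v in col) / len(col)) ** 0.5
    return [(v - m) / s for v in col]

x1s = standardize([a for a, _ in X])
x2s = standardize([b for _, b in X])

# Plain gradient ascent on the log-likelihood -- a sketch, not an
# efficient fitter (use glm() in R or a proper optimizer in practice).
b0 = b1 = b2 = 0.0
lr = 0.5
for _ in range(1500):
    g0 = g1 = g2 = 0.0
    for a, b, t in zip(x1s, x2s, y):
        r = t - prob(b0 + b1 * a + b2 * b)   # residual on probability scale
        g0 += r
        g1 += r * a
        g2 += r * b
    b0 += lr * g0 / n
    b1 += lr * g1 / n
    b2 += lr * g2 / n
# Now |b1| substantially exceeds |b2|, matching the data-generating process.
```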
The R randomForest package provides two measures of individual variable importance in its importance() function. Quoting from the help page:
The first measure is computed from permuting OOB data: For each tree, the prediction error on the out-of-bag portion of the data is recorded (error rate for classification, MSE for regression). Then the same is done after permuting each predictor variable. The difference between the two are then averaged over all trees, and normalized by the standard deviation of the differences. If the standard deviation of the differences is equal to 0 for a variable, the division is not done (but the average is almost always equal to 0 in that case).
The second measure is the total decrease in node impurities from splitting on the variable, averaged over all trees. For classification, the node impurity is measured by the Gini index. For regression, it is measured by residual sum of squares.
ISLR illustrates this in Figure 8.9 and in the lab of Section 8.3.3.
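The first (permutation-based) measure can be sketched in a much-simplified form: here a single decision stump and a holdout set stand in for the forest's many trees and out-of-bag samples, and the data and all names are invented for illustration.

```python
import random

random.seed(2)

# Invented data: only feature 0 is informative.
n = 300
X = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(n)]
y = [1 if row[0] > 0 else 0 for row in X]

train_X, train_y = X[:200], y[:200]
test_X, test_y = X[200:], y[200:]

def fit_stump(X, y):
    # Choose the feature for which the rule "x_j > 0 predicts class 1"
    # has the best training accuracy.
    scores = [sum((row[j] > 0) == bool(t) for row, t in zip(X, y)) / len(y)
              for j in range(len(X[0]))]
    return scores.index(max(scores))

def accuracy(X, y, j):
    return sum((row[j] > 0) == bool(t) for row, t in zip(X, y)) / len(y)

j_split = fit_stump(train_X, train_y)
base = accuracy(test_X, test_y, j_split)

importances = []
for j in range(2):
    perm = [row[:] for row in test_X]
    col = [row[j] for row in perm]
    random.shuffle(col)
    for row, v in zip(perm, col):
        row[j] = v
    # Drop in accuracy after permuting feature j = its importance.
    importances.append(base - accuracy(perm, test_y, j_split))
```

Permuting the informative feature costs roughly half the accuracy, while permuting the other feature costs nothing; randomForest averages this kind of difference over all trees and their OOB samples.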
Section 4.5 of ISLR (p. 151) baldly states: "KNN does not tell us which predictors are important; we don’t get a table of coefficients as in Table 4.3." At first glance, however, I don't see why some permutation-based estimate of variable importance couldn't be devised for KNN. I have little personal experience with KNN, so this might already have been done.
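One way to see that such a scheme is feasible for KNN: wrap any KNN predictor in the same permute-and-rescore loop. The classifier below is hand-rolled on invented data; nothing here is an existing API.

```python
import random

random.seed(3)

# Invented data: only feature 0 is informative.
n = 240
X = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(n)]
y = [1 if row[0] > 0 else 0 for row in X]

train_X, train_y = X[:160], y[:160]
test_X, test_y = X[160:], y[160:]

def knn_predict(query, k=5):
    # Majority vote among the k nearest training points (squared Euclidean).
    nearest = sorted(range(len(train_X)),
                     key=lambda i: sum((a - b) ** 2
                                       for a, b in zip(train_X[i], query)))[:k]
    return 1 if sum(train_y[i] for i in nearest) * 2 > k else 0

def accuracy(rows, labels):
    return sum(knn_predict(r) == t for r, t in zip(rows, labels)) / len(labels)

base = accuracy(test_X, test_y)

importance = []
for j in range(2):
    perm = [row[:] for row in test_X]
    col = [row[j] for row in perm]
    random.shuffle(col)
    for row, v in zip(perm, col):
        row[j] = v
    importance.append(base - accuracy(perm, test_y))
# Permuting the informative feature degrades accuracy far more than
# permuting the uninformative one, even though KNN has no coefficients.
```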
Feature-selection algorithms, developed to help cope with many-predictor/few-observation scenarios in various classification approaches, also depend on measures of variable importance that presumably can be extracted for your 44-predictor/800-instance data set.
But again, showing your colleagues the variability of feature importance rankings would be doing them an important service, even if they don't at first like what they see.