Using machine learning models in an explanatory way

Question

I'm hoping to learn some caveats to consider when training a machine learning model not for prediction but to point out features and feature values that significantly influence a target variable.

I have a data set with a large set of categorical attributes and one continuous target variable. I want to ask the question: Which attribute, and specifically, which attribute value is associated with a high value in the target variable?

Experimentally, I tried the following approach:

applied dummy coding/one-hot encoding to all categorical features
trained a regression model (sklearn.ensemble.RandomForestRegressor) on the entire data set
looked at the fitted model's feature_importances_ attribute: the values have a very skewed distribution in which some feature-value combinations pop out as very important compared to the rest
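For concreteness, the three steps above can be sketched roughly as follows. This is only a minimal illustration on synthetic data: the column names (color, shape), the sample size and the forest settings are assumptions made up for the example, not the actual data set.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (hypothetical): two categorical attributes, one continuous target.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "color": rng.choice(["red", "green", "blue"], size=n),
    "shape": rng.choice(["circle", "square"], size=n),
})
# By construction, only color == "red" raises the target.
y = 5.0 * (df["color"] == "red").to_numpy() + rng.normal(0.0, 1.0, size=n)

# Step 1: one-hot encode every categorical attribute.
X = pd.get_dummies(df)

# Step 2: fit the forest on the entire data set.
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Step 3: impurity-based importances, one value per attribute-value dummy column.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Note that feature_importances_ is unsigned and normalized to sum to 1, so on its own it does not say whether an attribute value goes with high or low target values.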

Now, is it methodologically sound to infer that these are important factors influencing the target variable? What needs to be considered to make this conclusion valid? For example:

is this a sound approach in general?
is one-hot-encoding the right preprocessing step?
are the same validation/optimization/parameter tuning methods used to improve a predictive model relevant to such an "explanatory model"?
the target variable has a very skewed distribution where 80% of items have a value of less than 3 and some a value of >100 - is that relevant?
...

Mike Hunter · Accepted Answer · 2017-11-25T13:56:18.267

This response doesn't answer all of your questions, but it does address some of them. First, while heuristics for the relative importance of features in ML are still in their infancy, this is not the case for classic, multivariate statistical modeling. For confirmation, browse Ulrike Groemping's several papers and her R package, relaimpo, for an exhaustive review of the literature on relative variable importance, e.g., here: https://cran.r-project.org/web/packages/relaimpo/relaimpo.pdf. Her work is enormously informative about the meaning and measurement of relative variable importance and should serve as the authoritative benchmark for any and all comparable ML approaches.

As the OP notes, Breiman's 2001 paper Random Forests (https://www.stat.berkeley.edu/users/breiman/randomforest2001.pdf) is one of the first suggestions for relative feature importance in the context of ML, and it remains one of the best and most workable. One useful point is that while RFs were originally designed for use with CART trees, they are by no means limited to CART, since any multivariate engine (e.g., logistic regression, OLS regression, econometric panel data models, whatever) can be plugged into an RF or, more generally, into a divide-and-conquer algorithm.

Another, more recent reference proposing a possible solution to relative feature importance is Sirignano, Sadhwani and Giesecke's paper titled Deep Learning for Mortgage Risk (here: https://bfi.uchicago.edu/sites/default/files/file_uploads/Slides%20Giesecke.pdf). I would like to avoid getting too deeply into the weeds of their data and methodology but, to the authors' credit, the explanations in their paper are quite clear and cogent, so there is no need to rehearse them in this short response. It's enough to note that at 2 TB of data spanning ~70% of residential US mortgages between 1995 and 2014, containing 120 million individual loans that expand out to ~3.5 billion monthly observations, Giesecke et al.'s dataset is massive. Given that this much data is too big for evaluation on any machine on the planet, they employ a widely used divide-and-conquer workaround: creating millions of randomly drawn, bootstrapped "sequences of blocks of data" to enable estimation of their model(s) on a GPU-equipped, parallel AWS machine.

Their approach is based on an ensemble of NNs with relative feature importance evaluated using LOOCV. However, as the authors note, their heuristic is limited by unmeasured sources of bias such as collinearity. That said, the problems with leveraging ML and, specifically, NN models for descriptive purposes run deeper than mere collinearity.

There are two key caveats to note in their approach. Taken together, these caveats greatly limit the potential for evaluating relative feature importance in the context of ML models:

1) They treat the ~3.5 billion records as iid and randomly draw bootstrapped samples based on that assumption

2) Their implementation of an ensemble of NNs requires features scaled in only 1 of 3 ways: continuous, dummy (0,1) and/or effect coded (1,0,-1) features. In other words, even simple multilevel categorical features such as 'state' require decomposition into 50 dummy (0,1) variables

First, treating all of the ~3.5 billion records as iid ignores the many dependencies and hierarchical relationships inherent to their data, e.g., classic time series issues such as autocorrelation, cointegration and stationarity, individual loans within zip code, zip code within county, both within state, and more. In short, while treating these ~3.5 billion observations as iid may be computationally efficient and produce useful (accurate) predictions, it is destroying variance structures essential to understanding the information for descriptive purposes.
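One standard way to keep at least part of that dependence structure when bootstrapping is to resample whole groups rather than individual rows (a cluster or block bootstrap). The sketch below is only an illustration of that idea, not what Giesecke et al. did; the helper cluster_bootstrap, the column names and the tiny panel are all hypothetical.

```python
import numpy as np
import pandas as pd

def cluster_bootstrap(df, group_col, rng):
    """Resample whole groups with replacement, keeping each group's rows together."""
    by_group = {key: rows for key, rows in df.groupby(group_col)}
    groups = list(by_group)
    drawn = rng.choice(groups, size=len(groups), replace=True)
    return pd.concat([by_group[g] for g in drawn], ignore_index=True)

rng = np.random.default_rng(0)

# Hypothetical panel: 4 loans, each observed for 3 months.
panel = pd.DataFrame({
    "loan_id": np.repeat([101, 102, 103, 104], 3),
    "month": np.tile([1, 2, 3], 4),
    "balance": rng.normal(100.0, 10.0, size=12),
})

# Each drawn loan brings all of its monthly rows along, in order.
sample = cluster_bootstrap(panel, "loan_id", rng)
print(sample)
```

Because the unit being drawn is the loan, the within-loan time series stays intact in every replicate, which a row-level iid bootstrap would scramble. Nesting (loan within zip code within county within state) would need a correspondingly higher-level grouping column.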

Next, requiring that even small, multilevel categorical features such as 'state' be blown out into 50 dummy variables is also variance destroying. The comparison here is to R. A. Fisher's creation of ANOVA nearly 100 years ago and his treatment of blocks of categorical 'factors' such as soil type as a single independent variable. The issue becomes even worse with massively categorical features such as zip code which, in the context of NNs, necessitate creating 30,000+ dummy variables. Obviously, decomposing categorical information into a series of 0,1 dummies both disperses and decomposes a tremendous amount of valuable information into a virtually irrelevant string of variables. No one cares about the impact of a few zip codes on a model. However, that same person should be interested in a composite, ANOVA-type factor summarizing their impact across all 30,000+ levels.
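To illustrate the difference between per-dummy and per-factor importance, here is one possible sketch. It is not an ANOVA; it swaps in permutation importance computed on the raw categorical columns, so that each multilevel feature gets a single score even though the model itself still sees one-hot dummies. The data, column names and model settings are assumptions for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: 'state' drives the target, 'channel' does not.
rng = np.random.default_rng(0)
n = 600
X = pd.DataFrame({
    "state": rng.choice(["CA", "NY", "TX", "WA"], size=n),
    "channel": rng.choice(["web", "branch"], size=n),
})
state_effect = {"CA": 0.0, "NY": 2.0, "TX": 4.0, "WA": 6.0}
y = X["state"].map(state_effect).to_numpy() + rng.normal(0.0, 1.0, size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One-hot encoding happens inside the pipeline, so the pipeline's input
# columns are still the original multilevel factors.
encoder = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["state", "channel"])]
)
pipe = make_pipeline(encoder, RandomForestRegressor(n_estimators=100, random_state=0))
pipe.fit(X_train, y_train)

# Shuffling a raw column perturbs all of its dummies at once, which yields
# one held-out importance score per factor rather than per level.
result = permutation_importance(pipe, X_test, y_test, n_repeats=5, random_state=0)
factor_importance = pd.Series(result.importances_mean, index=X.columns)
print(factor_importance)
```

The score is the mean drop in held-out R² when a whole factor is shuffled, so it plays a role loosely similar to a single composite term per factor; it inherits the usual caveats about correlated features.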

So, this response is suggesting in answer to your questions that:

1) ML models that ignore structure and variance for reasons of computational efficiency are not "sound approaches" for evaluating relative feature importance for descriptive purposes

2) "One-hot-encoding" may be necessary but is not useful for evaluating relative feature importance for descriptive purposes

3) "Are the same validation/optimization/parameter tuning methods to improve a predictive model relevant to such an 'explanatory model?'" Tuning methods able to capture important structural aspects in data are valid for use with explanatory models intended to evaluate relative feature importance for descriptive purposes

4) "The target variable has a very skewed distribution where 80% of items have a value of less than 3 and some a value of >100 - is that relevant?" Yes, this is highly relevant. Modeling sparse information is an area seeing a huge amount of research interest in the ML community. My view is that Nathan Kutz at the U of Washington is doing some of the best, if not the best, work in this area. Check out his 2013 book Data-Driven Modeling & Scientific Computation: Methods for Complex Systems & Big Data, which focuses on PDEs, as well as his one-hour YouTube video from June 2017, which covers his more recent publications related to these issues ... https://www.youtube.com/watch?feature=youtu.be&v=Oifg9avnsH4&app=desktop

/* Adding additional references to ML relative variable importance */

These references just popped up on AndrewGelman.com (11.25.17):

Using output from a fitted machine learning algorithm as a predictor in a statistical model http://andrewgelman.com/2017/11/24/using-output-fitted-machine-learning-algorithm-predictor-statistical-model/

“Why Should I Trust You?” Explaining the Predictions of Any Classifier http://www.kdd.org/kdd2016/papers/files/rfp0573-ribeiroA.pdf

Thanks for the detailed answer. Since you are critical of the approach described above, can you suggest an alternative that would work for my categorical data? — clstaudt, Nov 16 '17 at 18:38
Any regression-based approach that can be integrated into a divide and conquer algorithm -- that is, if your data is so large one is needed. The key thing to circumvent the problems with the Giesecke, et al. approach is *not* to treat your records as *iid*. In their case, a more suitable unit of analysis that would have preserved structure and variance would have been the ids defining a unique loan. — Mike Hunter, Nov 16 '17 at 23:18
Relevant: https://stats.stackexchange.com/questions/146907/principled-way-of-collapsing-categorical-variables-with-many-categories — kjetil b halvorsen, Nov 25 '17 at 15:18
@kjetilbhalvorsen Thanks. If I made a contribution to that thread it would be to recommend *not* collapsing a categorical feature with large numbers of mutually exclusive levels, e.g., zip codes. The challenge for frequentist approaches to models with features possessing massive numbers of mutually exclusive levels is that inversion of the cross-products matrix is computationally insuperable in a single, full info iteration. By leveraging some variant of divide and conquer algorithms, creating random subsets of data, closed form solutions to matrix inversion become possible. — Mike Hunter, Nov 25 '17 at 19:12
