
Coming from a social science and epidemiology background, my coworkers were trained on least squares regression, logistic regression, and survival analysis. They like to see 95% confidence intervals and p-values alongside the parameter coefficients, and are distrustful of more recent predictive tools such as neural networks, CART, bagging and boosting, and penalized regression techniques.

RobertF
  • My short course is aimed at that audience, among others. Info including handouts is at the web site for the full semester version of the course: http://biostat.mc.vanderbilt.edu/CourseBios330. One of the many things I cover is why it is unreasonable to anti-log logistic regression coefficients to get odds ratios; this is in the context of allowing effects to be nonlinear and getting e.g. inter-quartile-range odds ratios (a sketch of this idea appears after these comments). – Frank Harrell Apr 07 '14 at 14:31
  • I like the following 2 books: An Introduction to Statistical Learning: with Applications in R (James, Witten, Hastie and Tibshirani) and The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Hastie, Tibshirani, Friedman). I also find that epidemiologists prefer parametric regression models (the GLM-type models you mention) for estimation and inference over modern semi-/non-parametric regression models. I suppose because much of their work focuses on discovery/explanation of risk/protective factors, rather than on generating flexible regression fits (for prediction)? – Chris Apr 08 '14 at 02:01
  • Thank you Chris, indeed I own the ESL textbook and it's been a valuable reference. I've found that epidemiologists have learned a set of statistical tools in graduate school for building experiments and assessing outcomes, and then reflexively use those same tools when building a model for, say, predicting future costs. It's very much like they're stuck in a 1980s time warp when it comes to statistical analysis. – RobertF Apr 08 '14 at 15:21
  • @RobertF: That is true for most professions (inertia?). However, epidemiologists are usually interested in explanatory models, and it's not always entirely clear how more novel predictive approaches like penalization ought to be utilized when, for example, one wishes to assess confounding or interaction on an exposure of interest. Frank Harrell's course, book, and his dept. site contain lots of useful material that can be applied in epidemiology as well. – Thomas Speidel Apr 15 '14 at 15:47
  • @ThomasSpeidel - I can see inertia setting in as researchers & professionals in the health care field move into middle management and beyond and are out of touch with new developments in statistics. The question of explanatory or predictive is certainly important for variable selection: for example, using past health costs to predict future costs is legitimate in a predictive model, but wouldn't fit into an explanatory model. However, even explanatory models ought to use penalized regression to shrink coefficients of correlated variables (which are unbiased for large samples but not necessarily for small ones). – RobertF Apr 15 '14 at 22:27
  • @RobertF: "even explanatory models ought to use penalized regression to shrink coefficients of correlated variables". I'm not sure: can we still interpret a penalized coefficient? – Thomas Speidel Apr 16 '14 at 02:16
  • @ThomasSpeidel - Yes, the coefficients from a penalized OLS regression still have the same interpretation as coefficients from a standard OLS regression. The penalized coefficients may be biased (shrunk towards zero) to control for multicollinearity. That throws some people: the fact that we're purposefully introducing bias into penalized regression to improve predictive accuracy. Also, there are no p-values or 95% confidence intervals associated with the parameter coefficients generated from a penalized regression, which makes some researchers uncomfortable. – RobertF Apr 16 '14 at 14:55
  • @RobertF: This is something I have a hard time grasping. If we are purposely biasing effect estimates to reduce overfitting, how can we treat them as if they weren't biased when we want to interpret them? – Thomas Speidel Apr 23 '14 at 20:05
  • @Chris One of the most common survival analysis tools epidemiologists use is a semi-parametric model. – Fomite Sep 30 '15 at 23:52
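
To make the inter-quartile-range odds ratio idea concrete, here is a minimal sketch using Frank Harrell's rms package on simulated data. The simulated variables and the 4-knot spline are illustrative assumptions, not taken from his course materials:

    # Minimal sketch (R, rms package): IQR odds ratio with a nonlinear effect
    library(rms)

    set.seed(1)
    age <- rnorm(500, 50, 10)
    # True log-odds is nonlinear in age, so anti-logging a single per-unit
    # coefficient would give a misleading odds ratio
    y <- rbinom(500, 1, plogis(-2 + 0.002 * (age - 50)^2))

    dd <- datadist(age)
    options(datadist = "dd")

    fit <- lrm(y ~ rcs(age, 4))  # restricted cubic spline, 4 knots
    # summary() reports the effect of moving age from its 25th to its 75th
    # percentile, i.e. an inter-quartile-range odds ratio
    summary(fit)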

1 Answer


I'm going to weigh in as an epidemiologist.

I can see inertia setting in as researchers & professionals in the health care field move into middle management and beyond and are out of touch with new developments in statistics.

First, I would strongly advise you not to assume this is simply inertia, either in the form of the discipline not wanting to adopt new techniques, or your coworkers falling out of touch with new developments in statistics. You can go to academic epidemiology conferences where new and very methodologically sophisticated work is being done, and still not necessarily find much on predictive modeling.

The hint is in the name. Predictive modeling.

Epidemiology, as a field, is not particularly interested in prediction for its own sake. Instead, its focus is on developing etiological explanations for observed disease patterns in a population. The two are related but distinct, and this often leads to something of a philosophical distrust of more modern classification and prediction techniques that purely attempt to maximize the predictive performance of a model. At the extreme end of this are people who hold that variable selection should be performed primarily with something like a directed acyclic graph, which could be considered the opposite of where predictive modeling is heading. This is largely why recent methodological developments in epidemiology have been concentrated in causal inference and systems models: both are built on etiological and causal arguments rather than prediction.
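
As a concrete illustration of DAG-based variable selection, here is a minimal sketch using the dagitty R package; the DAG is a made-up textbook-style example, not a real analysis:

    # Minimal sketch (R, dagitty package): choosing adjustment variables
    # from a causal DAG rather than by predictive performance
    library(dagitty)

    # Hypothetical DAG: age confounds the smoking -> cancer relationship;
    # yellow_fingers is a consequence of smoking, not a cause of cancer
    g <- dagitty("dag {
      age -> smoking
      age -> cancer
      smoking -> cancer
      smoking -> yellow_fingers
    }")

    # Returns { age }: adjust for the confounder only. yellow_fingers might
    # well improve prediction of cancer, but the DAG says to leave it out
    # when the goal is an etiological estimate
    adjustmentSets(g, exposure = "smoking", outcome = "cancer")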

The result is that predictive modeling is not part of their background, not something they encounter much in the literature, and, to be perfectly frank, something they have likely encountered only via people who don't actually understand the problems epidemiologists are trying to solve.

This, in the comments, is a perfect example:

That throws some people: the fact that we're purposefully introducing bias into penalized regression to improve predictive accuracy

Very nearly every epidemiologist I know, if forced to choose, would pick a reduction in bias over an increase in predictive accuracy.
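
To see the trade-off in miniature, here is a sketch with the glmnet R package on simulated, nearly collinear predictors; the penalty values are arbitrary illustrative choices:

    # Minimal sketch (R, glmnet package): ridge regression deliberately
    # biases coefficients toward zero to reduce variance
    library(glmnet)

    set.seed(1)
    x1 <- rnorm(200)
    x2 <- x1 + rnorm(200, sd = 0.1)   # nearly collinear with x1
    X  <- cbind(x1, x2)
    y  <- 1 + 2 * x1 + rnorm(200)

    # Unpenalized OLS: the effect splits unstably between x1 and x2
    coef(lm(y ~ x1 + x2))

    # Ridge (alpha = 0): coefficients shrink toward zero as lambda grows,
    # trading bias for lower variance and better out-of-sample prediction
    fit <- glmnet(X, y, alpha = 0, lambda = c(1, 0.1, 0))
    coef(fit)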

That is not to say it never comes up. There are times when predictive models do get used: often in clinical settings, where the prediction of a particular patient's outcome is of considerable interest, or in outbreak detection, where these techniques are useful because we don't know what's coming and can't make etiological arguments. Or when prediction really is the goal, as in many exposure estimation models. They're just somewhat niche in the field.

Fomite
  • I apologize if this is a bit offensive, I don't mean it to be. How much of an epidemiologist's training is in statistics and/or mathematics? Purely from my own experience, the epidemiologists I have met (and I've met a considerable number) have been ill-equipped statistically to use and interpret the models they have shown. A lot of them have been ignorant of basic concepts like multiple testing correction and other practical issues. I was wondering if you could comment on this. Have I simply met bad epidemiologists, or is it a discipline-wide phenomenon? Again, I hope that wasn't – Chris C Oct 01 '15 at 02:18
  • (cont.) offensive, and I think that lots of people in your field do fantastic work. – Chris C Oct 01 '15 at 02:18
  • @ChrisC Part of the problem is that, compared to, say, "statistician", epidemiologist is a very wide field. There are many people who can get away with 2x2 tables and math no more complex than long division, because for most local public health problems, that's enough. 1/n – Fomite Oct 01 '15 at 02:24
  • (cont.) I would also like to add that I am only a student, and may be completely mistaken, so please feel free to correct me. – Chris C Oct 01 '15 at 02:24
  • There are also some quirks of the field (Charlie Poole at UNC has an argument re: multiple comparison corrections being a flawed concept in epidemiology), and some lack of education, because most epi's are *users* of models and, to be frank, statistics programs are often utterly uninterested in teaching them. 2/n – Fomite Oct 01 '15 at 02:27
  • At the other end of the spectrum, you have some very sophisticated methodologists working on epi-focused problems (causal inference, systems models, competing risks, etc.) who are extremely knowledgeable. It all very much depends on what kind of work they do, their background, etc. n/n – Fomite Oct 01 '15 at 02:28
  • @ChrisC A particularly illustrative example I just remembered. At the same conference, in the *same session*, I was presenting a new (if somewhat derivative) approach to modeling seasonality using regression models with harmonic functions in them (a sketch of the general idea appears after these comments). The talk before mine? Pie charts. Both could very credibly be called "epidemiology". – Fomite Oct 01 '15 at 05:11
  • Thanks for the comments Fomite, I've also noticed the reluctance of pure statisticians to teach confused individuals. Nothing against either party, as everyone is busy, but it is definitely an issue. Also, I've witnessed the exact same thing as you at conferences. I was at an epi conference last summer listening to the oral presentations, and there were about 10 presentations that used nothing more than a $t$-test to back up their claims, and then came someone like yourself who presented a model so mind-bogglingly complex that I had a two-page list of terms to google. – Chris C Oct 01 '15 at 15:29
  • (cont.) Unfortunately, the student with the simplest-to-understand presentation ended up winning the oral presentation. I talked to the presenter with the complex model later on and learned quite a lot, but the audience really wasn't prepared for such a rigorous presentation. It's unfortunate, and it wasn't even because the student was a bad speaker; they were not. Additionally, I'll be looking into Dr. Poole's MTC essay; that's my field of research, so thanks for the tip. All the best, and thanks for your insightful opinions. :) – Chris C Oct 01 '15 at 15:32
  • @Fomite - Thank you for the insightful response. IMO the difference between explanatory vs. predictive regression models lies in the initial choice of variables used in the model (e.g., making sure an "effect" variable like lung cancer incidence is not predicting a "cause" variable like # cigarettes smoked per week). DAGs can certainly be an important part of this. – RobertF Oct 30 '15 at 17:21
  • @Fomite - (cont.) However, the same techniques for coefficient estimation and variable selection such as lasso, ridge regression, or elastic net can be applied to both predictive *and* explanatory models. If a researcher is still using outdated techniques like stepwise regression to select variables, he's got bigger problems than having a little bias in his model parameters. – RobertF Oct 30 '15 at 17:22
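
For readers unfamiliar with the harmonic-regression idea mentioned in the comments above, here is a minimal sketch on simulated monthly counts; it illustrates the general technique only, not Fomite's actual model:

    # Minimal sketch (R): modeling seasonality with harmonic (sine/cosine)
    # terms in a Poisson regression, period = 12 months
    set.seed(1)
    month <- 1:120   # ten years of monthly data
    mu    <- exp(2 + 0.8 * sin(2 * pi * month / 12)
                   + 0.3 * cos(2 * pi * month / 12))
    cases <- rpois(length(month), mu)

    # One sine/cosine pair captures a 12-month cycle; together the two
    # coefficients determine the amplitude and phase of the seasonal pattern
    fit <- glm(cases ~ sin(2 * pi * month / 12) + cos(2 * pi * month / 12),
               family = poisson)
    summary(fit)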