
I am well aware of the problems of stepwise/forward/backward selection in regression models. There are numerous cases of researchers denouncing the methods and pointing to better alternatives. I was curious whether there are any stories where a statistical analysis:

  • has used stepwise regression;
  • made some important conclusions based on the final model; and
  • those conclusions were wrong, resulting in negative consequences for the individual, their research, or their organisation.

My thought on this is that if stepwise methods are bad, then there should be consequences in the "real world" for using them.

  • If you don't find any such stories, it might be because stepwise regression is mostly used in basic research (or so I perceive). Basic researchers don't usually get in trouble for being wrong, so long as they didn't fake the data or something. – Kodiologist Aug 19 '16 at 18:21
  • It's used a lot in industry and in the classroom. In research the authors probably would not disclose that they used it. In industry the two main reasons are that a) those who are doing it were not trained in research, e.g. have undergraduate degrees, or b) graduated decades ago. – Aksakal Nov 23 '16 at 20:25
  • @Aksakal Not learning to begin with but getting a sheepskin anyway is the problem, not elapsed time. *Exempli gratia*, me. I took one stats course circa 1971, and first used stats in a publication circa 2006. – Carl Nov 23 '16 at 21:06
  • Related: [Under torture, the data may yield false confessions. Examples?](https://stats.stackexchange.com/q/323227/) – gung - Reinstate Monica Jan 15 '18 at 18:00

1 Answer


There is more than one question being asked here. The narrowest one asks for an example of stepwise regression causing harm because it was performed stepwise. Such harm has surely occurred, but it can only be established unequivocally when the data used for the stepwise regression are also published, someone reanalyses them, and a peer-reviewed correction is published along with a retraction by the primary authors. To make accusations in any other context risks legal action; and if we use a different data set we might suspect that a mistake was made, but statistics never proves anything, and we could not establish the mistake "beyond a reasonable doubt".

As a point of fact, one frequently gets different results depending on whether one performs stepwise elimination (backward) or stepwise buildup (forward) of a regression equation, which suggests that neither approach is correct enough to recommend its use (a sketch illustrating this instability is given below). Clearly something else is going on, and that brings us to a broader question, also asked above in bullet form, which amounts to: "What are the problems with stepwise regression, anyhow?" That is the more useful question to answer, and it has the added benefit that no lawsuit will be filed against me for answering it.
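
As an illustration of that instability (not part of the original argument; it uses synthetic data and the Python packages numpy, pandas, and statsmodels), here is a minimal sketch of p-value-based forward selection and backward elimination applied to the same collinear data. The two procedures need not select the same variables, and the selection can change with the random seed; nothing guarantees agreement.

```python
# Minimal sketch: forward selection vs. backward elimination on the same
# synthetic, collinear data. Purely illustrative; the threshold and the data
# are arbitrary choices, not a recommendation.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)       # nearly collinear with x1
x3 = rng.normal(size=n)
y = x1 + 0.5 * x3 + rng.normal(scale=2.0, size=n)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

def forward(X, y, alpha=0.05):
    """Add, one at a time, the candidate with the smallest p-value < alpha."""
    chosen, remaining = [], list(X.columns)
    while remaining:
        pvals = {c: sm.OLS(y, sm.add_constant(X[chosen + [c]])).fit().pvalues[c]
                 for c in remaining}
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:
            break
        chosen.append(best)
        remaining.remove(best)
    return chosen

def backward(X, y, alpha=0.05):
    """Start with all predictors; drop the worst p-value until all < alpha."""
    chosen = list(X.columns)
    while chosen:
        pvals = sm.OLS(y, sm.add_constant(X[chosen])).fit().pvalues.drop("const")
        worst = pvals.idxmax()
        if pvals[worst] <= alpha:
            break
        chosen.remove(worst)
    return chosen

print("forward selection chose:   ", forward(X, y))
print("backward elimination chose:", backward(X, y))
```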

Doing stepwise MLR right means 1) using physically correct units (see below); 2) choosing appropriate variable transformations for the best correlations and error distribution type (for homoscedasticity and physicality); 3) examining all permutations of variable combinations, not a stepwise subset of them; and 4) performing exhaustive regression diagnostics, so that one does not miss high-VIF (collinearity) variable combinations that would otherwise be misleading. The reward is better regression. A sketch of points 3) and 4) follows.
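
Here is a minimal sketch of what 3) and 4) might look like in practice (again with synthetic data and statsmodels; the AIC criterion and the VIF > 10 flag are conventional but arbitrary choices, not taken from the answer above): every subset of predictors is fitted, and each fit is reported together with its largest VIF, so that collinear combinations are flagged rather than silently accepted.

```python
# Minimal sketch of all-subsets regression with a collinearity diagnostic.
from itertools import combinations
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)       # deliberately collinear with x1
x3 = rng.normal(size=n)
y = x1 + 0.5 * x3 + rng.normal(scale=2.0, size=n)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

results = []
for k in range(1, len(X.columns) + 1):
    for subset in combinations(X.columns, k):
        exog = sm.add_constant(X[list(subset)])
        fit = sm.OLS(y, exog).fit()
        # VIF for each predictor (index 0 is the constant, so skip it)
        max_vif = max(variance_inflation_factor(exog.values, i)
                      for i in range(1, exog.shape[1]))
        results.append((subset, fit.aic, max_vif))

for subset, aic, max_vif in sorted(results, key=lambda r: r[1]):
    flag = "  <-- high collinearity" if max_vif > 10 else ""
    print(f"{subset}: AIC = {aic:.1f}, max VIF = {max_vif:.1f}{flag}")
```

Three predictors give only seven subsets; with many predictors the exhaustive loop grows as 2^p, which is the practical price of "all of them".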

As promised for #1 above, we next explore the correct units for a physical system. Since good results from regression are contingent upon the correct treatment of variables, we need to be mindful of the usual dimensions of physical units and balance our equations appropriately. Also, for biological applications, an awareness of, and an accounting for, the dimensionality of allometric scaling is needed.

Please read this example of a physical investigation of a biological system for how to extend the balancing of units to biology. In that paper, steps 1) through 4) above were followed and a best formula was found using extensive regression analysis, namely $GFR=k\,W^{1/4}V^{2/3}$, where $GFR$ is glomerular filtration rate, a marker of catabolism. The units are understood using fractal geometry: $W$, weight, is treated as a four-dimensional fractal geometric construct, and $V$, volume, as a Euclidean, or three-dimensional, variable. Then $1=\frac{1}{4}\cdot\frac{4}{3}+\frac{2}{3}$, so that the formula is dimensionally consistent with metabolism. That is not an easy statement to grasp. Consider that 1) it is generally unappreciated (unknown) that $GFR$ is a marker of metabolism, and 2) fractal geometry is only infrequently taught, and the physical interpretation of the formula presented is difficult to grasp even for someone with mathematical training.
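
For readers who want to see the mechanics rather than the physics, here is a minimal, purely illustrative sketch (synthetic numbers, not the data from the paper referenced above) of how such a power-law formula can be fitted by ordinary least squares after a logarithmic transformation, $\ln GFR = \ln k + a\ln W + b\ln V$, with the estimated exponents compared against the dimensionally motivated values $a=1/4$ and $b=2/3$:

```python
# Minimal sketch: fitting GFR = k * W^a * V^b by log-log least squares.
# The data below are synthetic and exist only to make the sketch runnable.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 150
W = rng.uniform(40, 120, size=n)                 # "weight", arbitrary units
V = rng.uniform(2, 6, size=n)                    # "volume", arbitrary units
gfr = 5.0 * W**0.25 * V**(2 / 3) * np.exp(rng.normal(scale=0.05, size=n))

# ln(GFR) = ln(k) + a*ln(W) + b*ln(V) is linear in ln(W) and ln(V)
design = sm.add_constant(np.column_stack([np.log(W), np.log(V)]))
fit = sm.OLS(np.log(gfr), design).fit()
ln_k, a, b = fit.params
print(f"k ~ {np.exp(ln_k):.2f},  a ~ {a:.3f} (vs 1/4),  b ~ {b:.3f} (vs 2/3)")
```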

Carl
  • This seems to describe a problem with regression in general, rather than stepwise regression specifically. – Accidental Statistician Nov 23 '16 at 20:32
  • @AccidentalStatistician Indeed it does, and I believe it is not advisable to present the issue otherwise. However, notice the inclusion of the following in that message: "and if one uses all permutations of variable combinations, not step-wise, all of them, and if one performs exhaustive regression diagnostics then one avoids missing high VIF (collinearity) variable combinations that would otherwise be misleading." – Carl Nov 23 '16 at 20:37
  • @AccidentalStatistician I should add that investigation of high collinearity variable combinations can lead to the discovery of non-linear variable relationships, so that high collinearity is not an end-point, it requires systematic further investigation to discover its cause. – Carl Nov 23 '16 at 20:47
  • Yes, these are aspects of regression to consider in general. If I understand correctly where the question is coming from, though, it's motivated by stepwise regression often being denounced in favour of using the likes of LASSO, which wouldn't address the concerns you give here. – Accidental Statistician Nov 23 '16 at 21:03
  • I'm a little confused about your "units" point. The BMI example is trivially wrong because BMI is defined as a different combination of the factors on the right hand side, and no coefficients will ever fit correctly. However, you can imagine predicting (e.g.) 40 yard dash times (sec) from weight (kg), height (m), and other factors, none of which are in temporal units. In this case, the weight coefficient would be something like dkg/dsec. In fact, I can't think of too many examples where the predictors and prediction have the same units, except maybe time series. – Matt Krause Nov 23 '16 at 21:05
  • @MattKrause What you are describing is heuristic, which I am discouraging. The units should combine to form a sound theoretical construct or the result is not physics. For example, for a formula I found using regression, $GFR=k\,W^{1/4}V^{2/3}$, the units are understood using fractal geometry, where GFR is glomerular filtration rate, a marker of catabolism, W is weight, a four dimensional fractal geometric construct, and V, volume, is called a Euclidean, or three dimensional variable. Then $1=\frac{1}{4}\cdot\frac{4}{3} + \frac{2}{3}$, so that the formula is metabolically dimensionally consistent. – Carl Nov 23 '16 at 21:25
  • I appreciate your frankness and your good will in this matter, Carl. I will not deny that voting has its problems. The only effective way I know of changing the voting on a post is to change the answer--either to improve it technically, expand on it, or to communicate the ideas differently--and even then there's no guarantee it will get the desired response (or even any response at all!). Sometimes, respectful efforts made to *understand* the downvoters will elicit information that helps everyone appreciate (and upvote) such efforts at improving a post. – whuber Nov 23 '16 at 23:30
  • In the spirit, hopefully, of @whuber's comment above, my initial comments are off-target. My problem with the answer is not that it talks about problems with regression in general, it's that it does so without giving the sort of example the question asks for. – Accidental Statistician Nov 23 '16 at 23:41
  • @Carl I'd think that if you're getting regular downvotes the first thing to do is to consider how you might improve your posts (and often you have comments under them that suggest improvements). Speaking for myself, even where I disagree with a commenter, it turns out that they often raise issues that lead to a better answer anyway. I will say that I regularly notice issues with your answers that would nearly move me to downvote them myself. Where I have time to do so, I try to leave a comment. – Glen_b Nov 23 '16 at 23:45
  • I'm not sure if I understand you correctly but in one part you seem to be suggesting that all-subsets can avoid the problems of stepwise regression. Is that what you're saying? – Glen_b Nov 23 '16 at 23:51
  • @Glen_b All-subsets avoids some of the problems of MLR. It can show non-linearity, interference, or covariance between variables that just do not work properly together. However, it does not address the all-important question, "Should I be regressing this, this way, anyway?" For example, taking logarithms to find power function regressions is probably underutilized. Moreover, the substitution of "non-linear terms" in the multiple linear regression can do wonders; as a trivial example, $Y=A-B\ln(X)+\cdots$ may work much better than $Y=mX+b+\cdots$ – Carl Nov 24 '16 at 00:02
  • Note that many of the problems of stepwise regression -- such as estimates biased away from 0, standard errors biased toward 0, nominal type I error rates much lower than actual, and a variety of other problems -- are still present with all-subsets; indeed, it's an issue with almost any form of optimizing (chapter 4 of Frank Harrell's *Regression modeling strategies* is a useful reference). Shrinkage/regularization can mitigate some of these issues (especially the tendency of selection to bias estimates outward) and out-of-sample assessment is an important tool for many of them. – Glen_b Nov 24 '16 at 00:12
  • @Glen_b Indeed, and the shrinkage can be chosen to actually accomplish something useful, as in targeting the objective regression goal that motivated our performing the regression in the first place. However, in doing the latter we abandon goodness-of-fit and maximum likelihood regression, so there is a price to pay. Note as well that multiple Theil regression can be performed to reduce bias in some cases. Where can I get Chapter 4? Is it downloadable? – Carl Nov 24 '16 at 00:22
  • If we're entertaining selection between large numbers of non-nested models we have already abandoned most of the usual theory related to likelihood anyway, since it's predicated on knowing\* the very things that we're selecting (and optimizing) over -- what's in the model. $\qquad$ \*(at least knowing it down to a simple comparison of nested alternatives) – Glen_b Nov 24 '16 at 00:33
  • @MattKrause Regarding 40 yard dash times (sec) from weight (kg) and height (m): before I did anything I would look up the allometry for height versus stride length, etc. Without doing that, I suspect that $\frac{1}{T_{40}}\approx k W^a H^b A^c$, that $W$ and $H$ are highly collinear, and that $A$, age in years, reduces some of the collinearity, and I would not be satisfied with that. I think it needs a lot of thinking. – Carl Nov 24 '16 at 01:12