
The problems with stepwise regression are well known to statisticians.

  1. It yields R-squared values that are badly biased to be high.
  2. The F and chi-squared tests quoted next to each variable on the printout do not have the claimed distribution.
  3. The method yields confidence intervals for effects and predicted values that are falsely narrow; see Altman and Andersen (1989).
  4. It yields p-values that do not have the proper meaning, and the proper correction for them is a difficult problem.
  5. It gives biased regression coefficients that need shrinkage (the coefficients for remaining variables are too large; see Tibshirani [1996]).
  6. It has severe problems in the presence of collinearity.
  7. It is based on methods (e.g., F tests for nested models) that were intended to be used to test prespecified hypotheses.
  8. Increasing the sample size does not help very much; see Derksen and Keselman (1992).
  9. It allows us to not think about the problem.
  10. It uses a lot of paper.

I want to focus on #5. I cannot access the Tibshirani paper, but I have run simulations that seem to show the coefficient estimates are unbiased, and I find that plausible: once the stepwise procedure has selected the features to include, we fit an ordinary linear regression, and under the usual assumptions the Gauss-Markov theorem takes over and gives us unbiased estimates of the coefficients.
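
For reference, the standard statement I have in mind, with $X$ taken as the design matrix of whatever model we end up fitting:

$$
y = X\beta + \varepsilon, \qquad \mathbb{E}[\varepsilon] = 0, \qquad \operatorname{Var}(\varepsilon) = \sigma^2 I
\;\;\Longrightarrow\;\;
\hat{\beta} = (X^\top X)^{-1} X^\top y \text{ is BLUE, and in particular } \mathbb{E}[\hat{\beta}] = \beta .
$$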

Why doesn't Gauss-Markov assure us of getting unbiased coefficient estimates?

(I realize that #2, #3, and #4 are huge problems for inference, whether the coefficients are biased or not, but I want to focus on the bias part for now.)

EDIT

Simulation code in R

library(MASS)
library(ggplot2)
set.seed(2021)
N <- 100   # replicates of the 2x2 factorial pattern (4 * N observations)
B <- 1000  # number of simulation iterations
x1 <- rep(c(0, 1), 2 * N)       # binary predictor, alternating 0/1
x2 <- rep(c(0, 0, 1, 1), N)     # binary predictor, balanced against x1
x12 <- x1 * x2                  # interaction term
y_hat <- 1 + x1 + 0 * x2 - x12  # true E[y]: beta0 = 1, beta1 = 1, beta2 = 0, beta12 = -1

b0_hat <- b1_hat <- b2_hat <- b12_hat <- rep(0, B)  # excluded terms stay recorded as 0

# Define function (from Hadley Wickham) to suppress function printouts
# https://stackoverflow.com/a/54136863
#
quiet <- function(x) { 
  sink(tempfile()) 
  on.exit(sink()) 
  invisible(force(x)) 
} 

for (i in 1:B){
  y <- y_hat + rnorm(length(y_hat), 0, 0.25)
  d <- data.frame(x1 = x1, x2 = x2, x12 = x12, y = y)
  L <- lm(y ~ ., data = d)

  aic <- quiet(MASS::stepAIC(L))
  cf <- coef(aic)  # coefficients of the model selected by stepwise AIC

  b0_hat[i] <- cf[["(Intercept)"]]
  if ("x1"  %in% names(cf)) b1_hat[i]  <- cf[["x1"]]
  if ("x2"  %in% names(cf)) b2_hat[i]  <- cf[["x2"]]
  if ("x12" %in% names(cf)) b12_hat[i] <- cf[["x12"]]
  
  if (i < 6 | i %% 50 == 0){
    print(paste(i/B*100, "% complete", sep = ""))
  }
  
}

d0 <- data.frame(estimate = b0_hat, coefficient = "Intercept")
d1 <- data.frame(estimate = b1_hat, coefficient = "beta1")
d2 <- data.frame(estimate = b2_hat, coefficient = "beta2")
d12 <- data.frame(estimate = b12_hat, coefficient = "beta12")
d <- rbind(d0, d1, d2, d12)
ggplot(d, aes(x = estimate, fill = coefficient)) +
  geom_density() +
  facet_grid(rows = vars(coefficient)) +
  theme_bw()
t.test(b0_hat, mu = 1)
t.test(b1_hat, mu = 1)
t.test(b2_hat, mu = 0)
t.test(b12_hat, mu = -1)

Perhaps a key takeaway is that I would be including interaction terms, so even if the experimental design is orthogonal in the main effects (I don't think mine will be, but maybe), the interaction terms will introduce correlation among the columns of the design matrix.
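
As a quick check of that point with the variables defined above (not part of the original simulation, just an illustration):

cor(cbind(x1, x2, x12))
# x1 and x2 are orthogonal (correlation 0), but the interaction x12 is
# correlated with each of them (roughly 0.58 in this design)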

Nonetheless, those coefficient estimates sure don't look biased, and their $95\%$ confidence intervals contain the true parameter values (ditto when I do $50000$ iterations). With $1000$ iterations of the simulation, that should give us good power to detect even fairly small bias.

Dave
  • See https://stats.stackexchange.com/questions/20836/algorithms-for-automatic-model-selection – Tim Jun 30 '21 at 12:46
  • Can you give some more details on your simulations? – Stephan Kolassa Jun 30 '21 at 13:01
  • Maybe this is a trivial point, but the problem you're describing isn't that the model under-performs due to high bias. The problem is over-fitting, which means that the model under-performs due to high variance. – Jdclark Jun 30 '21 at 13:56
  • I cannot see how Gauss-Markov applies, because it doesn't take into consideration either the inclusion of extraneous explanatory variables or the omission of necessary explanatory variables: it assumes you already know the right variables to use and that the form of your regression model is the correct one. Neither is assured with stepwise regression. Thus, this situation shares many characteristics of regression with missing variables, which has known biases. – whuber Jun 30 '21 at 15:57
  • @StephanKolassa I define a bunch of variables $X$ and a parameter vector $\beta$ that is mostly zeros, and I calculate $\mathbb{E}[y] = X\beta$. Then I loop 1000 times, adding in noise $\epsilon$ to get $y = \mathbb{E}[y] +\epsilon$. In each iteration of the loop, I fit a stepwise model based on `lm(y ~ ., data = X)` and extract the coefficient estimates (calling the estimates zero if the corresponding variable is excluded from the final stepwise model). – Dave Jun 30 '21 at 17:31
  • Can you share your simulation code? – Mark White Jun 30 '21 at 21:31
  • Are you considering parameter estimates to be 0 if the variable is not included in the model by the stepwise regression procedure? – AdamO Jun 30 '21 at 23:30
  • Stepwise model building is also about as likely to include false predictors as true predictors, and about as likely to exclude true predictors and false predictors. Stepwise model building is the statistical equivalent of waving goat entrails over a fire and claiming oracular knowledge as a result. – Alexis Jul 01 '21 at 01:40
  • @AdamO Yes, I am. Should I be skipping those, or am I right to consider those to be estimated as zero? – Dave Jul 01 '21 at 03:00
  • @StephanKolassa I have included simulation code. The original simulation I did used forward, backward, and "both" in the `MASS::stepAIC` function, but this should give the idea. – Dave Jul 01 '21 at 12:52

1 Answer


TL;DR: Yes, it does return biased coefficients.

I believe this is incorrect, if we are using the word "bias" in the way it is used in statistics and machine learning. (EDIT: Actually it IS correct, just a little confusing; I explain below in the edit. Leaving this here so others can track my thought process, since it is a little tricky, for me at least!)

It gives biased regression coefficients that need shrinkage (the coefficients for remaining variables are too large; see Tibshirani [1996]).

I think they are using "biased" in the colloquial sense of "not good." Doing stepwise regression for variable selection is going to harm your inference, because the p-values are going to be a mess since you "cherry-picked" the best variables. (There is a whole literature on this under the name "post-selection inference," specifically with regard to variable selection methods like stepwise or the lasso.)

The problem with stepwise is that the coefficients are not biased; rather, they have high variance and are sensitive to changes in the data they are fit on. In the long run, as you find in your simulations, they converge to the "true" coefficients and are thus unbiased.

But in the real world, we don't get to run many studies, we don't know the "true" coefficient, and our model will always be an oversimplification that never fully captures the true data-generating process. So we actually introduce bias.

It's weird, because shrinkage methods are a way of introducing bias in an attempt to decrease variance. I wouldn't read too much into it—I really think they're just using the word "biased" in a colloquial sense instead of a statistical or machine learning sense.

I re-read Tibshirani (1996), and stepwise methods aren't included in the simulations there. (Apparently the list is originally due to Frank Harrell, who is on this site and is, uh, exponentially smarter than I am, so I'll see if I can find anything there.)


EDIT: OK, so I think I know where this argument comes from, after reading DOI: 10.1016/j.jclinepi.2011.06.016 and DOI: 10.1016/s0895-4356(99)00103-1.

The problem here is that the p-values are used for selection, and multiple regression controls for other variables, but only the variables that are in the model. If we let these invalid p-values and standard errors decide which variables are included, we end up overestimating the coefficients (bias away from zero), because whether a coefficient gets estimated at all depends on its p-value and standard error.

That first reference I pointed to above (10.1016/j.jclinepi.2011.06.016) makes a pretty good analogy:

An intuitive example includes publication bias, where trials with statistically significant results have a higher chance of being published, leading to too extreme effect estimates in summaries of published reports.

And they do show simulation studies where stepwise biases the coefficients by estimating them to be too large.

Think of it this way: stepwise methods will only include a variable on the occasions when it soaks up enough variance to be significant. This biases the coefficients upward because, in the long run, we never even see the estimates of a truly non-zero coefficient that do not happen to reach significance. The selection truncates the distribution of estimates we could get, which moves the mean away from zero. That's bias.
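
A minimal sketch of that truncation effect, stripped of the stepwise machinery (my own illustration with arbitrary values for the sample size and true slope, not from the papers above): keep the slope estimate only when its t test is significant and average the kept estimates.

set.seed(1839)
n <- 30                         # small sample, so power is well below 1
kept <- replicate(5000, {
  x <- rnorm(n)
  y <- 0.25 * x + rnorm(n)      # true slope is 0.25
  fit <- summary(lm(y ~ x))$coefficients
  if (fit["x", "Pr(>|t|)"] < 0.05) fit["x", "Estimate"] else NA
})
mean(kept, na.rm = TRUE)        # noticeably larger than 0.25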

I was confused because often the goal is not to estimate unbiased coefficients; it's to make good predictions. Indeed, we deliberately bias our coefficients toward zero (lasso) or toward the group mean (mixed-effects models) to decrease variance.
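
For reference, that shrinkage is explicit in the lasso objective of Tibshirani (1996), where the $\ell_1$ penalty $\lambda \lVert \beta \rVert_1$ pulls the estimates toward zero:

$$
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \; \frac{1}{2} \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1 .
$$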


ANOTHER EDIT: Since I wasn't happy with my flip-flopping, I did my own simulation:

library(tidyverse)
set.seed(1839)             # set seed for randomization
n <- 100                   # sample size
B <- c(0, .25, rep(0, 9))  # intercept 0, beta1 = .25, betas 2 through 10 = 0
eps_sd <- 1                # sd of the error
iter <- 10000              # do it 10000 times

res <- map(seq_len(iter), \(zzz) {
  # make data:
  X <- cbind(1, replicate(length(B) - 1, rnorm(n, 0, 1)))
  y <- X %*% B + rnorm(n, 0, eps_sd)
  dat <- as.data.frame(X[, -1])
  dat <- cbind(dat, y)
  
  # do stepwise:
  m0 <- lm(y ~ 1, dat) # empty model
  m1 <- lm(y ~ ., dat) # saturated model
  fw <- step(m0, direction = "forward", scope = formula(m1), trace = 0)
  
  # return the coefficient for V1 if it was selected, NA otherwise:
  if ("V1" %in% names(coef(fw))) coef(fw)[["V1"]] else NA
})

# check it
res %>% 
  unlist() %>% 
  summary()

That should show you that the mean estimated coefficient for V1 is about .28, which is larger than the true value of .25.

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
-0.1947  0.2133  0.2699  0.2793  0.3357  0.7255    1435 
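
To put a rough interval around that mean instead of eyeballing it (a quick sketch; this drops the NA iterations, so it describes only the selected estimates):

est <- unlist(res)
t.test(est[!is.na(est)], mu = .25)  # mean of the selected V1 estimates vs. the true .25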

The main thing you might be running into with your simulation: if the effect size is large enough (the power high enough) that the coefficient happens to be selected on every single iteration, then it will be unbiased, because there is no selection bias when everything gets selected. What you need to do is make the sample size small enough, or the effect weak enough, that the model does NOT select the term every time; then there is a selection bias toward estimates that are far enough from zero to be significant.
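
For instance, in the question's simulation one way to do this (a sketch with illustrative numbers that may need tuning so that the terms actually get dropped on a nontrivial fraction of iterations) is to shrink the true effects and/or inflate the error standard deviation:

y_hat <- 1 + 0.1 * x1 + 0 * x2 - 0.1 * x12  # weaker true effects
y <- y_hat + rnorm(length(y_hat), 0, 1)     # inside the loop: larger error sd than 0.25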

(Note that I have 1435 NA values where the coefficient was not estimated; this is where the selection bias comes from: those iterations should be included in the calculation of the mean coefficient, but they aren't, because forward stepwise dropped them.)

Note that my simulation uses one coefficient of .25 and the rest 0, with all of the predictors uncorrelated. Things might get funkier when there are correlated variables.

Mark White
  • Which first reference, Altman and Andersen at the Stata link? – Dave Jun 30 '21 at 21:43
  • The first reference I linked to: 10.1016/j.jclinepi.2011.06.016 – Mark White Jun 30 '21 at 21:53
  • @Dave I added my own simulation and noted why your simulation might be giving you different results. – Mark White Jun 30 '21 at 22:05
  • I will look at your simulation in detail, but if the bias is $0.28$ when the true value is $0.25$, that would explain why I had trouble seeing bias. I just looked graphically, expecting to see major bias. Do you have a confidence interval for the coefficient estimate? Sure, you got $0.28$ as a point estimate, but if the confidence interval is $(0.22, 0.34)$, I would not be so comfortable calling it a biased estimate. // In my simulation, when a variable is not included, I count the coefficient as zero. – Dave Jul 01 '21 at 01:36
  • The sample size is arbitrary, since it's the number of simulations. I could just up the number of simulations until the confidence interval was narrow enough to not include .25. The point being that those `NA`s due to non-inclusion are the selection bias. There will still be bias, even if you count them as zero, in the case of correlated predictors. The model will include the one explaining more variance, and ditch the other one. The model now no longer controls for the other variable, which would have made the coefficient smaller—another example of how bias can creep in. – Mark White Jul 01 '21 at 12:43
  • If the estimates are unbiased, then increasing the sample size will not wind up rejecting beyond the expected rate. As the sample size increases, we would expect narrowing confidence intervals (whether the estimates are biased or not), but we also would expect, in the case of unbiased estimates, convergence of the point estimates to the true value of $0.25$. // I just added my simulation code and am in the process of running $50000$ simulations instead of just $1000$. I suspect the confidence intervals will contain the true values, even with the large sample size. – Dave Jul 01 '21 at 12:50
  • I have an idea for another simulation—I will edit and include it when I can. Thanks for sharing your code—it helps, as well. – Mark White Jul 01 '21 at 16:40
  • I tried your simulation. The $95\%$ confidence interval is completely above $0.25$. If you count the NA values as $0$s, the $95\%$ confidence interval is completely *below* $0.25$, though I am skeptical about the $t$-based procedure for the confidence interval. Perhaps a bootstrap confidence interval would be in order. – Dave Jul 06 '21 at 00:07