
Setting: Experimental data and control variables

I want to evaluate the (average) treatment effect in a randomized controlled trial. Individuals $i$ in one group received a treatment ($D_i=1$), and those in the other group received a placebo ($D_i=0$).

In order to reduce estimation uncertainty I control for pre-treatment variables that predict the outcome (e.g. survival is predicted by age, and this holds regardless of treatment). The regression equation I estimate is $$Y_i = \alpha + \beta D_i + \gamma X_i + \varepsilon_i,$$ where $X_i$ is a vector of covariates.

Selecting $X_i$

There are several aspects to consider when deciding on $X_i$. By design (randomization), the covariates will be uncorrelated (in expectation) with treatment status. Their inclusion may increase the precision of the $\beta$ estimate if they "suck up" noise in the outcome variable, but they might also increase estimation uncertainty by costing degrees of freedom.
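
As an aside, here is a minimal simulation sketch of that tradeoff (Python with numpy and statsmodels; the DGP, coefficient values, and variable names are all illustrative, not anything from the thread): a covariate that genuinely predicts the outcome shrinks the standard error of $\hat\beta$, because its contribution otherwise sits in the error term.

```python
# Illustrative simulation: a prognostic covariate shrinks SE(beta-hat).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, beta, gamma = 200, 0.5, 2.0

D = rng.integers(0, 2, n)            # randomized treatment indicator
X = rng.normal(size=n)               # pre-treatment covariate that predicts Y
Y = 1.0 + beta * D + gamma * X + rng.normal(size=n)

# Without the covariate: gamma * X is absorbed into the error term.
m0 = sm.OLS(Y, sm.add_constant(D.astype(float))).fit()
# With the covariate: the residual variance shrinks.
m1 = sm.OLS(Y, sm.add_constant(np.column_stack([D, X]))).fit()

print("SE(beta-hat) without X:", m0.bse[1])
print("SE(beta-hat) with X:   ", m1.bse[1])
```

Omitting $X$ here inflates the error variance from $1$ to $\gamma^2 + 1 = 5$, so the first SE should come out roughly $\sqrt{5} \approx 2.2$ times the second; pure-noise columns, by contrast, would cost degrees of freedom without reducing the residual variance.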

The choice of covariates is hence non-trivial and of course also leaves room for p-hacking.

My question: Do stepwise procedures (such as the stepwise elimination of variables with insignificant coefficients) invalidate inference regarding $\beta$?

My understanding is that such procedures generally lead to overfitting. Yet if I only apply them to variables within the vector $X_i$, and the only inference I care about concerns $\beta$, I fail to see how this would become a problem. What would be an example of a data generating process illustrating this?

Edit: Schematic example of a stepwise procedure (a code sketch follows the list).

  1. Let $X_i$ contain all baseline variables, transformations of those (e.g. logs, squares), and their interactions.
  2. Estimate $Y_i = \alpha + \beta D_i + \gamma X_i + \varepsilon_i$ using OLS.
  3. Drop from $X_i$ all variables for which the coefficient estimate in step 2 had a p-value above, say, 5%.
  4. If any variables were dropped in step 3, go to step 2; otherwise continue to step 5.
  5. Report the estimate for $\beta$ obtained in the last iteration of step 2.
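
For concreteness, a minimal sketch of steps 1 to 5 (Python with numpy and statsmodels; the function name `stepwise_beta`, the DGP, and the restriction of step 1 to squares rather than the full set of transformations and interactions are all my own simplifications):

```python
# Sketch of the stepwise procedure above.  D is always retained;
# only the columns of X are candidates for elimination.
import numpy as np
import statsmodels.api as sm

def stepwise_beta(Y, D, X, alpha=0.05):
    """Steps 2-5: fit, drop X-columns with p-values above alpha, repeat."""
    cols = list(range(X.shape[1]))
    while True:
        exog = sm.add_constant(np.column_stack([D, X[:, cols]]))
        fit = sm.OLS(Y, exog).fit()
        pvals = fit.pvalues[2:]                # skip intercept and D
        if len(cols) == 0 or pvals.max() <= alpha:
            return fit.params[1], fit.bse[1]   # step 5: beta-hat and its SE
        cols = [c for c, p in zip(cols, pvals) if p <= alpha]  # steps 3-4

# Step 1 (abbreviated): baseline variables plus squares as transformations.
rng = np.random.default_rng(1)
n = 100
B = rng.normal(size=(n, 5))                    # baseline variables
X = np.column_stack([B, B**2])                 # simple transformations
D = rng.integers(0, 2, n)
Y = 0.5 * D + rng.normal(size=n)               # every column of X is noise here

beta_hat, se_hat = stepwise_beta(Y, D, X)
print("beta-hat:", beta_hat, "  nominal SE:", se_hat)
```

Note that the reported SE in the final line is the nominal OLS standard error from the last fit, which is exactly the quantity the question asks about.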

My intuition is that this would yield unbiased estimates of $\beta$, but biased estimates of $SE[\hat\beta]$. But I might be wrong and am happy to be proven wrong. If estimates of $\beta$ were biased, I would be interested to hear why and to see an example of such a DGP.

EDIT2: I will accept the answer to this question that provides at least a schematic example of a DGP with which I would get incorrectly sized test results for the ATE under random treatment assignment (even if that only works in small samples).

sheß
  • Before this question derails into something too specific, I posted the coding-example follow-up question somewhere else: http://www.statalist.org/forums/forum/general-stata-discussion/general/1349822-where-is-the-problem-with-stepwise-proceedures – sheß Jul 19 '16 at 09:52
  • Stepwise selection does not improve any aspect of estimation. What made you think it did? – Frank Harrell Jul 20 '16 at 14:04
  • I'm struggling to appreciate the question. If there indeed is an effect $\beta\ne 0$, then the null hypothesis is false. Thus the appropriate sense of "bias" must be *conditioned* on the assignment into treatment and control. If that assignment *happens* to be orthogonal to the covariates, you will be fine; but if it is not, then there's nothing special or different about your data compared to any other dataset, and we can apply what is generally known about stepwise procedures. – whuber Jul 20 '16 at 14:46
  • @Frank, the idea would be that it improves something (regarding the point estimates) if little is known about which controls would improve estimation. Then the alternatives, (i) using no controls or (ii) using all controls, could both give less precise point estimates, no? Of course, using the right controls is always best, but they are unknown in most settings. – sheß Jul 20 '16 at 15:20

1 Answer


Yes, stepwise methods invalidate inference in this setting. Variables are retained because either (1) they are truly strong or (2) their effects are mis-estimated to be too far from zero; this creates a selection ("publication") bias. Even more clearly, variable selection results in a biased-low estimate of $\sigma^2$, which you can almost see just from the formula $$\hat{\sigma}^2 = \frac{\sum_i \hat{\varepsilon}_i^2}{n - p - 1},$$ where the numerator is the sum of squared residuals and $p$ is the number of slopes in the model. With stepwise variable selection, one could say that $p$ is dishonestly low: analysts take $p$ to be the number of retained variables, not the number of candidate variables. The formula mandates that $p$ be non-stochastic, i.e., pre-specified.
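
To see the mechanism numerically, here is a small Monte Carlo sketch (Python with numpy and statsmodels; the DGP is illustrative, and the selection is collapsed to a single significance screen rather than a full stepwise loop, which suffices to show the effect). With pure-noise candidates and a true $\beta = 0$, the mean of $\hat{\sigma}^2$ should fall below the true value of 1, and the nominal 5% test on $\beta$ should reject more often than 5%, at least in small samples.

```python
# Monte Carlo sketch: selection makes sigma^2-hat biased low and the
# nominal test on beta over-reject, even though D is randomized.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n, k, reps = 50, 20, 2000
sigma2_hats, rejections = [], 0

for _ in range(reps):
    D = rng.integers(0, 2, n)
    X = rng.normal(size=(n, k))              # pure-noise candidate covariates
    Y = rng.normal(size=n)                   # true beta = 0, sigma^2 = 1
    full = sm.OLS(Y, sm.add_constant(np.column_stack([D, X]))).fit()
    keep = full.pvalues[2:] <= 0.05          # one screening pass for brevity
    refit = sm.OLS(Y, sm.add_constant(np.column_stack([D, X[:, keep]]))).fit()
    sigma2_hats.append(refit.mse_resid)      # SSR / (n - p_retained - 1)
    rejections += refit.pvalues[1] <= 0.05   # nominal 5% test on beta

print("mean sigma^2-hat:", np.mean(sigma2_hats), "(true value: 1)")
print("rejection rate for beta:", rejections / reps, "(nominal: 0.05)")
```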

Generally speaking, it is a mistake to think that stepwise variable selection improves the estimation of $\beta$. Using outside knowledge to specify variables to include in the model can improve everything, but using the same dataset to select which parameters to estimate does not. Some of the information in the data is diverted into a vain attempt to answer the question "which predictors are important?"; that information is better spent on estimating the effects themselves.

Frank Harrell
  • Thank you. What I find confusing still is the following: "normal" p-hacking using control variables (i.e. picking controls until you find the lowest p-value for your variable of interest) is still somewhat different from the above procedure, no? Is it correct to say that normal p-hacking will lead to ATE estimates biased away from zero as well as overestimated precision, while such a stepwise procedure would only overestimate precision (by underestimating $\hat{\sigma}^2$)? – sheß Jul 14 '16 at 12:42
  • I don't think it matters, but I would need you to write out a proposed step-by-step algorithm to know for sure. – Frank Harrell Jul 14 '16 at 12:56
  • I don't understand how "it does [not] matter" would be an answer to my question. Can you clarify please? I will edit a proposed step-by-step algorithm into the original question. – sheß Jul 14 '16 at 13:03
  • The algorithm you described is a classic stepwise variable selection approach in which a few of the variables are mandated. It suffers from **all** the problems of stepwise variable selection. – Frank Harrell Jul 14 '16 at 13:45
  • Sure. The only thing that throws me off is that the mandated variable is randomized and hence independent of all the others. Doesn't that imply that the endogenous variable selection will not bias its estimate? – sheß Jul 14 '16 at 13:48
  • I believe that reduces the bias (but not for $\sigma^2$), but it doesn't make it zero. This is related to the bad practice in randomized clinical trials of using stepwise testing to decide which variables are included in a model. You can find meaningless chance imbalances that way and include variables because they are slightly collinear with treatment, which hurts the estimate of the randomized treatment effect. – Frank Harrell Jul 14 '16 at 16:19
  • I see, the bias probably only vanishes asymptotically. – sheß Jul 14 '16 at 16:21
  • For models that are nonlinear in $\beta$, even orthogonality doesn't protect from bias. – Frank Harrell Jul 16 '16 at 12:15
  • Can you elaborate on this? – sheß Jul 18 '16 at 09:44
  • The main idea is that things like odds ratios and hazard ratios are "non-collapsible", meaning that in nonlinear models that do not have an error variance, model misspecification cannot just inflate $\sigma^2$; it spills over into modifying the $\beta$s. – Frank Harrell Jul 18 '16 at 18:38
  • Thanks. I guess this thread becomes convoluted if I keep asking follow-up questions. However, I still find no convincing evidence that in my specific context backward stepwise methods cause issues, so I posted the question on Statalist instead: http://www.statalist.org/forums/forum/general-stata-discussion/general/1349822-where-is-the-problem-with-stepwise-proceedures – sheß Jul 20 '16 at 08:54
  • Destruction of the error variance, which affects every aspect of inference (except for $\beta$ point estimates, perhaps), is plenty of reason not to use stepwise methods in this context. – Frank Harrell Jul 20 '16 at 11:47
  • If other aspects improve (e.g. estimation of $\beta$), I might consider obtaining my standard errors differently (e.g. by bootstrapping), no? And as pointed out in the link above, I fail to find evidence that the particular statistic I'm interested in, in the context I'm interested in, is affected. Happy to be proven wrong. – sheß Jul 20 '16 at 11:48
  • I could imagine that bootstrapping (doing the whole model-building process on each of the bootstrap samples and then basing the estimate and SE, or confidence interval, on the estimates from the bootstrap samples) could get you to something that might be valid. Do I know for sure it would have great properties... No. But it certainly sounds more promising than what you originally suggested; at least it is not a case of doing model selection and then doing an analysis as if there had never been any model uncertainty and the selected model had been pre-specified from the start. A sketch of this idea follows below. – Björn Feb 21 '17 at 18:16
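
For what it's worth, a sketch of this bootstrap-the-whole-pipeline idea (Python with numpy and statsmodels; the one-pass screening rule `select_and_fit` stands in for whatever selection procedure is actually used, and nothing here guarantees valid coverage):

```python
# Sketch: bootstrap the entire selection-plus-estimation pipeline and
# take the SE of beta-hat from its distribution across resamples.
import numpy as np
import statsmodels.api as sm

def select_and_fit(Y, D, X, alpha=0.05):
    """One screening pass standing in for the selection rule; returns beta-hat."""
    full = sm.OLS(Y, sm.add_constant(np.column_stack([D, X]))).fit()
    keep = full.pvalues[2:] <= alpha
    refit = sm.OLS(Y, sm.add_constant(np.column_stack([D, X[:, keep]]))).fit()
    return refit.params[1]

rng = np.random.default_rng(3)
n, k = 100, 10
D = rng.integers(0, 2, n)
X = rng.normal(size=(n, k))
Y = 0.5 * D + X[:, 0] + rng.normal(size=n)   # one genuinely prognostic column

boot = []
for _ in range(999):
    idx = rng.integers(0, n, n)              # resample individuals with replacement
    boot.append(select_and_fit(Y[idx], D[idx], X[idx]))

print("beta-hat:", select_and_fit(Y, D, X))
print("bootstrap SE (selection redone each resample):", np.std(boot, ddof=1))
```

The point of the design is that the selection step is repeated inside every resample, so the reported SE reflects model-selection uncertainty instead of pretending the final model was pre-specified.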