How does hypothesis testing with multiple regression relate to a reduced best subsets model?

Question

Let's say we have the model:

$Y_i=\beta_0+\beta_1 a_1+\beta_2 b_2+\beta_3 c_3+\beta_4 d_4+\beta_5 e_5+\beta_6 f_6+\beta_7 g_7$

The goal is descriptive: we want to find out which of the variables $(a,b,...,f,g)$ have a relationship with $Y$. For each variable, we test the null hypothesis that its coefficient is zero, $\beta=0$.

Now suppose we use the best subsets technique to reduce the model and find that an optimal reduced model is, say:

$Y_i=\beta_0+\beta_2 b_2+\beta_5 e_5+\beta_6 f_6$

We'll still have a model output with coefficients and p-values for testing hypotheses about these $\beta$ values. But what about the variables from the full model that were dropped from the reduced model? I'm pretty sure you can't conclude they're insignificant just because they're not in the reduced model.

Or should you conduct the hypothesis testing on the full model, and then for the sake of forming a simple model you conduct best subsets?

score 2 · Accepted Answer · answered Nov 04 '21 at 04:31

There are a lot of problems with "best subsets" regression, and with the closely related stepwise procedures (forward or backward selection). Frank Harrell has outlined some of those problems here and in his book Regression Modeling Strategies.

In essence, the estimates and p values reported by the final fit do not account for the fact that the variables were chosen by a data-driven selection step. As a result, the estimates are biased away from zero and the p values are biased downwards. The most faithful procedure is to prespecify a model a priori and test that model, leaving in the prespecified variables even if we fail to reject the null. As I note here, failing to reject the null does not mean a variable has zero effect; it only means the estimated effect is small relative to its uncertainty.
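To make the selection bias concrete, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration, not something from the question: BIC as the "best subset" criterion, 7 standard-normal predictors, a sample size of 50, and a single weak true effect of 0.2 on the first predictor. It compares the estimate of that coefficient in the prespecified full model with its estimate in the selected model, restricted to the runs where selection kept the variable.

```python
import itertools
import numpy as np

def fit_ols(X, y, cols):
    """Least-squares fit with an intercept on the given columns; returns (coef, rss)."""
    design = np.column_stack([np.ones(len(y)), X[:, list(cols)]])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ coef) ** 2))
    return coef, rss

def best_subset(X, y):
    """Exhaustive search over all subsets; returns the columns with the lowest BIC."""
    n, p = X.shape
    best_bic, best_cols = np.inf, ()
    for k in range(p + 1):
        for cols in itertools.combinations(range(p), k):
            _, rss = fit_ols(X, y, cols)
            bic = n * np.log(rss / n) + (k + 1) * np.log(n)
            if bic < best_bic:
                best_bic, best_cols = bic, cols
    return best_cols

# Assumed simulation settings (illustrative only).
rng = np.random.default_rng(1)
n, p, true_beta, n_sims = 50, 7, 0.2, 300

full_estimates, selected_estimates = [], []
for _ in range(n_sims):
    X = rng.normal(size=(n, p))
    y = true_beta * X[:, 0] + rng.normal(size=n)

    # Prespecified full model: always keep every variable.
    coef_full, _ = fit_ols(X, y, range(p))
    full_estimates.append(coef_full[1])

    # Selected model: the estimate is only reported when the variable survives.
    cols = best_subset(X, y)
    if 0 in cols:
        coef_sel, _ = fit_ols(X, y, cols)
        selected_estimates.append(coef_sel[1 + cols.index(0)])

mean_full = float(np.mean(full_estimates))
mean_selected = float(np.mean(selected_estimates))
print(f"true effect:                   {true_beta}")
print(f"mean estimate, full model:     {mean_full:.3f}")
print(f"mean estimate, when selected:  {mean_selected:.3f}")
print(f"share of runs where selected:  {len(selected_estimates) / n_sims:.2f}")
```

The full-model average should sit near the true value, while the average among selected fits should be noticeably larger, because the variable only survives selection in samples where its estimate happened to come out large.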

In short: Do not do best subsets regression if you intend to perform a hypothesis test. If you do perform best subsets regression, you should not evaluate any hypotheses and instead bootstrap your variable selection and report selection frequencies for transparency.
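As a rough sketch of what bootstrapping the variable selection could look like: resample the rows with replacement, rerun the whole selection procedure on each resample, and report how often each variable ends up in the chosen model. The specifics below are assumptions for illustration only: BIC as the selection criterion, 100 bootstrap resamples, and simulated data in which only `b`, `e` and `f` have true effects (mirroring the reduced model in the question).

```python
import itertools
import numpy as np

def best_subset(X, y):
    """Exhaustive search over all subsets; returns the columns with the lowest BIC."""
    n, p = X.shape
    best_bic, best_cols = np.inf, ()
    for k in range(p + 1):
        for cols in itertools.combinations(range(p), k):
            design = np.column_stack([np.ones(n), X[:, list(cols)]])
            coef, *_ = np.linalg.lstsq(design, y, rcond=None)
            rss = np.sum((y - design @ coef) ** 2)
            bic = n * np.log(rss / n) + (k + 1) * np.log(n)
            if bic < best_bic:
                best_bic, best_cols = bic, cols
    return best_cols

def selection_frequencies(X, y, n_boot=100, seed=0):
    """Share of bootstrap resamples in which each column is selected."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)      # resample rows with replacement
        for j in best_subset(X[idx], y[idx]):  # redo the selection from scratch
            counts[j] += 1
    return counts / n_boot

# Simulated data (an assumption for illustration): only b, e, f matter.
names = list("abcdefg")
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 7))
y = X[:, 1] + X[:, 4] + X[:, 5] + rng.normal(size=100)

freq = selection_frequencies(X, y, n_boot=100, seed=1)
for name, f in zip(names, freq):
    print(f"{name}: selected in {f:.0%} of resamples")
```

Variables that are selected in nearly every resample are stably chosen; variables that come and go across resamples show how fragile the single "best" subset is. These frequencies describe the stability of the selection and are not p values.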

Can you elaborate on what you mean by "bootstrap your variable selection and report selection frequencies for transparency" ? — Machetes0602, Nov 04 '21 at 14:16
