I'm learning about logistic regression and have come across an interesting idea. Say I am given a dataset such as the Default data set from the ISLR2 package in R. I then fit a model using all predictors and call summary() on the model to see the p-values of the predictors. Is it then valid to include or exclude certain predictors from my next model based on their significance in the first model?
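For concreteness, here is a minimal sketch of that workflow in R, assuming the ISLR2 package is installed (its data set is named Default):

```r
# Fit a logistic regression of default on all other columns of ISLR2::Default
# and inspect the coefficient p-values, as described in the question.
library(ISLR2)

fit_full <- glm(default ~ ., data = Default, family = binomial)
summary(fit_full)   # z-tests and p-values for student, balance, income
```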

- You deserve a real answer that I cannot give right now, but this is problematic. For example, what happens when you get “insignificant” predictors in your next model? – Dave Dec 12 '21 at 22:11
- Please see [this page](https://stats.stackexchange.com/q/20836/28500) for reasons why this type of thing is not a good idea. – EdM Dec 12 '21 at 22:43
- I would change the headline to remove "logistic" from it, because your question is equally relevant to any other multivariate regression – Aksakal Dec 13 '21 at 14:25
2 Answers
Interesting question, because the theory says you shouldn't be doing this, yet in practice many people do. For instance, in some fields the people who assess the use of models push for all predictors to have significant p-values. Although that requirement doesn't technically mandate the procedure you suggested, it often leads to the same effect, or even the same procedure, of eliminating non-significant predictors.
Note also that I'm not talking about techniques such as stepwise selection in SAS PROC REG, where the overall model p-value, not the individual t-test p-values, is used to eliminate variables. Those are a different kind of algorithmic approach.
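For readers working in R rather than SAS, a rough analogue of that kind of algorithmic approach (not what this answer recommends, and not the SAS procedure itself) is step(), which adds or drops terms using a whole-model criterion, AIC, rather than individual coefficient p-values. The sketch below uses the ISLR2::Default data purely for illustration:

```r
# Backward elimination driven by a whole-model criterion (AIC),
# not by individual coefficient p-values.
library(ISLR2)

fit_full <- glm(default ~ student + balance + income,
                data = Default, family = binomial)
step(fit_full, direction = "backward")
```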
In theory, a p-value is a random number under the null hypothesis. In plain English, if your assumptions about the model are correct, then the p-values are uniform random variables: they change from sample to sample. Moreover, although you are looking at an individual predictor, the p-values come from the multivariable model, where the joint distribution of the estimates is involved. So you should really be worried about the overall suitability of the model, not its single components.
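A quick simulation can make this concrete: with a pure-noise predictor (so the null is true), the p-value is approximately uniform on [0, 1] and varies from sample to sample. The sample size and number of replications below are arbitrary choices for illustration.

```r
# Under a true null, the p-value of an irrelevant predictor is roughly
# uniform, so "significance" comes and goes purely by chance.
set.seed(1)
pvals <- replicate(2000, {
  y <- rbinom(200, 1, 0.3)   # response unrelated to x
  x <- rnorm(200)            # pure noise predictor
  coef(summary(glm(y ~ x, family = binomial)))["x", "Pr(>|z|)"]
})
hist(pvals, breaks = 20)     # approximately flat
mean(pvals < 0.05)           # roughly 5% "significant" by chance
```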
One extreme example that highlights the problem with your approach is multicollinearity. Consider a model where two predictors are highly correlated; this is very common in financial applications, e.g. stock returns and their volatility as predictors. When you include both in the same model, their standard errors are inflated and the individual p-values become unreliable: either or both can appear insignificant even if one of them genuinely drives the response, and which one looks "better" can flip from sample to sample. That is a sign to be suspicious of the results and to assess the multicollinearity issue, not a basis for keeping or dropping predictors according to their p-values.
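A small simulated example of this point, with made-up numbers: only x1 drives the response, but x2 is nearly a copy of x1, and in the joint fit the individual p-values stop being a reliable guide for selection.

```r
# Two highly correlated predictors; only x1 actually drives the response.
set.seed(2)
n  <- 300
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)            # corr(x1, x2) is about 0.99
y  <- rbinom(n, 1, plogis(-0.5 + 0.6 * x1))

summary(glm(y ~ x1,      family = binomial))  # x1 clearly significant alone
summary(glm(y ~ x1 + x2, family = binomial))  # standard errors blow up;
                                              # individual p-values unreliable
```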
In practice, people do all kinds of things that don't follow the theory. Sometimes this is due to a lack of knowledge, but also because "in theory" itself rests on assumptions, null hypotheses, and so on; it's not a given that the theory applies in your particular case.

"All models are wrong but some are useful." The aim in any model building endeavor is to build a convenient yet perhaps imperfect representation of the data generative process that is useful. Even if there is a "right" model no person or metric can tell you what it is without error.
Before examining any multivariable model I strongly suggest investigating each covariate separately in univariable models. Your question appears to deal specifically with two multivariable models, so I will focus on this idea.
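As one possible way to do that univariable screening in R, here is a sketch using ISLR2::Default; the variable names are specific to that data set.

```r
# Univariable screening: one logistic model per covariate,
# printing the coefficient table for each.
library(ISLR2)

for (v in c("student", "balance", "income")) {
  fit <- glm(reformulate(v, response = "default"),
             data = Default, family = binomial)
  cat("\n--", v, "--\n")
  print(coef(summary(fit)))
}
```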
If both models are fit to the same data from a single experiment and the "next model" is a reduced/parsimonious version of the first model based on p-values, then this is a perfectly good idea. It is imperative that you also look at the magnitude of the estimated association, i.e. the size of the parameter estimate. There may very well be an association in the population that you are unable to detect in your sample using a p-value significance threshold. Your aim here is to identify a model that fits your data well and, by extension, fits the population well. Another metric you could use is AIC, which is based on the likelihood. This will highlight the covariates that best fit your data, but may not result in all of the significance tests yielding p-values below a certain threshold. However you identify a well-fitting model, it is expected to do well when fit to the data from a repeated experiment, though this is not a guarantee. Many people will take issue with this model-building approach, citing multiple comparisons and increased family-wise error rates, or claiming that "you can't do that with p-values." Here we are only concerned with estimated effect sizes and per-comparison error rates, using p-values as evidence and as tools for identifying associations, so "yes you can." Any good metric for model building will be based on the evidence contained in the likelihood, and both the p-value and AIC represent this evidence.
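Here is a hedged sketch of such a comparison on the Default data: a full model versus a reduced model, compared by AIC and a likelihood-ratio test. Which term to drop (income here) is only an illustrative assumption, not a recommendation.

```r
# Compare a full and a reduced logistic model fit to the same data.
library(ISLR2)

fit_full    <- glm(default ~ student + balance + income,
                   data = Default, family = binomial)
fit_reduced <- glm(default ~ student + balance,
                   data = Default, family = binomial)

AIC(fit_full, fit_reduced)                    # lower AIC is preferred
anova(fit_reduced, fit_full, test = "Chisq")  # likelihood-ratio test
```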
If your "next model" concerns new observations from the same data-generating process, and you are pressure testing the model fitted in the first experiment by conducting a second experiment, then examining p-values is a perfectly good idea here too. This happens all the time in scientific endeavors. If repeated experiments manage to produce similar results using the same model, we would feel that much more confident in the relationships/associations identified. If repeated experiments fail to produce similar results using the same model, we would feel that much more confident that the relationships initially identified may not reflect the target population and were simply a result of random sampling.
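As an illustration of this idea, the sketch below splits the Default data in half to stand in for two experiments from the same data-generating process (an assumption made purely for illustration) and compares the fitted coefficients and p-values:

```r
# Split Default in half to mimic a first and a second experiment.
library(ISLR2)
set.seed(3)

idx    <- sample(nrow(Default), nrow(Default) / 2)
first  <- Default[idx, ]
second <- Default[-idx, ]

fit1 <- glm(default ~ student + balance, data = first,  family = binomial)
fit2 <- glm(default ~ student + balance, data = second, family = binomial)

# Do the estimates and p-values roughly replicate across "experiments"?
round(cbind(est_first  = coef(fit1),
            est_second = coef(fit2),
            p_first    = coef(summary(fit1))[, "Pr(>|z|)"],
            p_second   = coef(summary(fit2))[, "Pr(>|z|)"]), 4)
```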
Some view this process as leading to the so-called "replication/reproducibility" crisis, where subsequent experiments do not yield significant results similar to the first experiment. This isn't really a crisis if we remain vigilant and remember that no hypothesis is proven false by a single small p-value, nor is it proven true by a large one. All we can do is report the weight of the available evidence.
It is also important to understand the nature of your data. Is it sampled in a way that is representative of the target population, or is there sampling bias? If there is an intervention of interest, is there selection bias or were the interventions randomly assigned? This will impact the claims you are making based on your model, i.e. association and causation. Here is a related thread.

- No, the procedure is too unreliable for the first step. The data simply do not possess the information needed to make these discernments. The probability of choosing the right variables is almost zero. – Frank Harrell Dec 13 '21 at 15:09
- I'm thinking of searching for associations. We don't have to find all possible associations and any associations we do find would be tentative conclusions. Are we not supposed to look for associations and only divine them without observation? – Geoffrey Johnson Dec 13 '21 at 15:32
- I much prefer to have a model pre-specified based on scientific interest, but sometimes that scientific interest comes from identifying associations in observed data. – Geoffrey Johnson Dec 13 '21 at 15:35
- Interpreting type I error rates from model building definitely requires a retrospective interpretation: had the discovered model been pre-specified, then the type I error rate for each hypothesis investigated would have been controlled at level $\alpha$ and each hypothesis rejected. Of course the model wasn't pre-specified, so all type I error guarantees are gone, but we can still interpret the p-values as ex-post sampling probabilities that point to areas of further scientific inquiry. – Geoffrey Johnson Dec 13 '21 at 15:43
- A more common approach is to look at the F-test p-value or other overall model metrics such as AIC to eliminate variables in an algorithmic manner. SAS stepwise regression is probably the most commonly used of such techniques. – Aksakal Dec 13 '21 at 16:09
- All these methods are highly problematic, and @GeoffreyJohnson p-values can't do what you are asking of them and type I probabilities are not that relevant anyway. To meet the goals you've listed, bootstrap the ranks for the variable importances (e.g. partial $R^2$). You'll find the confidence intervals are too wide to allow one to make firm choices about much of anything. – Frank Harrell Dec 13 '21 at 18:15
- In addition to the remarks made by @FrankHarrell, please note that excluding variables based on p-values alone risks eliminating entire groups of correlated variables that could be of the most predictive value, but whose p-values are inflated due to that correlation. In particular, there is no basis to interpret the p-values as "ex-post sampling probabilities"; such a characterization would appear to misrepresent what p-values are. For this reason alone, the procedure described in this answer must be considered suspect. – whuber Dec 13 '21 at 18:55
- @whuber and Frank Harrell, I suggested in other threads the examination of effect sizes as well as univariable models for the reasons you suggest, yet was lambasted there too. My description above does not preclude such examination. The characterization of the p-value as an ex-post sampling probability is by its very definition. If the p-value is not an ex-post sampling probability, what do you propose it is? – Geoffrey Johnson Dec 13 '21 at 21:12
- By definition, a p-value is a supremum of chances associated with the data under an assumed null hypothesis. Perhaps that's what you mean by "ex-post sampling probability," but it's hard to reconcile your terminology with that concept. – whuber Dec 13 '21 at 22:22
- Yes, by sampling probability I mean "chances." By ex-post I mean not dealing with forecasts. – Geoffrey Johnson Dec 13 '21 at 23:50
- The use of p-values and AIC in the manner described above is highly problematic. – Frank Harrell Dec 15 '21 at 13:11