
From what I understand, variable selection based on p-values (at least in a regression context) is highly flawed. It appears that variable selection based on AIC (or similar criteria) is also considered flawed by some, for similar reasons, although this seems a bit unclear (e.g. see my question and some links on this topic here: What exactly is "stepwise model selection"?).

But say you do go for one of these two methods to choose the best set of predictors in your model.

Burnham and Anderson 2002 (Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach, page 83) state that one should not mix variable selection based on AIC with that based on hypothesis testing: "Tests of null hypotheses and information-theoretic approaches should not be used together; they are very different analysis paradigms."

On the other hand, Zuur et al. 2009 (Mixed Effects Models and Extensions in Ecology with R, page 541) seem to advocate using AIC to first find the optimal model, and then performing "fine tuning" using hypothesis testing: "The disadvantage is that the AIC can be conservative, and you may need to apply some fine tuning (using hypothesis testing procedures from approach one) once the AIC has selected an optimal model."
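
For concreteness, here is a minimal sketch of what such a two-step procedure might look like in R. The data frame dat and the predictors x1-x5 are hypothetical, and this is not code from either book, just my reading of the suggestion:

    ## Hypothetical data frame `dat` with response y and candidate predictors x1..x5.
    full <- lm(y ~ x1 + x2 + x3 + x4 + x5, data = dat)

    ## Step 1: AIC-based selection.
    aic_model <- step(full, direction = "backward", trace = FALSE)

    ## Step 2: "fine tuning" with hypothesis tests, e.g. partial F-tests
    ## or t-tests on the terms retained by AIC.
    drop1(aic_model, test = "F")
    summary(aic_model)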

You can see how this leaves the reader of both books confused over which approach to follow.

1) Are these just different "camps" of statistical thinking and a topic of disagreement among statisticians? Is one of these approaches simply "outdated" now, but was considered appropriate at the time of writing? Or is one just plain wrong from the start?

2) Is there a scenario in which this approach would be appropriate? For example, I come from a biological background, where I am often trying to determine which variables, if any, seem to affect or drive my response. I often have a number of candidate explanatory variables and am trying to find which are "important" (in relative terms). Also, note that the set of candidate predictors has already been reduced to those considered to have some biological relevance, but this may still include 5-20 candidate predictors.

Tilen
  • I wonder what Zuur's statistical argument would be for fine tuning with hypothesis testing after AIC selection. It does not seem like a coherent strategy of model building. But I do not know enough about those things. – Richard Hardy Mar 15 '17 at 18:19
  • My hunch is that Zuur et al.'s suggestion is bad (why would you ever use significance tests for model selection?), although I'm not sure Burnham and Anderson's statement is correct, either. It's a good question, but I would have to read more deeply into the technical details than I have so far in order to answer it. – Kodiologist Mar 16 '17 at 00:18
  • I've used both methods in models to predict panel sales. AIC-based backward stepwise regression seemed to give better results in my experience. – Souptik Dhar Jun 24 '18 at 15:57
  • @SouptikDhar, when you say "better" results, in which way exactly do you mean? – Tilen Jun 27 '18 at 09:48
  • Maybe the answer depends on the objective of the analysis? In an observational study, it could be desirable to find the most parsimonious model given the dataset, thus relying on "variable selection based on AIC", for example. However, if the aim is to put a hypothesis to the test, then the model, being a translation of the hypothesis in terms of adequate proxies for the variables of interest, is already specified from the beginning, so there is no room for variable selection IMHO? – Rodolphe Sep 05 '18 at 22:21

2 Answers


A short answer.

The approach of doing data-driven model selection or tuning, then using standard inferential methods on the selected/tuned model (à la Zuur et al., and many other respected ecologists such as Crawley), will always give overoptimistic results: overly narrow confidence intervals (poor coverage), overly small p-values (high type I error). This is because standard inferential methods assume the model is specified a priori; they don't take the model tuning process into account.
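
A small simulation illustrates the point (a sketch to make the claim concrete, not taken from any of the references above): generate pure-noise predictors, let stepwise AIC pick a model, then read off the usual p-values from the selected model. The nominal 5% type I error rate is badly exceeded.

    set.seed(1)
    n_sim <- 200; n <- 100; p <- 10
    false_pos <- replicate(n_sim, {
      X <- as.data.frame(matrix(rnorm(n * p), n, p))  # pure-noise predictors
      X$y <- rnorm(n)                                  # response unrelated to all of them
      sel <- step(lm(y ~ ., data = X), trace = FALSE)  # AIC-based selection
      pv  <- coef(summary(sel))[-1, "Pr(>|t|)"]        # naive p-values, intercept dropped
      length(pv) > 0 && any(pv < 0.05)                 # any "significant" noise variable?
    })
    mean(false_pos)  # typically far above the nominal 0.05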

This is why researchers like Frank Harrell (Regression Modeling Strategies) strongly disapprove of data-driven selection techniques like stepwise regression, and caution that one must do any reduction of the model complexity ("dimension reduction", e.g. computing a PCA of the predictor variables and selecting the first few PCA axes as predictors) by looking only at the predictor variables.
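
As a sketch of that kind of predictor-only dimension reduction (the data frame dat with response y is hypothetical; the point is that y is never consulted while the predictors are being reduced):

    pred <- dat[, c("x1", "x2", "x3", "x4", "x5")]     # hypothetical predictor columns
    pca  <- prcomp(pred, center = TRUE, scale. = TRUE) # unsupervised: y plays no role here
    summary(pca)                                       # variance explained per axis

    ## Use the first few principal component scores as the regression inputs.
    dat_pc <- data.frame(y = dat$y, pca$x[, 1:2])
    fit <- lm(y ~ PC1 + PC2, data = dat_pc)
    summary(fit)  # ordinary inference is defensible: the response was not used in the reduction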

If you are interested only in finding the best predictive model (and aren't interested in any kind of reliable estimate of the uncertainty of your prediction, which falls in the realm of inference!), then data-driven model tuning is fine (although stepwise selection is rarely the best available option); machine learning/statistical learning algorithms do a lot of tuning to try to get the best predictive model. The "test" or "out-of-sample" error must be assessed on a separate, held-out sample, or any tuning methods need to be built into a cross-validation procedure.
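
A sketch of that last point, assuming a hypothetical data frame dat with response y (any selection or tuning happens inside the training portion only):

    ## Split once into training and held-out test data.
    set.seed(2)
    idx   <- sample(seq_len(nrow(dat)), size = floor(0.7 * nrow(dat)))
    train <- dat[idx, ]; test <- dat[-idx, ]

    ## All tuning (here, AIC-based selection) uses the training data only ...
    sel <- step(lm(y ~ ., data = train), trace = FALSE)

    ## ... and predictive performance is judged on the untouched test set.
    sqrt(mean((test$y - predict(sel, newdata = test))^2))  # out-of-sample RMSE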

There does seem to have been a historical evolution of opinion on this topic; many classic statistical textbooks, especially those focusing on regression, present stepwise approaches followed by standard inferential procedures without taking the effects of model selection into account [citation needed ...]

There are many ways to quantify variable importance, and not all fall into the post-variable-selection trap.

  • Burnham and Anderson recommend summing AIC weights; there's quite a bit of disagreement over this approach.
  • You could fit the full model (with appropriately scaled/unitless predictors) and rank the predictors by estimated magnitude [biological effect size] or Z-score ["clarity"/statistical effect size] (see the sketch after this list).
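
A sketch of the second option (the data frame dat, response y, and the character vector predictor_names of numeric candidate predictors are all hypothetical):

    ## Fit the full model with standardized predictors, then rank the terms.
    dat_std <- dat
    dat_std[predictor_names] <- scale(dat_std[predictor_names])
    fit <- lm(y ~ ., data = dat_std)

    tab <- coef(summary(fit))[-1, ]                            # drop the intercept
    tab[order(abs(tab[, "Estimate"]), decreasing = TRUE), ]    # rank by standardized effect size
    tab[order(abs(tab[, "t value"]), decreasing = TRUE), ]     # or by Z/t ("clarity")
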
Ben Bolker

I come from a biological background and am a biostatistician working in a university hospital. I have read a lot on this topic, especially recently, including Harrell's opinions around the web and his book Regression Modeling Strategies. I will not reference him further here, but speak from experience:

First, it depends on the problem at hand which methods can even be used. Data are often highly correlated, so that no reasonable or reproducible "predictive" model can be found.

Second is to take a rational approach, so that your covariates/confounders are interpretable and meaningful for explaining your response variable, based on scientific experience.

Third is to account for interactions (which can represent non-linearities); these can be crucial and often undermine any of the modern variable selection approaches.

Only fourth comes the actual method chosen. In my case, hospital data often have on the order of 10^3 patients and 10^1 events (e.g. deaths), analysed with binomial logistic or semi-parametric Cox regression. I compared backward stepwise AIC, Lasso, Ridge (no true variable selection) and Elastic Net regression against each other by AUC, and I admit: the methods vary with the question asked, but stepwise AIC handles the issue as well as Lasso and Elastic Net do. Doctors would often report the AIC model as the more reasonable one.
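
For illustration, a rough sketch of that kind of comparison for a binary outcome, using the glmnet and pROC packages. The data frame dat and 0/1 outcome event are hypothetical, and the AUCs below are in-sample and therefore optimistic; a proper comparison needs a held-out test set or nested cross-validation, as the comments below point out.

    library(glmnet)
    library(pROC)

    x <- model.matrix(event ~ . - 1, data = dat)  # predictor matrix (glmnet adds its own intercept)
    y <- dat$event                                # 0/1 outcome

    ## Backward stepwise AIC on a logistic model.
    step_fit <- step(glm(event ~ ., data = dat, family = binomial), trace = FALSE)

    ## Penalized alternatives: lasso (alpha = 1), ridge (alpha = 0), elastic net (alpha = 0.5).
    lasso <- cv.glmnet(x, y, family = "binomial", alpha = 1)
    ridge <- cv.glmnet(x, y, family = "binomial", alpha = 0)
    enet  <- cv.glmnet(x, y, family = "binomial", alpha = 0.5)

    ## In-sample AUCs (optimistic; shown only to mirror the comparison described above).
    auc(roc(y, predict(step_fit, type = "response")))
    auc(roc(y, as.vector(predict(lasso, newx = x, s = "lambda.min", type = "response"))))
    auc(roc(y, as.vector(predict(ridge, newx = x, s = "lambda.min", type = "response"))))
    auc(roc(y, as.vector(predict(enet,  newx = x, s = "lambda.min", type = "response"))))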

(Sorry, my first answer was typed very quickly.)

As for the edit: I also tried eigenvalue ranking by PCA, which gave the most promising and reliable results for a general problem, but was still very far from the initial (and "reasonable") guesses.

Nuke
  • (-1). This answer seems rather anecdotal. Do you have any empirical evidence supporting your claim that stepwise AIC performs as well, let alone better than regularization? You seem to contradict everyone else on this. – Frans Rodenburg Feb 09 '21 at 07:27
  • It seems rather anecdotal to just cite "everyone else", too. I do not have empirical evidence, as said, and when asking about model selection there might be no "true" empirical model at all. However, working with glmnet, and in particular with the parameter nfolds, one can find that the estimates can vary strongly with the number of bootstraps / cross-validation folds. This indicates that for some case-control settings (as described above) no safe bias-variance tradeoff can be achieved by methods like ridge and lasso. This is not generally true but can be assessed for each hypothesis. Thank you – Nuke Feb 09 '21 at 10:36
  • With thousands of variables and tens of observations, how can you hope for *either* LASSO or best-subset to be stable? You simply don't have enough data to choose any sensible number of variables. You could use ridge regression and leave all the variables in, make some preselection based on expert knowledge, or perform dimension reduction without using the outcome variable, but there is no way that such a small number of observations can give you sufficient information on which variables to choose. Did you perhaps evaluate the AUC on the same data used to fit the model? – Frans Rodenburg Feb 09 '21 at 12:04
  • Dear Frans, a reasonable and elaborate answer. I would counter that gene expression profiles do handle this kind of setting, but I have edited my answer to be clearer. The second part is in fact true: I did not do a test-training comparison as should be done. One cannot hope to gain a better AUC (nor is it meaningful) even when optimising glmnet by AUC (instead of deviance); however, using all 4 steps above you can compare your results with those of the scientists, and at no gain in AUC I found that the professionals' notion correlated more often with the models based on AIC. – Nuke Feb 09 '21 at 13:32
  • These findings are not reported, nor reproducible, but a mere reflection of daily work with these techniques, meant to stick a warning label on all variable selection methods, including the "new" ones based on regularisation. For the time being I have switched to univariate effect-size differences and multivariate PCA, and I do not see much improved correlation with the scientists' notion either. Anyway, thanks for the reminder to use more precise language. – Nuke Feb 09 '21 at 13:33