p value vs prediction error

Question

In many fields (medicine, for example), whether a variable is related to an outcome is judged by whether that variable's p-value in a regression model is significant.

For example:

> summary(glm.D93)

Call:
glm(formula = counts ~ outcome + treatment, family = poisson())

Deviance Residuals: 
       1         2         3         4         5         6         7         8         9  
-0.67125   0.96272  -0.16965  -0.21999  -0.95552   1.04939   0.84715  -0.09167  -0.96656  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  3.045e+00  1.709e-01  17.815   <2e-16 ***
outcome2    -4.543e-01  2.022e-01  -2.247   0.0246 *  
outcome3    -2.930e-01  1.927e-01  -1.520   0.1285    
treatment2   1.189e-15  2.000e-01   0.000   1.0000    
treatment3   8.438e-16  2.000e-01   0.000   1.0000    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 10.5814  on 8  degrees of freedom
Residual deviance:  5.1291  on 4  degrees of freedom
AIC: 56.761

Number of Fisher Scoring iterations: 4

>

Wouldn't it be better to try all possible combinations of variables instead and choose the model with the lowest prediction error? We could then assume that all the variables in the chosen model are relevant.

If this is not the case, can you explain why?

Our site contains extensive materials on this topic. You might like to start at http://stats.stackexchange.com/questions/128616/whats-a-real-world-example-of-overfitting, which provides actual examples of the problems that ensue from this approach. — whuber, Jan 13 '15 at 18:28
I am not sure that overfitting is relevant here. In fields like epidemiology — Donbeo, Jan 14 '15 at 02:19
The answers to your question might depend on precisely what you mean by "prediction error." How are you computing that? — whuber, Jan 14 '15 at 02:22
My idea is to get an estimate of the prediction error with cross-validation or the bootstrap — Donbeo, Jan 14 '15 at 03:10
So you are looking for a definition of prediction error yourself? The problem is that the prediction error we actually care about can only be estimated under a certain model, and that estimate is only as good as the model's match to reality, which we never know. Cross-validation or the bootstrap cannot overcome this. These procedures do not validate the variable selection itself; they only check that the model construction procedure did not depend too much on the particular (training) data. — Horst Grünbusch, Jan 14 '15 at 10:27
This is true. For example, we can never assume that a model using variables 1, 2, 3 would be better than a nonlinear model using variables 3, 4, 5. But the same problem holds for p-values, because they are used only in linear models. — Donbeo, Jan 14 '15 at 12:11

Horst Grünbusch · Answer 1 · 2015-01-13T23:01:04.497

You get the smallest prediction error if you include all variables. If you also include quadratic or cubic terms, it will get even smaller. If you generate random numbers for each observation and use them as an additional independent variable, you may reduce the prediction error even further. Just try it in R with your dataset.

The regression coefficients are always chosen such that they minimize the prediction error. So the more coefficients you have (due to more independent variables), the more possibilities you have to reduce the prediction error. As you have seen with adding random numbers, these possibilities don't need to have any relation to your data, yet they yield lower errors.
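The effect described above can be reproduced with a small simulation. The following is a minimal sketch in Python (standard library only) rather than R, using simulated data and not the dataset from the question; all function names and the data-generating model are assumptions made for illustration. It fits nested least-squares models with an increasing number of pure-noise columns and records the error on the fitting data and on a separate held-out sample.

```python
import random

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def ols(X, y):
    """Least-squares coefficients from the normal equations X'X b = X'y."""
    p = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)

def mse(X, y, beta):
    """Mean squared error of the linear predictor X beta against y."""
    preds = [sum(b * v for b, v in zip(beta, r)) for r in X]
    return sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y)

def make_data(n, n_noise, rng):
    """Assumed model: y = 2*x1 + noise; the extra columns are unrelated to y."""
    X, y = [], []
    for _ in range(n):
        x1 = rng.gauss(0, 1)
        X.append([1.0, x1] + [rng.gauss(0, 1) for _ in range(n_noise)])
        y.append(2.0 * x1 + rng.gauss(0, 1))
    return X, y

rng = random.Random(1)
X_train, y_train = make_data(30, 15, rng)
X_test, y_test = make_data(1000, 15, rng)

results = []  # (noise columns used, training MSE, held-out MSE)
for k in (0, 5, 10, 15):
    cols = 2 + k  # intercept + real predictor + first k noise columns
    Xtr = [r[:cols] for r in X_train]
    Xte = [r[:cols] for r in X_test]
    beta = ols(Xtr, y_train)
    results.append((k, mse(Xtr, y_train, beta), mse(Xte, y_test, beta)))

for k, tr, te in results:
    print(f"{k:2d} noise columns: training MSE {tr:.3f}, held-out MSE {te:.3f}")
```

Because the models are nested, the error on the fitting data cannot go up as columns are added. The held-out error has no such guarantee, which is the distinction raised in the comments below this answer.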

There are various ways to overcome this. First of all, don't include variables in your model if you have no idea how they could be related to the dependent variable. It is not the availability of data that should shape the model, but insight into the facts behind the data.

Secondly, consider using information criteria, like the AIC printed in your output, to find some balance between prediction error and number of variables. Choosing only significant variables can be dangerous, since you don't know the type II error rate (test result not significant although the variable makes a difference), so the probability of throwing away important variables can be quite high.
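For reference, the balance mentioned above is explicit in the usual definition of AIC, where k is the number of estimated coefficients and L-hat is the maximized likelihood. Assuming the AIC printed in the question follows this definition with k = 5 (the five coefficients in the table), it would split into a fit term and a penalty term as follows:

```latex
\[
\mathrm{AIC} = -2\ln\hat{L} + 2k
\]
\[
56.761 = -2\ln\hat{L} + 2\cdot 5
\quad\Longrightarrow\quad
-2\ln\hat{L} = 46.761
\]
```

Each extra coefficient can only lower the first term or leave it unchanged, and it always raises the second term by 2, so it pays off under this criterion only if it improves the fit term by more than 2.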

There are of course many more useful ways to judge your model selection process. The best way to choose variables or model selection procedures depends on the purpose of your regression analysis.

I do not agree. Including additional variables does not reduce the prediction error, only the training error. This is why I want to choose the model that minimizes the test error — Donbeo, Jan 14 '15 at 00:51
I think the term 'prediction error' is being used by the OP to mean the 'out-of-sample' prediction error. This does not necessarily get smaller if one includes all variables: it usually reaches a minimum at some number of variables, then increases as the number of variables continues to increase. Including all variables leads to overfitting and therefore to poor 'prediction error'. — ClarPaul, Mar 15 '17 at 01:18

score 1 · Answer 2 · answered Jan 13 '15 at 19:16

Your question is essentially about model selection.

When you are building a statistical model, you might not want to just consider the predictive ability of your model. Conventionally, the goodness of a statistical model is evaluated by the following three attributes.

Parsimony or Interpretability, i.e., the simplicity of your model. A parsimonious model is usually easier to interpret and has many other advantages.

Everything should be made as simple as possible, but no simpler. – Albert Einstein

Goodness-of-fit, i.e., how well your model fits the data at hand.
Generalizability, that is, the ability of the fitted model to describe or predict new unknown data.

Because of the above, many model selection criteria have been proposed, each addressing a different aspect of the model selection problem.

Above all, it should be pointed out that conducting variable selection solely based on the significance level (p value) of a variable can cause a lot of issues. The following is quoted from a report "Scientific method: Statistical errors" published in Nature. The paper addresses some serious problems in scientific research caused by the p-value criterion.

P values, the 'gold standard' of statistical validity, are not as reliable as many scientists assume. ...... Perhaps the worst fallacy is the kind of self-deception for which psychologist Uri Simonsohn of the University of Pennsylvania and his colleagues have popularized the term P-hacking; it is also known as data-dredging, snooping, fishing, significance-chasing and double-dipping. “P-hacking,” says Simonsohn, “is trying multiple things until you get the desired result” — even unconsciously. ...... “That finding seems to have been obtained through p-hacking, the authors dropped one of the conditions so that the overall p-value would be less than .05”, and “She is a p-hacker, she always monitors data while it is being collected.”

I am aware of the various model selection algorithms. In medicine, generally only a small set of variables is used, let's say 8. It is therefore possible to fit all possible combinations of variables and estimate their prediction performance. With linear models, which seem to be the standard in medicine, interpretability should be fine, so variables would be chosen by predictive power. — Donbeo, Jan 14 '15 at 00:58

score 1 · Answer 3 · answered Jun 26 '17 at 07:28

You are absolutely right that deciding what terms in a model are relevant by looking at p-values (or AIC or BIC) - or even worse if this is done iteratively by adding and removing terms using e.g. stepwise regression - is not a really good approach for getting the best performing model by almost any standard (and certainly not for out of sample prediction). Perhaps with a single model fitted (from which non-significant terms are not removed), a huge sample size compared to the number of model terms, some sensible multiplicity adjustment and no collinearity, one might look at this as indicating for which terms the evidence is the clearest that they influence the outcome (without necessarily ruling out that other terms also play a role). That's about the most charitable interpretation and may make some sense, if you are not really that interested in prediction.

These kinds of approaches seem to come from a traditional obsession with p-values and are still somewhat prevalent in some fields. However, even in medical applications, universities and companies that have good statistics departments/support typically no longer do that.

As you suggest, approaches like trying all models with some kind of cross-validation or bootstrapping to compensate for potential overfitting are an improvement on stepwise regression. However, there are many more attractive approaches that are (rightly) gaining in popularity, such as Bayesian model averaging, shrinkage priors (such as the horseshoe), LASSO (as an "older" idea), frequentist model averaging (e.g. using weights proportional to exp(-AIC/2)) and various other machine-learning-type approaches (e.g. random forests). Many of these have in common that the goal is not to pick one single model, but rather to get good out-of-sample predictions while taking into account the uncertainty about what the optimal model is. For example, in model averaging approaches all models will usually contribute somewhat in real-life examples, and situations where one single model is clearly better than every other model considered are rare.
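To make the frequentist model averaging idea concrete, here is a minimal sketch in Python (standard library only) of weights proportional to exp(-AIC/2). The AIC values and per-model predictions are hypothetical numbers chosen for illustration, not results from the question, and the function names are assumptions.

```python
import math

def akaike_weights(aics):
    """Weights proportional to exp(-AIC/2), normalised to sum to 1."""
    best = min(aics)
    # Subtracting the smallest AIC leaves the normalised weights unchanged
    # but avoids underflow in exp() when AIC values are large.
    raw = [math.exp(-(a - best) / 2.0) for a in aics]
    total = sum(raw)
    return [r / total for r in raw]

def average_prediction(predictions, weights):
    """Weighted average of the candidate models' predictions."""
    return sum(w * p for w, p in zip(weights, predictions))

# Hypothetical AIC values for three candidate models.
aics = [56.8, 57.9, 60.3]
weights = akaike_weights(aics)

# Hypothetical predictions from the same three models for one new observation.
preds = [21.0, 19.5, 20.2]
avg = average_prediction(preds, weights)

for a, w in zip(aics, weights):
    print(f"AIC {a:.1f} -> weight {w:.3f}")
print(f"model-averaged prediction: {avg:.2f}")
```

With AIC values this close together, no model receives all the weight, so every candidate contributes to the averaged prediction.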
