
To evaluate and compare count models (e.g. Poisson regression), we can calculate scoring rules (e.g. the Brier score, the Dawid-Sebastiani score, etc.), which are explained here: Error metrics for cross-validating Poisson models.

Should we calculate these scores on the data used to estimate the models (training data), or on data the models have not seen before (validation data)? Does the former lead to choosing models that over-fit and generalize poorly? And is over-fitting necessarily a bad thing if we use the model only for inference?
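For concreteness, here is a minimal sketch with simulated (hypothetical) data of what I have in mind: a single Poisson GLM, with the Dawid-Sebastiani score computed both on the training data and on a held-out validation set.

```r
## Minimal sketch with simulated (hypothetical) data: one Poisson GLM,
## scored with the Dawid-Sebastiani score in-sample and on held-out data.
set.seed(1)

n   <- 500
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rpois(n, lambda = exp(0.3 + 0.5 * dat$x1))  # x2 is pure noise

idx   <- sample(n, size = 0.7 * n)
train <- dat[idx, ]
valid <- dat[-idx, ]

fit <- glm(y ~ x1 + x2, family = poisson, data = train)

## Dawid-Sebastiani score for predictive mean mu and variance sigma2
## (for a Poisson predictive distribution, sigma2 = mu); lower is better.
ds_score <- function(y, mu, sigma2) mean((y - mu)^2 / sigma2 + log(sigma2))

mu_train <- predict(fit, newdata = train, type = "response")
mu_valid <- predict(fit, newdata = valid, type = "response")

c(train = ds_score(train$y, mu_train, mu_train),
  valid = ds_score(valid$y, mu_valid, mu_valid))
```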

Fred

1 Answer


If you are using your model for inference, then you should either pre-specify the model completely (i.e., not use any model selection at all, whether based on a scoring rule, an information criterion, or anything else), or you need to correct your p values for the model selection step. Otherwise your p values will be biased low, because the selected model will contain the predictors that looked "useful" in this sample, so their coefficients will be biased towards significance.

If you do correct your p values, then you can use scoring rules in-sample for model selection. (However, I do not know of any literature exploring p value corrections for scoring-rule-based model selection.)

Of course, if you want to use your model for prediction, it makes more sense to evaluate your scoring rules on test data.
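
As a minimal sketch of that prediction case (simulated, hypothetical data), here two candidate Poisson specifications are compared by a scoring rule evaluated on held-out test data, here the logarithmic score:

```r
## Minimal sketch (simulated, hypothetical data): two candidate Poisson
## specifications compared by a scoring rule evaluated on held-out test data,
## here the logarithmic score (mean negative log-likelihood; lower is better).
set.seed(2)

n <- 400
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rpois(n, lambda = exp(0.2 + 0.7 * dat$x1))

idx   <- sample(n, size = 0.7 * n)
train <- dat[idx, ]
test  <- dat[-idx, ]

m1 <- glm(y ~ x1,      family = poisson, data = train)
m2 <- glm(y ~ x1 + x2, family = poisson, data = train)

log_score <- function(model, newdata) {
  mu <- predict(model, newdata = newdata, type = "response")
  -mean(dpois(newdata$y, lambda = mu, log = TRUE))
}

c(m1 = log_score(m1, test), m2 = log_score(m2, test))
```

Whichever specification achieves the lower out-of-sample score would be preferred for prediction.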

Stephan Kolassa
  • There is no theory for what I am modeling, and I had a large number of variables (around 35) to choose from for inclusion in the model. I used all-subset variable selection to minimize AICc; with 2^35 ≈ 34 × 10^9 subsets an exhaustive search was impractical, so I used a genetic algorithm through the R package _glmulti_. I am now using the final subset of explanatory variables with different model specifications (e.g. different random-effects structures, spatial vs. non-spatial, etc.). To choose among those models, I am trying to use scoring rules. Do you think I still need a p-value correction? – Fred May 20 '19 at 14:19
  • Indeed I think you do. Have you tried simulating data with zero predictor effects whatsoever and putting this through your workflow? Do this multiple times, with different RNG seeds. I strongly suspect you will get quite a number of "significant" effects. – Stephan Kolassa May 20 '19 at 14:32
  • Can you explain a little more please? What do you mean by "simulating data with zero predictor effects whatsoever and putting this through your workflow"? – Fred May 20 '19 at 14:38
  • Take your predictors. Simulate an outcome that has *nothing* to do with your predictors. (For instance, you could randomly permute your outcome.) Then there is no significant relationship, right? But since your workflow actively looks for significant relationships, it will find them, just because of the inherent variability. This will be spurious, and a result of the bias. – Stephan Kolassa May 20 '19 at 15:01
  • Okay, gotcha. That would be interesting to see. Do you know of a good reference for p-value correction for variable selection in general? – Fred May 20 '19 at 15:11
  • Unfortunately, no. [It's a difficult problem.](https://stats.stackexchange.com/a/20856/1352) A [search on CV](https://stats.stackexchange.com/search?q=p+value+correction+variable+selection) does not really help. ... – Stephan Kolassa May 20 '19 at 15:20
  • ... A permutation test might be your best bet: do the procedure [I outlined](https://stats.stackexchange.com/questions/409213/scoring-rules-for-count-models-on-training-data-vs-validation-data/409215?noredirect=1#comment764522_409215) 10,000 times and note the parameter estimates for *all* predictors (predictors that drop out in one run get a zero parameter). You now have a distribution of the parameter estimates under the null hypothesis of no relationship *which takes your model building procedure into account*. ... – Stephan Kolassa May 20 '19 at 15:22
  • ... Now run your procedure on your original dataset. All parameter estimates that fall in the tails of the permutation-based distributions are significant (see the code sketch after this thread). The drawback: you need to run your model selection algorithm 10,000 times, which may be impossible if it takes a couple of minutes. (This is trivially parallelizable, though.) There are a couple of relevant books by Good, *Resampling Methods* and *Permutation, Parametric, and Bootstrap Tests of Hypotheses*. – Stephan Kolassa May 20 '19 at 15:25
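
A minimal sketch of the permutation approach outlined in the comments, with simulated (hypothetical) data and `step()` standing in for the actual selection procedure (in the real workflow, the glmulti genetic-algorithm run would replace each `step()` call):

```r
## Minimal sketch (simulated, hypothetical data) of the permutation test
## described above. step() is only a stand-in for the actual selection
## procedure; in the real workflow the glmulti genetic-algorithm run would
## replace each step() call.
set.seed(3)

n <- 200; p <- 10
dat <- as.data.frame(matrix(rnorm(n * p), n, p))
names(dat) <- paste0("x", 1:p)
dat$y <- rpois(n, lambda = exp(0.2 + 0.6 * dat$x1))    # only x1 matters

vars <- paste0("x", 1:p)

## Null distribution: permute the outcome (destroying any real relationship)
## and rerun the whole selection procedure each time. 10,000 replications are
## recommended above; 200 here just keeps the sketch fast.
B <- 200
null_coefs <- matrix(0, nrow = B, ncol = p, dimnames = list(NULL, vars))
for (b in 1:B) {
  d <- dat
  d$y <- sample(d$y)                                   # break any true relationship
  fit <- step(glm(y ~ ., family = poisson, data = d), trace = 0)
  est <- coef(fit)[intersect(names(coef(fit)), vars)]
  null_coefs[b, names(est)] <- est                     # dropped predictors stay at 0
}

## Selection on the original data, then compare against the permutation tails.
fit0 <- step(glm(y ~ ., family = poisson, data = dat), trace = 0)
obs  <- setNames(numeric(p), vars)
est0 <- coef(fit0)[intersect(names(coef(fit0)), vars)]
obs[names(est0)] <- est0

## Two-sided permutation p values that account for the selection step itself.
p_perm <- sapply(vars, function(v) mean(abs(null_coefs[, v]) >= abs(obs[v])))
round(p_perm, 3)
```

With only 200 permutations this is just an illustration; as noted in the comments, something like 10,000 replications, run in parallel, would be needed for usable tail estimates.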