I have run three regressions and they are all very statistically significant.
How do I choose which one is the best to use?
e.g. do I look for a high F-statistic, low p-values, etc.?
The general approach to model selection involves assessing the accuracy of a model when fit to previously unseen data. This is the rationale behind the use of training and test data sets: models are first fit to training data, and the model that produces the most accurate predictions when applied to the test data is then chosen as "best".
In order to evaluate the performance of a statistical learning method on a given data set, we need some way to measure how well its predictions actually match the observed data. That is, we need to quantify the extent to which the predicted response value for a given observation is close to the true response value for that observation. In the regression setting, the most commonly-used measure is the mean squared error (MSE), given by $$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_{i}-\hat{f}(x_{i}))^2$$ The MSE is computed using the training data that was used to fit the model, and so should more accurately be referred to as the training MSE. But in general, we do not really care how well the method works on the training data. Rather, we are interested in the accuracy of the predictions that we obtain when we apply our method to previously unseen test data.[1]
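A minimal sketch of the training-vs-test MSE distinction described above (using scikit-learn and synthetic data, both my own assumptions, not from the quoted text): fit a regression on a training split, then compute the MSE formula separately on the training and test portions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: a true linear signal plus noise (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

model = LinearRegression().fit(X_train, y_train)

def mse(y_true, y_pred):
    # MSE = (1/n) * sum (y_i - f_hat(x_i))^2, as in the formula above
    return float(np.mean((y_true - y_pred) ** 2))

train_mse = mse(y_train, model.predict(X_train))
test_mse = mse(y_test, model.predict(X_test))
print(f"training MSE: {train_mse:.3f}, test MSE: {test_mse:.3f}")
```

The point is that only `test_mse` is computed on data the model never saw, so it is the relevant quantity for choosing between models.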
Cross-validation can be employed to evaluate model performance and aid the model selection process:
Cross-validation in plain english
How to choose a predictive model after k-fold cross-validation?
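As a concrete illustration of the cross-validation idea from the links above, here is a sketch (assuming scikit-learn; the three candidate models are hypothetical stand-ins for your three regressions) that compares candidates by their mean held-out MSE across k folds:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a linear true relationship (illustrative only)
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(150, 1))
y = 1.5 * X[:, 0] + rng.normal(scale=0.5, size=150)

# Three hypothetical candidate regressions
candidates = {
    "linear": LinearRegression(),
    "quadratic": make_pipeline(PolynomialFeatures(2), LinearRegression()),
    "cubic": make_pipeline(PolynomialFeatures(3), LinearRegression()),
}

cv_mse = {}
for name, model in candidates.items():
    # 5-fold CV; scikit-learn reports negated MSE, so flip the sign
    scores = cross_val_score(model, X, y, cv=5,
                             scoring="neg_mean_squared_error")
    cv_mse[name] = -scores.mean()

best = min(cv_mse, key=cv_mse.get)
print(best, cv_mse)
```

Each model's score is an average over folds it was not trained on, so the comparison approximates test-set performance without needing a separate test set.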
Best to use for what purpose? If it's purely for prediction and you don't care about explanation, you can go with the one that has the lowest AIC, BIC, SBC or some similar score.
If it's for explanation, then go with the one that best advances the field you are researching.