
Suppose I want to find a linear model with Gaussian error for a given data set. (The data set contains insurance claims and the end goal is to predict claim cost from claim features.) Also, suppose that I use some sort of stepwise model selection method, let's say based on minimizing AIC, to select the variables to include in the model. The main focus is on predictions and interpretation of the coefficients would be a plus.

My question is: when do I check that the model assumptions are valid? Before/after, or at each step of the stepwise procedure?

gung - Reinstate Monica
lalessandro
  • Stepwise selection is not recommended, especially for interpretation. What kind of data do you have and what is the end goal? – user2974951 Sep 16 '19 at 06:17
  • @user2974951 the data set contains insurance claims and the end goal is to predict claim cost from claim features. – lalessandro Sep 16 '19 at 06:24
  • Then why are you doing stepwise selection? If your end goal is prediction you may be better off including all the variables, unless you have an enormous amount of them. – user2974951 Sep 16 '19 at 12:18
  • Could you expand a bit on why this is the case? – lalessandro Sep 16 '19 at 16:09

1 Answer


The basic assumptions of a linear model (independence, homoscedasticity, and normality of errors; also see here) primarily bear on whether inferences about the parameter values are valid (i.e., whether the p-values are right). For example, ordinary least squares estimates are unbiased even if the errors are not normally distributed, or are merely uncorrelated rather than fully independent. Using stepwise selection routines guarantees that your inferences are not valid anyway, however, so it is hard to see what checking the assumptions would buy you.
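The unbiasedness point can be checked with a small simulation. This is a minimal numpy sketch (the sample sizes, seed, and true coefficient are invented for illustration): the errors are drawn from a skewed, decidedly non-normal distribution, yet the OLS slope estimates still average out to the true value.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps, beta = 200, 2000, 1.5  # illustrative values, not from the thread

slopes = []
for _ in range(reps):
    x = rng.normal(size=n)
    # heavily skewed, mean-zero errors: centered exponential, not Gaussian
    e = rng.exponential(scale=1.0, size=n) - 1.0
    y = 2.0 + beta * x + e
    X = np.column_stack([np.ones(n), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    slopes.append(coef[1])

# The average estimated slope is very close to the true beta = 1.5,
# even though the normality assumption is badly violated.
print(np.mean(slopes))
```

Note that this says nothing about the p-values attached to those estimates, which is exactly the distinction the answer is drawing.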

These assumptions also have little to do with the out-of-sample predictive accuracy of your model, which you state is what matters most to you. Predictive accuracy should be assessed by other means, such as cross-validation. I should note here that stepwise methods are typically quite poor for predictive modeling as well (see, e.g., my answer here: Algorithms for automatic model selection).
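As a sketch of what "assess by cross-validation" means in practice, here is a minimal k-fold implementation in plain numpy. The toy data, the seed, and the `kfold_mse` helper are all invented stand-ins for the actual claims data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data standing in for claim features and costs (purely illustrative).
n, p = 300, 5
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, -0.3]) + rng.normal(size=n)

def kfold_mse(X, y, k=5):
    """Estimate out-of-sample MSE of an OLS fit via k-fold cross-validation."""
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        Xtr = np.column_stack([np.ones(len(train)), X[train]])
        Xte = np.column_stack([np.ones(len(test)), X[test]])
        coef, *_ = np.linalg.lstsq(Xtr, y[train], rcond=None)
        errors.append(np.mean((y[test] - Xte @ coef) ** 2))
    return float(np.mean(errors))

# Held-out MSE estimate; this is the quantity to compare across candidate models.
print(kfold_mse(X, y))
```

The point is that model comparison for prediction is done on held-out error like this, not by inspecting residual diagnostics.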

gung - Reinstate Monica
  • In addition, a linear model with Gaussian errors is unlikely to be a good way to predict claims payouts, since claims can't be negative & will be zero-inflated. – gung - Reinstate Monica Sep 16 '19 at 20:24
  • Sure, I understand now. A couple of follow-up questions: 1) does this mean that when looking for a prediction-focused model, all we need to check is that the estimated out-of-sample prediction error is low? 2) could you give an example of two models, one of which has assumptions validated by the data but poor predictive power, and one which doesn't have assumptions validated by the data but has good predictive power? – lalessandro Sep 17 '19 at 06:11
  • @alessandro, you should probably ask those as new questions. (1) You may do as you like, but if you are building a model for prediction, it seems to me you would care how well it predicts, not about other stuff. (2) It isn't that having valid assumptions & good predictive ability are mutually exclusive, or even orthogonal. Nonetheless, they aren't the same thing & ultimately you need to decide what you want to optimize: a property you care about, or one you don't (cf, [When will a less true model predict better than a truer model?](https://stats.stackexchange.com/q/22566/7290)). – gung - Reinstate Monica Sep 17 '19 at 17:54
  • Again, the point is that the assumptions are needed for valid inference, but if you are using stepwise selection, your inferences are not valid either way. If you feel like checking them just because you have some time to kill, go ahead. – gung - Reinstate Monica Sep 17 '19 at 17:55