
This thread offers some good information on the calculation of linear regression prediction intervals and includes a link to some practical notes. At the end of the linked notes the author claims that, in the case of multiple linear regression, we should consider the joint region of the predictor variables.

How should MLR prediction intervals be interpreted in light of their joint region limitations?

Robert Kubrick

1 Answer


The consideration in question refers to whether your model, which is going to be an approximation to reality, can still be considered a good approximation for the specified values of the input variables. It's more of a warning about the potential for poor model quality in a region of the data space for which you may have seen little or no data, thus making it possible that the model performs poorly there but you wouldn't know it from the (nonexistent) data. However, it's not a hard-and-fast rule; domain knowledge is important in this assessment.

For example, models of wage growth vs productivity growth and % employment developed using data from a period with more-or-less full employment may be very poor predictors of wage growth, given a certain level of productivity growth and % employment, during a period with high unemployment. More simply, a linear approximation to $y = \sqrt{x}$ isn't bad when the range of $x$ is 1000 to 1001, but will produce very poor estimates when the input value of $x$ is, say, 500.
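The $\sqrt{x}$ example can be checked numerically. The following is only a minimal sketch: the number of training points, the centring constant `x_center`, and the helper name `linear_pred` are illustrative assumptions, not part of the original example.

```python
import numpy as np

# Fit a straight line to y = sqrt(x) using only x in [1000, 1001].
# Centring x (an illustrative choice) keeps the least-squares fit well conditioned.
x_center = 1000.5
x_train = np.linspace(1000.0, 1001.0, 50)
y_train = np.sqrt(x_train)
slope, intercept = np.polyfit(x_train - x_center, y_train, deg=1)

def linear_pred(x):
    """Prediction from the line fitted on [1000, 1001] (hypothetical helper)."""
    return slope * (x - x_center) + intercept

# Absolute error inside the range the line was fitted on.
err_inside = abs(linear_pred(1000.5) - np.sqrt(1000.5))
# Absolute error far outside that range.
err_outside = abs(linear_pred(500.0) - np.sqrt(500.0))

print(f"error at x = 1000.5: {err_inside:.2e}")
print(f"error at x = 500:    {err_outside:.2e}")
```

The error at $x = 500$ comes out several orders of magnitude larger than the error inside the fitted range, even though nothing in the fit on [1000, 1001] hints at a problem.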

To the point of your question - if your new data is in a region where it's not clear to you that your model is as good as its overall fit would indicate, the prediction intervals are likely smaller than they ought to be, and should be interpreted with caution. If your new data is in a region where you have grave doubts about the model, best not to make any predictions at all - or load them up with caveats about the model's potential for error if for some reason you have to. (These statements are, of course, rules of thumb, and my own opinions.)
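One common heuristic for deciding whether a new point lies inside the joint region of the predictors is to compare its leverage $h_0 = x_0^\top (X^\top X)^{-1} x_0$ with the largest leverage among the training points. This is a swapped-in technique, not something prescribed above or, as far as I can tell, by the linked notes. The sketch below uses simulated data; the variable names, the sample size, and the correlation structure are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training data: two strongly correlated predictors.
n = 200
x1 = rng.normal(0.0, 1.0, n)
x2 = 0.9 * x1 + rng.normal(0.0, 0.3, n)
X = np.column_stack([np.ones(n), x1, x2])  # design matrix with intercept

XtX_inv = np.linalg.inv(X.T @ X)

def leverage(x_new):
    """h0 = x0' (X'X)^{-1} x0 for a new point (x1, x2); hypothetical helper."""
    x0 = np.concatenate([[1.0], x_new])
    return float(x0 @ XtX_inv @ x0)

# Largest leverage among the training points themselves.
h_max = np.einsum("ij,jk,ik->i", X, XtX_inv, X).max()

# A point that follows the x1-x2 correlation seen in the data.
h_inside = leverage(np.array([1.0, 0.9]))
# Each coordinate is individually within its observed range, but the pair is not.
h_outside = leverage(np.array([1.5, -1.5]))

print(f"max training leverage: {h_max:.3f}")
print(f"leverage at (1.0, 0.9):  {h_inside:.3f}")
print(f"leverage at (1.5, -1.5): {h_outside:.3f}")
```

The standard prediction interval already widens with leverage, since its half-width is proportional to $\sqrt{1 + h_0}$, but that widening assumes the linear model still holds at $x_0$. A leverage well above `h_max` signals that the point lies outside the region the data covered, which is where that assumption cannot be checked.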

jbowman
  • I find this confusing because it seems to confound prediction error with "Type III" model mis-specification error. Possibly, my problem results from a different interpretation of the question: to me, the crux of the matter is that an MLR prediction interval (PI) combines two sources of variation: estimation error and prediction error (for an independent observation). The estimation error is expressed as a *multivariate* distribution of the parameters: exactly how is that distribution to be understood and how exactly is it related to the PI? – whuber Feb 09 '12 at 21:51
  • @whuber - when I clicked through to the link, it seemed to me, possibly erroneously, that the author was referring to the potential for model error outside of the region where data was observed. It also seemed to me that the version of the question which relates the multivariate distribution of the parameters to the prediction interval had been answered in the thread. I should probably have asked the OP for clarification of the question :) – jbowman Feb 09 '12 at 21:57
  • (+1) Ah, I didn't catch that meaning in the link: thanks for pointing it out. I took "joint region" to refer (perhaps) to a joint confidence region for the regression parameters, whereas in the link it appears to represent some region of applicability of the model. (Why it is drawn as an ellipse is a mystery to me, though, further adding to my confusion.) – whuber Feb 09 '12 at 22:03
  • @whuber Yes, my question, and my interpretation of the link, is that it's relative to the prediction region. – Robert Kubrick Feb 10 '12 at 15:28