
Let's take the following example:

    set.seed(342)
    x1 <- runif(100)
    x2 <- runif(100)
    y <- x1 + x2 + 2*x1*x2 + rnorm(100)
    fit <- lm(y ~ x1*x2)

This creates a model of y based on x1 and x2, using an OLS regression. If we wish to predict y for a given x_vec, we could simply plug the coefficients from summary(fit) into the fitted equation.

However, what if we want to predict the lower and upper predictions of y? (for a given confidence level).

How then would we build the formula?

Tal Galili
  • The *Confidence Interval on New Observations* section of [this page](https://web.archive.org/web/20110719124646/http://www.weibull.com/DOEWeb/confidence_intervals_in_multiple_linear_regression.htm) may help. – GaBorgulya Apr 03 '11 at 18:32
  • @Tal Sorry, but it's not really clear to me what you actually mean by "predict the lower and upper predictions of y". Does it have something to do with prediction or tolerance bands? – chl Apr 03 '11 at 19:43
  • @Tal - a couple of queries. When you say ".. y based on x1 and x2, using a OLS regression.", you mean you create a linear model and _estimate parameters using OLS_. Am I right? and @chl's question -- do you want to predict the lower and upper bounds for the prediction interval? – suncoolsu Apr 03 '11 at 19:55
  • @chl, sorry for not being more clear. I am looking for two formulas that will give an interval that will "catch" the "real" value of y 95% of the time. I feel like I'm using the definitions for the CI for the mean, when there is probably some other term I should be using, sorry about that... – Tal Galili Apr 03 '11 at 20:02
  • @suncoolsu - yes and yes. – Tal Galili Apr 03 '11 at 20:03
  • Then you need the formula I suggested above. I will re-post it as an answer. – GaBorgulya Apr 09 '11 at 14:52
  • See the formula in the *Confidence Interval on New Observations* section (prediction interval) of http://www.weibull.com/DOEWeb/confidence_intervals_in_multiple_linear_regression.htm [(permalink)](http://www.webcitation.org/5xonEjGdb). – GaBorgulya Apr 09 '11 at 14:56

3 Answers


You will need matrix arithmetic. I'm not sure how Excel will go with that. Anyway, here are the details.

Suppose your regression is written as $\mathbf{y} = \mathbf{X}\mathbf{\beta} + \mathbf{e}$.

Let $\mathbf{X}^*$ be a row vector containing the values of the predictors for the forecasts (in the same format as $\mathbf{X}$). Then the forecast is given by $$ \hat{y} = \mathbf{X}^*\hat{\mathbf{\beta}} = \mathbf{X}^*(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} $$ with an associated variance $$ \sigma^2 \left[1 + \mathbf{X}^* (\mathbf{X}'\mathbf{X})^{-1} (\mathbf{X}^*)'\right]. $$ Then a 95% prediction interval can be calculated (assuming normally distributed errors) as $$ \hat{y} \pm 1.96 \hat{\sigma} \sqrt{1 + \mathbf{X}^* (\mathbf{X}'\mathbf{X})^{-1} (\mathbf{X}^*)'}. $$ This takes account of the uncertainty due to the error term $e$ and the uncertainty in the coefficient estimates. However, it ignores any errors in $\mathbf{X}^*$. So if the future values of the predictors are uncertain, then the prediction interval calculated using this expression will be too narrow.
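A minimal sketch of this formula in R, checked against `predict()` (using the model from the question; the new point x1 = x2 = 0.5 is an arbitrary choice for illustration):

```r
set.seed(342)
x1 <- runif(100)
x2 <- runif(100)
y  <- x1 + x2 + 2*x1*x2 + rnorm(100)
fit <- lm(y ~ x1*x2)

X     <- model.matrix(fit)                 # design matrix X
Xstar <- c(1, 0.5, 0.5, 0.5*0.5)           # new row: (Intercept), x1, x2, x1:x2
yhat  <- drop(Xstar %*% coef(fit))         # point forecast X* beta-hat
s     <- summary(fit)$sigma                # estimate of sigma
se    <- s * sqrt(drop(1 + t(Xstar) %*% solve(crossprod(X)) %*% Xstar))
tcrit <- qt(0.975, df = fit$df.residual)   # exact t quantile in place of 1.96
c(fit = yhat, lwr = yhat - tcrit * se, upr = yhat + tcrit * se)

## Agrees with the built-in:
predict(fit, newdata = data.frame(x1 = 0.5, x2 = 0.5),
        interval = "prediction", level = 0.95)
```

(Using the t quantile rather than 1.96 matters in small samples; they coincide as $n \to \infty$.)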

Rob Hyndman
  • +1, excellent answer. I should note, though, that a regression model always estimates the conditional expectation, so it is only as good as its regressors. So the last comment, although very good, is not strictly necessary, since if you build a regression model you must trust the regressors. – mpiktas Apr 04 '11 at 03:51
  • why does the 1 come up in the formula? We have $\hat{y}=X^*\beta+X^*(X'X)^{-1}X'e$. Then $\operatorname{Var}(\hat{y})=\operatorname{Var}\left(X^*(X'X)^{-1}X'e\right)=\sigma^2X^*(X'X)^{-1}(X^*)'$? – mpiktas Apr 07 '11 at 14:25
  • The 1 is for prediction intervals. Leave it off for confidence intervals. Var($\hat{y}$) relates to confidence intervals. – Rob Hyndman Apr 12 '11 at 14:44
  • @RobHyndman thank you for your excellent answer (one year ago ;)) however, am I missing something or is the term in the square root $N \times N$? – Seb Sep 28 '12 at 06:06
  • @Seb. $X^*$ is a row vector, so the term is scalar. – Rob Hyndman Sep 28 '12 at 23:26
  • @RobHyndman yes of course! sorry - i didn't realize – Seb Sep 29 '12 at 07:00
  • Hi, when calculating the prediction & confidence intervals, should we use the square root of variance $\sqrt{Var{(\hat{y})}}$ to multiply with the critical value or use the standard error $\frac{\sqrt{Var{(\hat{y})}}}{\sqrt{n}}$ to multiply with the critical value? – Stephen Ge Apr 02 '21 at 01:08
  • I did not provide the variance of y-hat, but the **forecast variance**. – Rob Hyndman Nov 25 '21 at 01:19

Are you by chance after the different types of prediction intervals? The predict.lm manual page has

 ## S3 method for class 'lm'
 predict(object, newdata, se.fit = FALSE, scale = NULL, df = Inf, 
         interval = c("none", "confidence", "prediction"),
         level = 0.95, type = c("response", "terms"),
         terms = NULL, na.action = na.pass,
         pred.var = res.var/weights, weights = 1, ...)

and

Setting ‘intervals’ specifies computation of confidence or prediction (tolerance) intervals at the specified ‘level’, sometimes referred to as narrow vs. wide intervals.

Is that what you had in mind?
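For the question's model, a quick sketch of both interval types (the new point x1 = x2 = 0.5 is chosen purely for illustration):

```r
set.seed(342)
x1 <- runif(100); x2 <- runif(100)
y  <- x1 + x2 + 2*x1*x2 + rnorm(100)
fit <- lm(y ~ x1*x2)

new <- data.frame(x1 = 0.5, x2 = 0.5)
predict(fit, new, interval = "prediction")  # wide: bounds for a new observation of y
predict(fit, new, interval = "confidence")  # narrow: bounds for E(y | x1, x2)
```

Both return a matrix with columns `fit`, `lwr`, and `upr`; the prediction interval is always the wider of the two.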

Dirk Eddelbuettel

@Tal: Might I suggest Kutner et al.'s *Applied Linear Statistical Models* as a fabulous source for linear models.

There is the distinction between

  1. a prediction of $Y$ from an individual new observation $X_{vec}$,
  2. the expected value of $Y$ conditioned on $X_{vec}$, $E(Y|X_{vec})$, and
  3. $Y$ from several instances of $X_{vec}$.

They are all covered in detail in the text.

I think you are looking for the formula for the confidence interval around $E(Y|X_{vec})$, and that is $\hat{Y} \pm t_{1-\alpha/2}s_{\hat{Y}}$, where $t$ has $n-2$ d.f. and $s_{\hat{Y}}$ is the standard error of $\hat{Y}$, with $s_{\hat{Y}}^{2} = \frac{\sigma^{2}}{n} +(X_{vec}-\bar{X})^{2}\frac{\sigma^{2}}{\sum(X_{i}-\bar{X})^{2}}$ (in practice $\sigma^{2}$ is estimated by the MSE).
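A minimal sketch of this in the one-predictor case (simulated data chosen for illustration, not the question's, with $\sigma^2$ estimated by the MSE):

```r
set.seed(1)                              # illustrative data, not from the question
x <- runif(50)
y <- 1 + 2*x + rnorm(50)
fit <- lm(y ~ x)

xv    <- 0.5                             # the new X_vec
mse   <- summary(fit)$sigma^2            # MSE estimates sigma^2
s.hat <- sqrt(mse/length(x) + (xv - mean(x))^2 * mse/sum((x - mean(x))^2))
yhat  <- unname(predict(fit, data.frame(x = xv)))
tcrit <- qt(0.975, df = length(x) - 2)   # t with n - 2 d.f.
c(lwr = yhat - tcrit * s.hat, upr = yhat + tcrit * s.hat)

## Agrees with:
predict(fit, data.frame(x = xv), interval = "confidence", level = 0.95)
```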

B_Miner
  • (+1) for making the distinction. However, I believe the OP is asking for (1), not (2) (and I have edited the question's title accordingly). Also note that your formula appears to assume the regression depends only on one variable. – whuber Apr 04 '11 at 14:48