
Let's take the following example:

    set.seed(342)
    x1 <- runif(100)
    x2 <- runif(100)
    y <- x1 + x2 + 2*x1*x2 + rnorm(100)
    fit <- lm(y ~ x1*x2)

This creates a model of y based on x1 and x2, using an OLS regression. If we wish to predict y for a given x_vec, we could simply plug the coefficients from summary(fit) into the fitted equation.

However, what if we want to predict the lower and upper predictions of y? (for a given confidence level).

How then would we build the formula?

Tal Galili
  • The *Confidence Interval on New Observations* section of [this page](https://web.archive.org/web/20110719124646/http://www.weibull.com/DOEWeb/confidence_intervals_in_multiple_linear_regression.htm) may help. – GaBorgulya Apr 03 '11 at 18:32
  • @Tal Sorry, but it's not really clear to me what you actually mean by "predict the lower and upper predictions of y". Does it have something to do with prediction or tolerance bands? – chl Apr 03 '11 at 19:43
  • @Tal - a couple of queries. When you say ".. y based on x1 and x2, using a OLS regression.", you mean you create a linear model and _estimate parameters using OLS_. Am I right? and @chl's question -- do you want to predict the lower and upper bounds for the prediction interval? – suncoolsu Apr 03 '11 at 19:55
  • @chl, sorry for not being more clear. I am looking for two formulas that will give an interval that will "catch" the "real" value of y 95% of the time. I feel like I'm using the definitions for the CI for the mean, when there is probably some other term I should be using, sorry about that... – Tal Galili Apr 03 '11 at 20:02
  • @suncoolsu - yes and yes. – Tal Galili Apr 03 '11 at 20:03
  • Then you need the formula I suggested above. I will re-post it as an answer. – GaBorgulya Apr 09 '11 at 14:52
  • See the formula in the *Confidence Interval on New Observations* section (prediction interval) of http://www.weibull.com/DOEWeb/confidence_intervals_in_multiple_linear_regression.htm [(permalink)](http://www.webcitation.org/5xonEjGdb). – GaBorgulya Apr 09 '11 at 14:56

3 Answers


You will need matrix arithmetic. I'm not sure how Excel will go with that. Anyway, here are the details.

Suppose your regression is written as $\mathbf{y} = \mathbf{X}\mathbf{\beta} + \mathbf{e}$.

Let $\mathbf{X}^*$ be a row vector containing the values of the predictors for the forecasts (in the same format as $\mathbf{X}$). Then the forecast is given by $$ \hat{y} = \mathbf{X}^*\hat{\mathbf{\beta}} = \mathbf{X}^*(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} $$ with an associated variance $$ \sigma^2 \left[1 + \mathbf{X}^* (\mathbf{X}'\mathbf{X})^{-1} (\mathbf{X}^*)'\right]. $$ Then a 95% prediction interval can be calculated (assuming normally distributed errors) as $$ \hat{y} \pm 1.96 \hat{\sigma} \sqrt{1 + \mathbf{X}^* (\mathbf{X}'\mathbf{X})^{-1} (\mathbf{X}^*)'}. $$ This takes account of the uncertainty due to the error term $e$ and the uncertainty in the coefficient estimates. However, it ignores any errors in $\mathbf{X}^*$. So if the future values of the predictors are uncertain, then the prediction interval calculated using this expression will be too narrow.
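A minimal sketch of this formula in R, checked against `predict()` (using the model from the question; the new point x1 = x2 = 0.5 is an arbitrary choice for illustration):

```r
set.seed(342)
x1 <- runif(100)
x2 <- runif(100)
y  <- x1 + x2 + 2*x1*x2 + rnorm(100)
fit <- lm(y ~ x1*x2)

X     <- model.matrix(fit)                 # design matrix X
Xstar <- c(1, 0.5, 0.5, 0.5*0.5)           # new row: (Intercept), x1, x2, x1:x2
yhat  <- drop(Xstar %*% coef(fit))         # point forecast X* beta-hat
s     <- summary(fit)$sigma                # estimate of sigma
se    <- s * sqrt(drop(1 + t(Xstar) %*% solve(crossprod(X)) %*% Xstar))
tcrit <- qt(0.975, df = fit$df.residual)   # exact t quantile in place of 1.96
c(fit = yhat, lwr = yhat - tcrit * se, upr = yhat + tcrit * se)

## Agrees with the built-in:
predict(fit, newdata = data.frame(x1 = 0.5, x2 = 0.5),
        interval = "prediction", level = 0.95)
```

(Using the t quantile rather than 1.96 matters in small samples; they coincide as $n \to \infty$.)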

Rob Hyndman
  • +1, excellent answer. I should note, though, that a regression model always estimates the conditional expectation, so it is only as good as its regressors. So the last comment, although very good, is not strictly necessary, since if you build a regression model you must trust the regressors. – mpiktas Apr 04 '11 at 03:51
  • why does the 1 come up in the formula? We have $\hat{y}=X^*\beta+X^*(X'X)^{-1}X'e$. Then $\operatorname{Var}(\hat{y})=\operatorname{Var}\left(X^*(X'X)^{-1}X'e\right)=\sigma^2X^*(X'X)^{-1}(X^*)'$? – mpiktas Apr 07 '11 at 14:25
  • The 1 is for prediction intervals. Leave it off for confidence intervals. Var($\hat{y}$) relates to confidence intervals. – Rob Hyndman Apr 12 '11 at 14:44
  • @RobHyndman thank you for your excellent answer (one year ago ;)) however, am I missing something or is the term in the square root $N \times N$? – Seb Sep 28 '12 at 06:06
  • @Seb. $X^*$ is a row vector, so the term is scalar. – Rob Hyndman Sep 28 '12 at 23:26
  • @RobHyndman yes of course! sorry - i didn't realize – Seb Sep 29 '12 at 07:00
  • Hi, when calculating the prediction & confidence intervals, should we use the square root of variance $\sqrt{Var{(\hat{y})}}$ to multiply with the critical value or use the standard error $\frac{\sqrt{Var{(\hat{y})}}}{\sqrt{n}}$ to multiply with the critical value? – Stephen Ge Apr 02 '21 at 01:08
  • I did not provide the variance of y-hat, but the **forecast variance**. – Rob Hyndman Nov 25 '21 at 01:19

Are you by chance after the different types of prediction intervals? The predict.lm manual page has

 ## S3 method for class 'lm'
 predict(object, newdata, se.fit = FALSE, scale = NULL, df = Inf, 
         interval = c("none", "confidence", "prediction"),
         level = 0.95, type = c("response", "terms"),
         terms = NULL, na.action = na.pass,
         pred.var = res.var/weights, weights = 1, ...)

and

Setting ‘intervals’ specifies computation of confidence or prediction (tolerance) intervals at the specified ‘level’, sometimes referred to as narrow vs. wide intervals.

Is that what you had in mind?
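For the question's model, a quick sketch of both interval types (the new point x1 = x2 = 0.5 is chosen purely for illustration):

```r
set.seed(342)
x1 <- runif(100); x2 <- runif(100)
y  <- x1 + x2 + 2*x1*x2 + rnorm(100)
fit <- lm(y ~ x1*x2)

new <- data.frame(x1 = 0.5, x2 = 0.5)
predict(fit, new, interval = "prediction")  # wide: bounds for a new observation of y
predict(fit, new, interval = "confidence")  # narrow: bounds for E(y | x1, x2)
```

Both return a matrix with columns `fit`, `lwr`, and `upr`; the prediction interval is always the wider of the two.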

Dirk Eddelbuettel

@Tal: Might I suggest Kutner et al.'s *Applied Linear Statistical Models* as a fabulous source for linear models.

There is the distinction between

  1. a prediction of $Y$ from an individual new observation $X_{vec}$,
  2. the expected value of $Y$ conditioned on $X_{vec}$, $E(Y|X_{vec})$, and
  3. $Y$ from several instances of $X_{vec}$.

They are all covered in detail in the text.

I think you are looking for the formula for the confidence interval around $E(Y|X_{vec})$, and that is $\hat{Y} \pm t_{1-\alpha/2}s_{\hat{Y}}$, where $t$ has $n-2$ d.f. and $s_{\hat{Y}}$ is the standard error of $\hat{Y}$, with $s_{\hat{Y}}^{2} = \frac{\sigma^{2}}{n} +(X_{vec}-\bar{X})^{2}\frac{\sigma^{2}}{\sum(X_{i}-\bar{X})^{2}}$ (in practice $\sigma^{2}$ is estimated by the MSE).
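A minimal sketch of this in the one-predictor case (simulated data chosen for illustration, not the question's, with $\sigma^2$ estimated by the MSE):

```r
set.seed(1)                              # illustrative data, not from the question
x <- runif(50)
y <- 1 + 2*x + rnorm(50)
fit <- lm(y ~ x)

xv    <- 0.5                             # the new X_vec
mse   <- summary(fit)$sigma^2            # MSE estimates sigma^2
s.hat <- sqrt(mse/length(x) + (xv - mean(x))^2 * mse/sum((x - mean(x))^2))
yhat  <- unname(predict(fit, data.frame(x = xv)))
tcrit <- qt(0.975, df = length(x) - 2)   # t with n - 2 d.f.
c(lwr = yhat - tcrit * s.hat, upr = yhat + tcrit * s.hat)

## Agrees with:
predict(fit, data.frame(x = xv), interval = "confidence", level = 0.95)
```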

B_Miner
  • (+1) for making the distinction. However, I believe the OP is asking for (1), not (2) (and I have edited the question's title accordingly). Also note that your formula appears to assume the regression depends only on one variable. – whuber Apr 04 '11 at 14:48