4

I have fitted a polynomial to a data set that I have. Thus I have obtained coefficients $\beta_i$ for $i=0,1,2$ and have a relationship of the form $$Y=\beta_0+\beta_{1}X+\beta_{2}X^{2}+\varepsilon$$ between the predictor $X$ and the response $Y$, where $\varepsilon$ is the error term.

My question is about computing confidence intervals.
I can compute for each of the coefficients $\beta_i$ the confidence interval using the `confint()` function in R. This gets me three (95%) confidence intervals $[a_0,b_0]$, $[a_1,b_1]$, and $[a_2,b_2]$ for $\beta_0$, $\beta_1$, and $\beta_2$.
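For concreteness, here is a minimal sketch of this step (the data below are just placeholders standing in for my actual data set):

set.seed(1)
d <- data.frame(x = rnorm(50), y = rnorm(50))    # placeholder data
my_fitted_data <- lm(y ~ x + I(x^2), data = d)   # quadratic fit
confint(my_fitted_data, level = 0.95)            # 95% CIs [a_i, b_i] for beta_0, beta_1, beta_2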

Let's say I would like to predict the average value of $Y$ at a new sample point $x=0.25$; that is, I would like to know the confidence interval for the mean response at this point.
Using the previously obtained confidence intervals for the coefficients, I can compute that this mean response has to lie (with probability 95%) in the interval $$[a_0+a_{1}\cdot 0.25+a_{2}\cdot 0.25^{2},\; b_0+b_{1}\cdot 0.25+b_{2}\cdot 0.25^{2}].$$

Unfortunately, when I compare the values I obtain this way with the output of the R command `predict(my_fitted_data, data.frame(x=0.25), interval = 'confidence')`, R's interval is much narrower than the one I obtained by hand as described above.
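Continuing the placeholder sketch from above, the comparison I am describing looks something like this:

ci <- confint(my_fitted_data)    # rows: (Intercept), x, I(x^2)
x0 <- c(1, 0.25, 0.25^2)
# by-hand interval: combine the lower endpoints a_i and upper endpoints b_i
c(sum(ci[, 1] * x0), sum(ci[, 2] * x0))
# R's interval for the mean response, for comparison
predict(my_fitted_data, data.frame(x = 0.25), interval = "confidence")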
Did I do anything wrong? If so, what?

Thank you.

l7ll7
  • The parameter estimates are correlated. `vcov(my_fitted_data)` shows you the variance-covariance matrix of the parameter estimates, `cov2cor(vcov(my_fitted_data))` the correlation matrix (see the sketch after these comments). Your approach ignores this correlation. – Wolfgang Jan 29 '17 at 21:32
  • @Wolfgang I don't really understand what that means (how to interpret the output of these commands), unfortunately... Is the take-home message: use the R function for the correct confidence interval? – l7ll7 Jan 29 '17 at 21:40
  • @user10324 I asked a similar question a while ago. The answer there should give you what you need: http://stats.stackexchange.com/questions/10439/confidence-interval-for-difference-of-means-in-regression – mark999 Jan 29 '17 at 21:42
  • Adding to the other comments: you are [misunderstanding confidence intervals](http://stats.stackexchange.com/questions/26450/why-does-a-95-confidence-interval-ci-not-imply-a-95-chance-of-containing-the); moreover, you seem to be asking about *prediction intervals*. – Tim Jan 29 '17 at 21:44
  • @mark999 That answer is unfortunately way over my head... So it seems that I should just use the `R` command, as stated? – l7ll7 Jan 29 '17 at 21:44
  • @Tim No, that is why I emphasized the word "average". For a single point estimate I would indeed be interested in the prediction interval. – l7ll7 Jan 29 '17 at 21:46
  • @user10324 yes, your computation by-hand is incorrect and if you want to get proper intervals then better stick to the standard R functions. – Tim Jan 29 '17 at 21:46
  • @Tim thanks! ok, that is essentially what I wanted to know (as the details, as the linked answer shows, are too difficult for me to really understand without putting in a lot of work) – l7ll7 Jan 29 '17 at 21:47
  • @Tim I don't think the question is asking about prediction intervals. – mark999 Jan 29 '17 at 21:47
  • @user10324 Still, CIs do not give you "95% probability that x has to lie in the interval". – Tim Jan 29 '17 at 21:47
  • @mark999 maybe not, it is not clear for me from the question. – Tim Jan 29 '17 at 21:47
  • @Tim Indeed, they give the interval in which, with 95% probability, the response lies *on average* (i.e. sample a lot of x's, take the mean, so that the error term no longer carries weight, and that mean lies in the CI) – l7ll7 Jan 29 '17 at 21:48
  • @user10324 I'll give you an example. – mark999 Jan 29 '17 at 22:03
  • @user10324 that's a different question to the one you posted. Don't ask entirely new questions in comments, post new questions (after searching as they're likely to be answered in some form already) instead. – Glen_b Jan 30 '17 at 00:57
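A quick sketch of what Wolfgang's comment suggests, assuming the placeholder fit from the question:

vcov(my_fitted_data)             # variance-covariance matrix of the coefficient estimates
cov2cor(vcov(my_fitted_data))    # correlation matrix; the nonzero off-diagonals are what the by-hand approach ignores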

1 Answer

8

The mistake you made was to ignore the covariances between the estimators $\hat{\beta}_0$, $\hat{\beta}_1$ and $\hat{\beta}_2$.

You want a 95% confidence interval for $$ E(Y|X=0.25) = \beta_0 + 0.25\beta_1 + 0.25^2\beta_2 = a\beta, $$ with $a = [1\ \ 0.25\ \ 0.25^2]$ and $\beta = [\beta_0\ \ \beta_1\ \ \beta_2]^T$.

Under the usual assumptions (normally distributed errors, etc.), the confidence interval you want is $$ a\hat{\beta} \pm t_{n-p,\, 0.975} \sqrt{a\hat{\sigma}^2 (\mathbf{X}^T\mathbf{X})^{-1}a^T}, $$ with $n$ being the sample size, $p$ being the number of regression coefficients including the intercept (i.e. $p=3$), and $t_{n-p,\, 0.975}$ being the value such that $P(T_{n-p} \leq t_{n-p,\, 0.975}) = 0.975$ if $T_{n-p}$ is a Student t random variable with $n-p$ degrees of freedom. We use 0.975 because that gives 0.025 in the upper tail and 0.025 in the lower tail. The matrix $\hat{\sigma}^2 (\mathbf{X}^T\mathbf{X})^{-1}$ is the estimated covariance matrix of the estimators $\hat{\beta}$, which can be obtained in R using the vcov function ($\mathbf{X}$ is the design matrix).

I acknowledge NRH, who showed me this when I asked a similar question (the notation I've used here is slightly different).

An example:

set.seed(1)

d <- data.frame(x = rnorm(50), y = rnorm(50))

lm1 <- lm(y ~ x + I(x^2), data=d)

x.value <- 0.25
conf.level <- 0.95

# R's built-in confidence interval for the mean response at x = 0.25
predict(lm1, data.frame(x = x.value), interval="confidence", level = conf.level)

##         fit        lwr       upr
## 1 0.1254577 -0.2068385 0.4577539

a <- c(1, x.value, x.value^2)

# percentile of the t-distribution to use (0.975 if conf.level is 0.95)
perc <- 1 - (1 - conf.level)/2

# degrees of freedom = sample size - number of coefs
DF <- nrow(d) - length(coef(lm1))

# Confidence interval for the mean response, computed by hand
(a %*% coef(lm1)) + c(-1, 1)*qt(perc, DF)*sqrt(a %*% vcov(lm1) %*% a)

## [1] -0.2068385  0.4577539
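
Note that this by-hand interval matches the lwr and upr values from the predict call above.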
mark999