
I would like to know whether my simulation approach for finding the coverage of a confidence interval for the prediction $\boldsymbol{\beta}^T\boldsymbol{X}_N$ is correct.

  1. I generated a dataset of $n$ samples of covariates $\boldsymbol{X} \in \mathbb{R}^p$ and responses $Y \in \mathbb{R}$ that follow the linear model $Y_i = \boldsymbol{\beta}^T\boldsymbol{X}_i + \varepsilon_i$ for $i=1,\dots,n$. So I have a design matrix $\mathbb{X} \in \mathbb{R}^{n \times p}$ and a response vector $\mathbb{Y} \in \mathbb{R}^{n}$. Here I set $n = 512$ and $p = 1024$. (The data were generated as multivariate standard normal.)
  2. I created a new independent observation $\boldsymbol{X}_N \in \mathbb{R}^p$, $\boldsymbol{X}_N \sim N(0,I_p)$
  3. Compute $\widehat{\boldsymbol{\beta}} \in \mathbb{R}^p$ for the linear model
  4. Find the true value of $\boldsymbol{\beta}^T\boldsymbol{X}_N$ (since I can compute the true parameter $\boldsymbol{\beta}$)
  5. Compute an estimator $\hat{V}$ of the variance $\text{var}(\widehat{\boldsymbol{\beta}}^T\boldsymbol{X}_N)$
  6. Compute the confidence interval for $\boldsymbol{\beta}^T\boldsymbol{X}_N$ as $(\widehat{\boldsymbol{\beta}}^T\boldsymbol{X}_N \pm z_{\alpha/2}\hat{V}^{1/2})$ assuming asymptotic normality.

Now I'm not sure how to proceed. Should I repeat the process from (3) or generate another dataset? Any help would be much appreciated.

Edit: I'm interested in the behavior of $\widehat{\boldsymbol{\beta}}^T\boldsymbol{X}_N$ since this is a univariate quantity. The new observation $\boldsymbol{X}_N$ is fixed once it is generated. So yes, I should say prediction interval, but the rest of the question remains.
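
For concreteness, here is a minimal R sketch of steps (1)-(6) as described above. The estimator (minimum-norm least squares via the pseudoinverse, since $p > n$ makes ordinary least squares non-unique) and the value of $\hat{V}$ are only placeholders standing in for whatever estimator and variance estimate are actually used.

library(MASS)  # for ginv()

set.seed(1)
n <- 512; p <- 1024
beta  <- rnorm(p)                          # true parameter vector
X     <- matrix(rnorm(n * p), n, p)        # step 1: design matrix
y     <- as.vector(X %*% beta + rnorm(n))  # step 1: responses
x_new <- rnorm(p)                          # step 2: new observation, fixed afterwards

beta_hat  <- as.vector(ginv(X) %*% y)      # step 3: placeholder estimator (min-norm LS)
theta     <- sum(beta * x_new)             # step 4: true beta' x_new
theta_hat <- sum(beta_hat * x_new)         # estimated beta' x_new

V_hat <- 1                                 # step 5: placeholder for var(beta_hat' x_new)
ci <- theta_hat + c(-1, 1) * qnorm(0.975) * sqrt(V_hat)  # step 6: 95% interval
covered <- (ci[1] < theta) & (theta < ci[2])             # single 0/1 coverage indicator

My question is essentially what to wrap in a loop when turning the single indicator `covered` into an estimated coverage probability.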

Ejrionm
  • You’re mixing up confidence intervals and prediction intervals. – Dave Aug 20 '21 at 14:11
  • Are you interested in determining the coverage of the prediction for the true mean, or are you interested in determining the coverage of the prediction on a new observation? – Demetri Pananos Aug 20 '21 at 14:23
  • @Dave Edited, thanks – Ejrionm Aug 20 '21 at 14:59
  • @DemetriPananos Edited. I'm interested in the coverage of the prediction on a new observation. Since in my work $n,p$ would change (and so the dimension of the parameter and data) we need something that's univariate – Ejrionm Aug 20 '21 at 15:00
  • I don't follow this, because it looks like if you iterate starting at (3), nothing changes. You will learn something interesting if you do this in two ways: one iterating at (2) and the other iterating at (1). (This reflects the difference between a model with fixed explanatory variables and one with random explanatory variables.) The interesting result is that the answer doesn't matter. – whuber Aug 20 '21 at 17:01
  • If you are interested in the coverage prediction of a new observation, why aren't you adding the variance of $\epsilon$ in 6? – Luca Citi Aug 21 '21 at 14:40
  • @Ejrionm Please unaccept my answer and accept Jarle's – Demetri Pananos Aug 21 '21 at 16:16

3 Answers


EDIT: I have removed my code because Jarle has pointed out that coverage statements are conditional on the predictor of a new point remaining fixed. My simulation was hence incorrect, and I encourage readers to read his answer instead.

You are describing the process for validating coverage of a confidence interval. This will not give you 95% coverage of a new observation. The reason is easy to argue.

The width of the confidence interval is inversely proportional to $\sqrt{n}$. Quadruple the sample size and, ceteris paribus, the width of the interval decreases by a factor of 2. This means we can make the interval arbitrarily small by taking more samples. However, taking more data does not change the data generating process; the residual variance remains unchanged. This means the coverage of the confidence interval for a new observation will be woefully below the nominal level, because a confidence interval is a measure of uncertainty at the level of the estimate, not of the data generating process.

A prediction interval incorporates uncertainty in the estimate and the data generating processes. Here is a good tutorial on prediction intervals for OLS in R and python.
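
As a small illustration of the width argument above (a sketch of my own, not a coverage simulation; see Jarle's answer for that), compare the two intervals returned by `predict.lm` as the training sample grows: the confidence interval keeps shrinking, while the prediction interval's width levels off near $2 z_{\alpha/2}\sigma$.

# Sketch: width of confidence vs prediction intervals at x = 1
# as the training sample size grows (true sigma = 1 throughout)
set.seed(1)
widths <- sapply(c(50, 500, 5000), function(n) {
  x <- rnorm(n)
  y <- 1 + 2 * x + rnorm(n)  # simple linear data generating process
  fit <- lm(y ~ x)
  ci <- predict(fit, data.frame(x = 1), interval = "confidence")
  pi <- predict(fit, data.frame(x = 1), interval = "prediction")
  c(ci.width = ci[1, "upr"] - ci[1, "lwr"],
    pi.width = pi[1, "upr"] - pi[1, "lwr"])
})
colnames(widths) <- c("n=50", "n=500", "n=5000")
round(widths, 3)
# ci.width shrinks toward 0; pi.width stays near 2 * 1.96 * sigma, about 3.92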

Demetri Pananos
  • Thanks for your answer. My situation is this: we are working in the context where $n$ and $p$ increase (of course, in real life these are fixed but we're interested in the case $n < p$) and we found that for any $\boldsymbol{X}_N \in \mathbb{R}^p$ it holds $V^{-1/2}(\widehat{\boldsymbol{\beta}} - \boldsymbol{\beta})^T \boldsymbol{X}_N \leadsto N(0,1)$ (it doesn't matter whether this is the OLS estimator or something else) where $V$ is the asymptotic variance. We worked with $\boldsymbol{\beta}^T\boldsymbol{X}_N$ because this remains univariate when $p \rightarrow \infty$. – Ejrionm Aug 20 '21 at 17:56
  • Of course, from this we can say that an interval for $\boldsymbol{\beta}^T\boldsymbol{X}_N$ is $(\widehat{\boldsymbol{\beta}}^T\boldsymbol{X}_N \pm z_{\alpha/2}\hat{V}^{1/2})$. So now we want to assess this interval by computing the coverage, I simulated a dataset just for the sake of simplicity – Ejrionm Aug 20 '21 at 17:56
  • @Ejrionm I understand what your goal is, but I'm saying the confidence interval is the wrong estimator. It will not give you the coverage you seek for a new observation. – Demetri Pananos Aug 20 '21 at 17:58
  • So, what is the correct approach to assess this? – Ejrionm Aug 20 '21 at 18:32
  • @Ejrionm You should 1) Generate a training data set. 2) Fit the OLS model. 3) Compute the prediction interval as per the link I've shared in my answer. 4) Generate new data from the same data generating process (preferably a large amount, maybe 10000 samples). 5) Determine the proportion of new samples which lie within the prediction interval conditional on the predictors. You can repeat this for various $n$ (training set size) and $p$. – Demetri Pananos Aug 20 '21 at 18:41
  • @Ejrionm I've added some code to illustrate the process. – Demetri Pananos Aug 20 '21 at 18:57
  • I think your simulation is incorrect. You seem to base your estimate of coverage on a single realisation of the part of the data you use to estimate beta (`y` in your data) but you need to use many realisations. This is why the coverage is significantly different from the nominal level in most (if not all?) cases. – Jarle Tufto Aug 20 '21 at 21:14
  • @JarleTufto I think it's correct. I use a different dataset to construct the model than the dataset I use to compute coverage. – Demetri Pananos Aug 20 '21 at 21:42
  • @DemetriPananos Thank you so much for your time – Ejrionm Aug 20 '21 at 22:33
  • @DemetriPananos No, it can't be. Prediction intervals for the linear model are exact (regardless of sample size), but your estimated coverage clearly differs significantly from the nominal level (you have 10000 replicates). How do you explain that? – Jarle Tufto Aug 21 '21 at 08:02
  • @JarleTufto Do you mean to say the coverage should be *exactly* 95%? The interval depends on estimates of the residual variance. As the sample size increases, that estimate becomes more precise and the coverage closer to the nominal level. You see this happen in my simulation as the sample size increases. As for the rows with larger/smaller coverage, please remember these are from a single model. I imagine you would see the results you expected by repeating this simulation many times for the specified $n,p$, as I instruct OP to do. – Demetri Pananos Aug 21 '21 at 12:51
  • @JarleTufto Additionally, having far greater/lesser coverage than the nominal level is not something that is completely unexpected and would depend on $n,p$. As with binomial intervals, I would expect the coverage to be less/more than the nominal when $n,p$ are such that the regression is dubious or high variance. That being said, you are welcome to submit your own answer to OP's question. If I'm wrong, then I would appreciate a definitive answer with the correct approach. – Demetri Pananos Aug 21 '21 at 13:11
  • @DemetriPananos Fine, I added my own answer. – Jarle Tufto Aug 21 '21 at 14:44
  • @DemetriPananos What you say in your edit is not what I point out. While the predictors are usually regarded as fixed quantities, the coverage would still be equal to the nominal level if the new x-value varied between realisations. – Jarle Tufto Aug 21 '21 at 18:52

As requested by Demetri in the comments to his (incorrect) answer, here is R code that correctly estimates the coverage of both prediction and confidence intervals by simulation. For simplicity, only a single covariate is considered.

coverage <- function(
  n, # sample size
  x = rnorm(n), # covariate 
  beta = rnorm(2), # true parameter values
  sigma = 1, # true error standard deviation
  xnew = 1, # new x-value for which to predict
  nsim = 1e+4 # number of replicates to simulate
) {
  ci.hits <- 0
  pi.hits <- 0
  for (i in 1:nsim) {
    # simulate the observed data
    y <- beta[1] + beta[2]*x + rnorm(n, sd=sigma)
    # fit the model
    mod <- lm(y ~ x)
    # simulate a new observation to predict
    yhat <- beta[1] + beta[2]*xnew
    ynew <- yhat + rnorm(1, sd=sigma)
    # compute confidence and prediction intervals
    pi <- predict(mod, newdata=data.frame(x=xnew), interval="prediction")
    if (pi[1,"lwr"] < ynew & ynew < pi[1,"upr"])
      pi.hits <- pi.hits + 1
    ci <- predict(mod, newdata=data.frame(x=xnew), interval="confidence")
    if (ci[1,"lwr"] < yhat & yhat < ci[1,"upr"])
      ci.hits <- ci.hits + 1
  }
  list(pi.coverage = pi.hits/nsim, ci.coverage=ci.hits/nsim)
}

Varying, for example, the sample size $n$ (see the simulations below), the coverages never deviate significantly from the nominal level of 0.95. This is of course as expected, since it is well known that these intervals are exact (see any mathematical statistics textbook on linear regression).

> set.seed(1)
> coverage(n = 5)
$pi.coverage
[1] 0.952

$ci.coverage
[1] 0.9501

> coverage(n = 10)
$pi.coverage
[1] 0.9478

$ci.coverage
[1] 0.9507

> coverage(n = 20)
$pi.coverage
[1] 0.9532

$ci.coverage
[1] 0.9509
Jarle Tufto
  • Ok, yes, I concede that I am wrong. Here is the sticking point: whereas I used several different values of the predictor, you condition on a single value of the predictor (here called `xnew`). I had made the mistake of thinking that coverage statements applied across the function, in addition to being pointwise. To make that the case, I am told a Scheffé adjustment might be applicable. I will edit my answer to point to yours instead. – Demetri Pananos Aug 21 '21 at 16:15
  • @DemetriPananos No, the sticking point is still what I pointed out in my first comment to your answer. I guess you agree that in frequentist statistics the coverage of a confidence/prediction interval is the probability that the interval contains the parameter/random variable of interest in the long run. So it follows that you need to simulate many realisations from the model to obtain a valid estimate of the coverage. – Jarle Tufto Aug 21 '21 at 18:49

Typically one would be interested in the confidence interval $(\widehat{\boldsymbol{\beta}}^T_n\boldsymbol{x}_i \pm z_{\alpha/2}\hat{\text{se}}_n)$, a set of plausible values for the unknown fixed true ${\boldsymbol{\beta}}^T\boldsymbol{x}_i$, where $\hat{\text{se}}_n$ is the estimated standard error of $\widehat{\boldsymbol{\beta}}^T_n\boldsymbol{x}_i$ based on a sample of size $n$.

If this is not what you are interested in, perhaps you are interested in predicting the estimated mean resulting from a future experiment using your current sample of size $n$. Using your notation $N$ for the future sample size, this prediction interval would take the form

$$(\widehat{\boldsymbol{\beta}}^T_n\boldsymbol{x}_i \pm z_{\alpha/2}\sqrt{\hat{\text{se}}_n^2 + n\cdot\hat{\text{se}}_n^2/N}).$$

In this setting $\widehat{\boldsymbol{\beta}}^T_n\boldsymbol{x}_i$ is your best estimate of the future experimental result $\widehat{\boldsymbol{\beta}}^T_N\boldsymbol{x}_i$, and $n\cdot\hat{\text{se}}_n^2/N$ is your best estimate of its variance. In repeated sampling both $\widehat{\boldsymbol{\beta}}^T_n\boldsymbol{x}_i$ and $\widehat{\boldsymbol{\beta}}^T_N\boldsymbol{x}_i$ will vary, and this is reflected in the term $\sqrt{\hat{\text{se}}_n^2 + n\cdot\hat{\text{se}}_n^2/N}$. The prediction interval above is the inversion of a Wald test of $H_0: \widehat{\boldsymbol{\beta}}^T_N\boldsymbol{x}_i= c$ using the pivotal quantity

$$\frac{\widehat{\boldsymbol{\beta}}^T_n\boldsymbol{x}_i-\widehat{\boldsymbol{\beta}}^T_N\boldsymbol{x}_i}{\sqrt{\hat{\text{se}}_n^2 + n\cdot\hat{\text{se}}_n^2/N}}\sim N(0,1).$$

The resulting p-value is a predictive p-value. It represents the plausibility of the hypothesis given the observed data and is useful for controlling the type I error rate $\alpha$ when making a prediction. In repeated sampling, a $95\%$ prediction interval will cover the future experimental result $95\%$ of the time. Let me know if I have made any mistakes, and if you need further details. Here is a Wikipedia entry on prediction intervals, and a paper.
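
As a rough check of the coverage claim above, here is a small simulation sketch (my own construction: a single covariate, $\boldsymbol{x}_i$ fixed at 1, a normal approximation, and arbitrary choices of $n$, $N$, and the number of replicates).

# Sketch: coverage of  beta_hat_n' x  +/-  z * sqrt(se_n^2 + n * se_n^2 / N)
# for the future estimate beta_hat_N' x (single covariate, x fixed at 1)
set.seed(1)
n <- 100; N <- 200; xnew <- 1
beta <- c(1, 2); sigma <- 1
nsim <- 5000
hits <- replicate(nsim, {
  # current experiment of size n
  x <- rnorm(n)
  y <- beta[1] + beta[2] * x + rnorm(n, sd = sigma)
  fit_n <- lm(y ~ x)
  pr <- predict(fit_n, data.frame(x = xnew), se.fit = TRUE)
  # future experiment of size N
  xN <- rnorm(N)
  yN <- beta[1] + beta[2] * xN + rnorm(N, sd = sigma)
  fit_N <- lm(yN ~ xN)
  est_N <- predict(fit_N, data.frame(xN = xnew))
  # does the prediction interval cover the future estimated mean?
  half <- qnorm(0.975) * sqrt(pr$se.fit^2 + n * pr$se.fit^2 / N)
  abs(pr$fit - est_N) < half
})
mean(hits)  # should be close to 0.95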

Geoffrey Johnson
  • Thank you so much for your answer. My situation is this: we are working in the context where $n$ and $p$ increase (of course, in real life these are fixed but we're interested in the case $n < p$) and we found that for any $\boldsymbol{X}_N \in \mathbb{R}^p$ it holds $V^{-1/2}(\widehat{\boldsymbol{\beta}} - \boldsymbol{\beta})^T \boldsymbol{X}_N \leadsto N(0,1)$ (it doesn't matter whether this is the OLS estimator or something else) where $V$ is the asymptotic variance. We worked with $\boldsymbol{\beta}^T\boldsymbol{X}_N$ because this remains univariate when $p \rightarrow \infty$. – Ejrionm Aug 20 '21 at 17:57
  • Of course, from this we can say that an interval for $\boldsymbol{\beta}^T\boldsymbol{X}_N$ is $(\widehat{\boldsymbol{\beta}}^T\boldsymbol{X}_N \pm z_{\alpha/2}\hat{V}^{1/2})$. So now we want to assess this interval by computing the coverage, I simulated a dataset just for the sake of simplicity – Ejrionm Aug 20 '21 at 17:57
  • It sounds like you are interested in an interval that covers the population mean, not a prediction interval for a future experimental result. The key to assessing coverage is to simulate repeated experiments (say 10,000), each time constructing the confidence interval and tallying (0 or 1) whether the interval indeed covers the truth. Then the average of these tallies is your coverage rate. Ideally a 95% confidence interval will cover the truth in 95% of your simulated experiments. – Geoffrey Johnson Aug 20 '21 at 18:11
  • But what would happen if I have just one dataset (no simulation) and I want to construct the interval? In that case I have just one estimate of $\beta$ and can construct only one interval (which will depend on $X_N$), so coverage is something I can only assess with simulations? Thanks for your answers, again. – Ejrionm Aug 20 '21 at 18:37
  • That is right. For any one simulated data set the resulting confidence interval either covers or it does not. To investigate the coverage probability you must look at many repeated experiments and identify the proportion of intervals that cover. – Geoffrey Johnson Aug 20 '21 at 21:33
  • Thank you so much for your time – Ejrionm Aug 20 '21 at 22:34