
This question is somewhat similar to question 147242. I'm dealing with a multiple linear regression model, say: $$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 $$ and I'm looking for an algebraic equation to calculate (numerically) the prediction interval (PI) for a new prediction $y_0$.

In contrast to the previously discussed examples, each of the model coefficients ($\beta_0$, $\beta_1$, and $\beta_2$) in this case has an error bar, extracted via bootstrapping from a distribution. These distributions are numerical rather than analytic, and each of the three coefficients has its own distribution.

So far, most examples deal with either a trivial simple linear regression or a multiple linear regression, but in each case only the impact of the $x_1 \dots x_n$ terms is considered (e.g., question 147242). This is already quite useful; however, in my specific case the $\beta_0$, $\beta_1$, and $\beta_2$ are the mean values of distributions [*cf. PPS]. Is there a way to incorporate the uncertainty of the $\beta_i$'s (i.e., their "error bars") into the calculation of the prediction interval (and the confidence interval)?

To put it very simply: how can the equation $$ \hat{V}_f=s^2\cdot\mathbf{x_0}\cdot\mathbf{(X^TX)^{-1}}\cdot\mathbf{x_0^T} + s^2 $$ be modified to also incorporate the fact that the coefficients themselves are the means of distributions?
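For concreteness, the standard formula above can be evaluated numerically. Below is a minimal sketch with made-up toy data (the design matrix, noise level, and the new point `x0` are all arbitrary) that computes $\hat{V}_f$ and the corresponding $t$-based 95% prediction interval:

```python
import numpy as np
from scipy import stats

# Toy data (illustrative only): n observations, intercept + two predictors
rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)

# OLS fit and residual variance estimate s^2
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
p = X.shape[1]
s2 = resid @ resid / (n - p)

# Prediction variance at a new point x0 (row vector including intercept term):
# V_f = s^2 * x0 (X'X)^-1 x0' + s^2
x0 = np.array([1.0, 0.5, -1.0])
XtX_inv = np.linalg.inv(X.T @ X)
V_f = s2 * (x0 @ XtX_inv @ x0 + 1.0)

# 95% prediction interval using the t distribution with n - p dof
t = stats.t.ppf(0.975, df=n - p)
y0_hat = x0 @ beta_hat
pi = (y0_hat - t * np.sqrt(V_f), y0_hat + t * np.sqrt(V_f))
```

Note that the first term, $s^2\,\mathbf{x_0}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x_0}^T$, is exactly the coefficient-uncertainty contribution for an ordinary single-fit regression; dropping the trailing $s^2$ gives the (narrower) confidence interval for the mean response.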

(PS: One could create an ensemble of model instances with the $\beta_i$ drawn from their respective distributions and, from the resulting distribution of $y_0$ values, calculate the CI of $y_0$; but this is not computationally efficient and brings a number of other issues I would like to avoid.)
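For reference, the ensemble approach described in the PS can at least be stated compactly. A minimal sketch, assuming the bootstrap draws are available as an `(N, 3)` array `beta_samples` (here replaced by made-up Gaussian draws) together with a residual standard deviation `s` from a single fit:

```python
import numpy as np

# Hypothetical inputs: beta_samples holds N bootstrap draws of
# (beta_0, beta_1, beta_2); s is the residual standard deviation.
# Both are made up here for illustration.
rng = np.random.default_rng(1)
beta_samples = rng.normal(loc=[1.0, 2.0, -0.5], scale=0.1, size=(10_000, 3))
s = 0.3

x0 = np.array([1.0, 0.5, -1.0])  # new point, intercept term included

# One prediction per coefficient draw; adding fresh observation noise
# turns the confidence interval into a prediction interval.
mean_draws = beta_samples @ x0
y0_draws = mean_draws + rng.normal(scale=s, size=len(beta_samples))

ci = np.percentile(mean_draws, [2.5, 97.5])  # empirical 95% CI (mean response)
pi = np.percentile(y0_draws, [2.5, 97.5])    # empirical 95% PI
```

This is exactly the brute-force route the question hopes to replace with a closed-form expression, but it is useful as a ground truth to validate any analytic shortcut against.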

(PPS: The regression model presented is not the result of a direct regression toward a single data set, instead it is constructed as follows:

  1. Create an ensemble of N data sets.
  2. On each data set, a regression yields a linear model as indicated in the post above. This gives rise to N values for each of the coefficients $\beta$.
  3. The mean of each of the three sets is calculated.
  4. These three mean coefficients are the coefficients of the model presented above.
  5. The goal here: find the prediction interval for the averaged model above taking into account the fact that the coefficients $\beta$ are calculated from numerical distributions.)
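The steps above can be sketched as follows (with synthetic data standing in for the real ensemble; `true_beta`, the noise level, `N`, and `n` are all made-up assumptions). The sketch also records the empirical covariance of the fitted coefficients, which is the quantity one would need in order to fold coefficient uncertainty into $\hat{V}_f$:

```python
import numpy as np

rng = np.random.default_rng(2)
true_beta = np.array([1.0, 2.0, -0.5])  # made up for the synthetic ensemble
N, n = 200, 50                          # N data sets of n points each

betas = np.empty((N, 3))
for i in range(N):                      # step 1: ensemble of N data sets
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    y = X @ true_beta + rng.normal(scale=0.3, size=n)
    betas[i], *_ = np.linalg.lstsq(X, y, rcond=None)  # step 2: fit each set

beta_mean = betas.mean(axis=0)            # steps 3-4: the averaged model
beta_cov = np.cov(betas, rowvar=False)    # empirical coefficient covariance

# Under the (hedged) assumption that the coefficient distributions are
# adequately summarized by this sample covariance, their contribution to
# the prediction variance at x0 is the quadratic form x0' Cov(beta) x0.
x0 = np.array([1.0, 0.5, -1.0])
var_coef = x0 @ beta_cov @ x0
```

Whether `beta_cov` itself or `beta_cov / N` is the appropriate quantity depends on whether the interval should cover a prediction from a single ensemble member or from the averaged model, which is precisely the subtlety step 5 asks about.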
  • Standard formulas for prediction limits in regression already account for uncertainty in the estimates of the coefficients. Could you therefore clarify what you mean by "the possible variation of the $\beta_i$"? – whuber Apr 01 '20 at 13:00
  • (1) How is that the case if the $\beta_i$ are nowhere present in the equation? (Just asking, since I am not a statistician.) It would be nice if the results were totally independent of the distribution of the $\beta_i$ for an ensemble of fitted model instances, but I really doubt that is the case. (2) In my case, the $\beta_i$ are the mean values of a set of $\beta_i$ values (as stated in the question). Their distribution is something undefined and ugly (a purely numerical result which is far from Gaussian). Hence "possible variation" would mean variation around this mean. – DannyVanpoucke Apr 01 '20 at 13:10
  • I am unable to follow your comment: the $\beta_i,$ by definition, are the coefficients of the model. In what sense would they be "nowhere present" in any relevant equation?? If you're referring to the formula for the prediction interval, then review its derivation: it starts with the *estimates* of the $\beta_i$ and so is directly related to them. – whuber Apr 01 '20 at 13:47
  • I'm referring to the prediction interval indeed. The relevant equation is the $\hat{V}_f$ of the post (yes, of course there is the estimate PI$(y) = y \pm \hat{V}_f$, so the $\beta_i$ are present in $y$, but where are the error bars of the $\beta_i$?). As I read it now, the error bars are purely a function of the descriptors $x_i$. Regarding the derivation: there are $\beta_i$, but those are numbers obtained from a fit/regression, not distributions (cf. *Statistical Analysis and Data Display* by Heiberger and Holland). – DannyVanpoucke Apr 01 '20 at 14:17
  • The error bars depend strongly on an estimate of the error variance $s^2,$ which comes from the response variable. Thus they are not "purely [a] function of" the explanatory variable. – whuber Apr 01 '20 at 14:36
  • So would it be possible to either reopen this question or to answer it? (Note that the latter has not happened, as you do not address the issue I state.) This seems to be a problem which is not the default situation in statistics (toward which you are trying to pull it), hence the reason I posted the question here. For simplicity, again: I have error bars on the $\beta_i$ / a distribution of the $\beta_i$ / a CI on the $\beta_i$ (where $i$ does not refer to the data-point index used for the regression), and I want to know how to incorporate this into the calculation of a PI. – DannyVanpoucke Apr 01 '20 at 14:52
  • It does look like a unique question, but could you elaborate on what this "ensemble of models" is, how they might be related, what procedure(s) you use to estimate their parameters, and the specific intended meaning of this prediction interval? – whuber Apr 01 '20 at 16:04
  • No, I can not do this here. But I'll gladly put a reference to the paper here, once it is finished. None of these questions, though in their own right very interesting, are relevant to the problem as the solution I am looking for should be independent of this (to the same level the answer in 147242 is independent of how the data-points for the regression are generated). – DannyVanpoucke Apr 01 '20 at 16:24
  • I think that in the absence of such information the question is too vague to be answerable. If others believe otherwise they will vote to reopen it. – whuber Apr 01 '20 at 17:06
  • Then please clearly state which information you want (not the vague questions as now) and explicitly state why and how this information is relevant to being able to answer. Just stating you don't understand it does not help. – DannyVanpoucke Apr 01 '20 at 17:17
  • Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/106202/discussion-between-dannyvanpoucke-and-whuber). – DannyVanpoucke Apr 01 '20 at 17:55
  • If instead of your step 3 (compute means) you did a meta-analysis, you would have the standard errors you need, I think, but I am not sure whether that completely answers your question. – mdewey Apr 03 '20 at 14:53
  • Thanks for the idea. I'm not so sure either. My goal is to get errors on $y$, not on $\beta$ (those I have, in one form or another). In addition, it is not my intent to personally look at the data, but to automate the process and build an efficient method for the prediction (HPC background, sorry). I could do an ensemble of predictions and then get the statistics from the set of $y$'s... but why run 1000 or 10000 predictions if you can do 1 and construct the error based on the ensemble of models? – DannyVanpoucke Apr 03 '20 at 15:47
