
I have 20 data points and a linear model $y = \beta_0 + \beta_1 x + \varepsilon$, where $\varepsilon$ is normally distributed with expected value $0$ and variance $\sigma^2$.

If I want to make the confidence intervals for the parameters as small as possible, at which x should I collect data?

gauss123
  • https://stats.stackexchange.com/questions/564256 gives the answer for $\hat\beta_1.$ The answers for the other parameters might be different. In particular, intuitively (a) there's not much you can do to minimize the variance of $\hat\sigma^2$ and (b) to minimize the variance of $\hat\beta_0,$ set every $x_i=0,$ which might not be practical or attractive in your application. It would help, then, to be more specific about which parameter you want to focus on. Note that, at a minimum, you must stipulate upper and lower bounds for the values of $x.$ – whuber Feb 28 '22 at 17:36
  • @whuber I'm not sure the design of the $X$ affects precision of the intercept. You just need the design of $X$ to be centered which can be a post-hoc correction in some cases. I agree with your upshot: the "best" design is to space the $X$ as (arbitrarily) widely as possible, in other words to maximize $X^TX$. – AdamO Feb 28 '22 at 18:59
  • @AdamO I didn't claim that was the unique solution for the intercept ;-). It's the *best* one, though, because it doesn't depend on the assumption that the response is a linear function of the explanatory variables (yet it's the *worst* one if you're also interested in estimating $\beta_1$ well!). You are getting to the main point, though, which is that there may be trade-offs between the optimal designs depending on what they are intended to optimize. – whuber Feb 28 '22 at 19:11
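For reference, the quantities these comments are weighing against one another are the standard OLS sampling variances for the model in the question (textbook results, stated here only to make the trade-off explicit):

$$\operatorname{Var}(\hat\beta_1)=\frac{\sigma^2}{\sum_{i=1}^{n}(x_i-\bar x)^2},\qquad \operatorname{Var}(\hat\beta_0)=\sigma^2\left(\frac{1}{n}+\frac{\bar x^2}{\sum_{i=1}^{n}(x_i-\bar x)^2}\right).$$

The confidence interval for $\hat\beta_1$ therefore shrinks as the spread $\sum_i (x_i-\bar x)^2$ grows, while the interval for $\hat\beta_0$ is additionally helped by keeping $\bar x$ near zero, which is the trade-off being discussed.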

1 Answer


First, we have to have made a good assumption: if we are assuming linearity, that assumption actually has to be correct for the rest of this answer to be correct as well. Assuming it is, then for a given homoscedastic noise variance the range of the $x_i$ should be chosen to be as broad as possible. If the x-values span only a factor of two from minimum to maximum, we are more likely to get insignificant slopes and intercepts than if they span a factor of 10, and a factor of 100 would be even better; that goes for both the slope and intercept confidence intervals.
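As a quick numerical check of that scaling claim (a sketch only: the designs below are hypothetical, with $n=20$ equally spaced points and an arbitrary noise level of $\sigma=1$):

```python
import numpy as np

# The standard error of the OLS slope is sigma / sqrt(sum((x - mean(x))^2)),
# so for equally spaced points it shrinks in proportion to the width of the x-range.
def slope_se(x, sigma=1.0):
    x = np.asarray(x, dtype=float)
    return sigma / np.sqrt(np.sum((x - x.mean()) ** 2))

n = 20
for top in (2, 10, 100):                  # x spans a factor of 2, 10, or 100
    x = np.linspace(1.0, float(top), n)   # hypothetical equally spaced design
    print(f"x in [1, {top:>3}]: SE(slope) = {slope_se(x):.4f}")
```

The slope standard error, and hence the width of its confidence interval, shrinks in proportion to the spread of the design, so widening the x-range is what narrows the slope interval.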

Below is an example with two cases. In the first case, we have $n=19$ x-values equally spaced on $[8.2, 11.8]$:

*[figure: data and fitted regression line, first case]*

And in the second case, $n=19$ x-values equally spaced on $[1, 19]$:

*[figure: data and fitted regression line, second case]*

The random draws are identical in each case, and the number of points is identical: the same probabilities were used to generate $\mathcal{N}(0,1)$ errors, and those identical errors were added to the x-values to make $y = x + \text{error}$.

Notice that in the first case R$^2$ is only 0.80, and the 95% confidence intervals are $-5.45$ to $0.770$ for the intercept and $0.910$ to $1.528$ for the slope. In the second case R$^2$ is 0.99, and the 95% confidence intervals are $-1.29$ to $0.117$ for the intercept and $0.982$ to $1.106$ for the slope. That is, for a 5-fold increase in the x-value range, the 95% confidence interval has narrowed by exactly a factor of 5 for the slope and by a factor of 4.411 for the intercept.
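The comparison can be reproduced in a few lines. This is only a sketch of the setup described above, not the code behind the plots (the random draws and software differ, so the printed numbers will not match exactly), but it uses the same recipe: $n=19$, identical $\mathcal{N}(0,1)$ errors in both cases, $y = x + \text{error}$, and x-values equally spaced on $[8.2, 11.8]$ versus $[1, 19]$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)   # arbitrary seed; the plots above used different draws
n = 19
eps = rng.standard_normal(n)     # the SAME N(0,1) errors are reused in both designs

def fit(x, eps):
    """OLS fit of y = x + eps; returns R^2 and 95% CIs for intercept and slope."""
    y = x + eps
    res = stats.linregress(x, y)
    t = stats.t.ppf(0.975, df=len(x) - 2)
    ci_b0 = (res.intercept - t * res.intercept_stderr, res.intercept + t * res.intercept_stderr)
    ci_b1 = (res.slope - t * res.stderr, res.slope + t * res.stderr)
    return res.rvalue**2, ci_b0, ci_b1

for lo, hi in [(8.2, 11.8), (1.0, 19.0)]:
    r2, ci_b0, ci_b1 = fit(np.linspace(lo, hi, n), eps)
    print(f"x in [{lo:>4}, {hi:>4}]: R^2 = {r2:.2f}, "
          f"intercept CI = ({ci_b0[0]:+.3f}, {ci_b0[1]:+.3f}), "
          f"slope CI = ({ci_b1[0]:.3f}, {ci_b1[1]:.3f})")
```

Because the two designs are related by a linear rescaling and the same errors are reused, the fitted residuals (and hence $\hat\sigma$) are identical in both cases, and the slope interval narrows by exactly the ratio of the ranges, $18/3.6 = 5$.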

Here are the residuals for the first case:

*[figure: residual plot, first case]*

And now for the second case:

*[figure: residual plot, second case]*

Note that the only thing that has changed is the scale of the x-axis.

Now, if the model is wrong there is no such guarantee: for example, if we are using a linear model to approximate a response that is actually quadratic in $x$, a larger range of x-values could produce less certainty in the linear slope and intercept. Also, if the noise is not homoscedastic, then a transformation of variables should be undertaken first, and an appropriate model should then be selected for the transformed variables. Often, when the transformed data are more nearly homoscedastic, the model is also better behaved in terms of goodness of fit, because the data may then be better linearized, although this depends on the context.
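As a minimal sketch of the transformation point (the data here are hypothetical, with multiplicative lognormal noise assumed as one common source of heteroscedasticity; a log transform then makes the noise roughly constant-variance before fitting):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = np.linspace(1.0, 19.0, 19)

# Multiplicative noise: the spread of y grows with its mean, so residuals
# from a straight-line fit on the raw scale are heteroscedastic.
y = 2.0 * x * np.exp(rng.normal(0.0, 0.2, size=x.size))

raw = stats.linregress(x, y)                      # heteroscedastic on the raw scale
logged = stats.linregress(np.log(x), np.log(y))   # roughly homoscedastic after the transform

print(f"raw scale:  slope = {raw.slope:.3f} +/- {raw.stderr:.3f}")
print(f"log scale:  slope = {logged.slope:.3f} +/- {logged.stderr:.3f}  (power-law exponent)")
```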

Here is an example of the effect of reducing heteroscedasticity from the literature: *An improved method for determining renal sufficiency using volume of distribution and weight from bolus 99mTc-DTPA, two blood sample, paediatric data*.

*[figure from the cited paper]*

Carl
  • This seems to miss the thrust of the question, which asks for "the best possible" experimental design. See https://stats.stackexchange.com/questions/564256 for a solution for the slope estimate. – whuber Mar 01 '22 at 22:10
  • @whuber I answered the question "at which x should I collect data?", where I took x to mean the range of x-values, and for which there is an answer. You answered a different question, insofar as I can tell. It is unclear to me what the OP actually wants, but it is perfectly clear that insignificant results from truncated-range independent variables are one of the single most important error opportunities in statistics. – Carl Mar 01 '22 at 23:19
  • Once more: when you're not sure what the question is, *please ask the OP for clarification* rather than venturing to post something that might or might not be relevant. – whuber Mar 02 '22 at 14:58
  • @whuber In response to your question, that had already been answered in the comments *prior* to my post, as follows: "I agree with your upshot: the "best" design is to space the $X$ as (arbitrarily) widely as possible." All I did was document that with an example. I also looked at the answer you provide [elsewhere](https://stats.stackexchange.com/questions/564256). I admire your prowess with equations, but sometimes examples are better: they help visualization, are more concrete, and aid understanding and retention. – Carl Mar 02 '22 at 20:15
  • I don't get it, because I don't see *any* example of that sort in your answer. The optimal design to estimate the slope, for instance, places about half the explanatory values as far to the left as feasible and the remainder as far to the right as feasible. Your graphics appear to show the values uniformly spaced. That does not achieve the smallest estimation variance of the slope. – whuber Mar 02 '22 at 20:19
  • @whuber I spaced the values at equal distances on the x-axis, and randomly only in the y-direction, to ensure that the values obtained for the slope were unbiased. Do otherwise, and one can introduce slope bias from omitted-variable bias. There are perhaps counter-examples for which a regression in y alone is unbiased, or perhaps one can do a bivariate x and y Deming regression, but that would have wider confidence intervals as the cost of obtaining unbiased results. The situation in which data values are clumped at the ends might occur for certain beta-distributed x-values with both parameters < 1, – Carl Mar 03 '22 at 03:50
  • (cont'd) an unusual circumstance, and one hard to differentiate from spurious values arising from outliers. Granted, I considered a narrower case than you did, but my aim was to use the algorithms exactly. – Carl Mar 03 '22 at 03:53
  • (cont'd) I considered the case in which one has a choice of range for collecting data, a very common one in medicine. For example, if one does regression only for adults weighing 70 kg, the results do not apply to neonates weighing 3 kg. To get significant results one needs a wide range of patient sizes, and a formula derived only from neonates and obese patients weighing > 150 kg would not allow one to make statements about 60 kg teenagers, because linearity, or otherwise the curve shape, would be unexplored. So I am not saying that your answer is not useful sometimes, but mine is often useful. – Carl Mar 03 '22 at 04:33
  • @whuber Well, between us, we have covered both interpretations of the question. I think it is irrelevant which one the OP was asking in the sense that both answers have value, and at this point it is clear as mud which one was intended. – Carl Mar 03 '22 at 04:49
  • I stopped reading halfway through the first of this series of comments because it's simply incorrect and trying to fix all the misconceptions would require far more time and effort than available. – whuber Mar 03 '22 at 13:38
  • @whuber Be nice, please. – Carl Mar 04 '22 at 09:38