From my course notes, I understand that when working with a quantitative variable, we can standardize the sample mean so that it has a normal distribution (by the central limit theorem) as long as the sample size is "large". That is, the distribution of the standardized sample mean is normal (whether we work with $\sigma$ or $s$) as long as the sample size is "large":

$$\frac{\overline{X} - \mu}{\frac{\sigma}{\sqrt{n}}} \sim N(0,1),$$

$$\frac{\overline{X} - \mu}{\frac{s}{\sqrt{n}}} \sim N(0,1).$$

If the sample size is "small", then we will have a t-distribution:

$$\frac{\overline{X} - \mu}{\frac{s}{\sqrt{n}}} \sim t_{n-1}.$$
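
To make this concrete, here is a minimal simulation sketch (my own, using numpy and scipy with arbitrary parameters, not from the notes), showing that for a small normal sample the $s$-normalized mean follows $t_{n-1}$ rather than $N(0,1)$:

```python
# Sketch: draw many small samples, standardize the mean with s, and compare
# the empirical tail quantile against t_{n-1} and N(0,1).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)           # arbitrary seed
n, reps, mu, sigma = 5, 100_000, 0.0, 1.0

samples = rng.normal(mu, sigma, size=(reps, n))
xbar = samples.mean(axis=1)
s = samples.std(axis=1, ddof=1)          # sample standard deviation
t_stat = (xbar - mu) / (s / np.sqrt(n))  # normalized with s, not sigma

print(np.quantile(t_stat, 0.975))        # close to the t_{n-1} quantile...
print(stats.t(n - 1).ppf(0.975))         # ~2.776
print(stats.norm.ppf(0.975))             # ~1.960, noticeably smaller
```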

However, we recently started looking at inference for linear regression, and I see the following two equations:

$$\frac{\hat{\mu}_{y|x} - \mu_{y|x}}{\sigma\sqrt{\frac{1}{n}+\frac{(x-\overline{x})^2}{\sum_i(x_i-\overline{x})^2}}} \sim N(0,1),$$

$$\frac{\hat{\mu}_{y|x} - \mu_{y|x}}{s\sqrt{\frac{1}{n}+\frac{(x-\overline{x})^2}{\sum_i(x_i-\overline{x})^2}}} \sim t_{n-2}.$$
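
For concreteness, here is a hedged simulation sketch (my own construction; the design points, coefficients, and seed are all arbitrary choices) confirming the $t_{n-2}$ reference distribution for the second equation:

```python
# Sketch: simulate a simple linear regression with a fixed design and check
# that the standardized estimate of the conditional mean at x0 follows t_{n-2}.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)           # arbitrary seed
n, reps, sigma = 8, 50_000, 1.0
x = np.linspace(0, 1, n)                 # fixed design points (arbitrary)
x0 = 0.25                                # point where we estimate mu_{y|x}
b0, b1 = 2.0, 3.0                        # true coefficients (arbitrary)
mu_yx0 = b0 + b1 * x0                    # true conditional mean at x0

sxx = np.sum((x - x.mean()) ** 2)
t_stats = np.empty(reps)
for i in range(reps):
    y = b0 + b1 * x + rng.normal(0, sigma, n)
    b1_hat = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    b0_hat = y.mean() - b1_hat * x.mean()
    resid = y - (b0_hat + b1_hat * x)
    s = np.sqrt(np.sum(resid ** 2) / (n - 2))        # residual std. error
    se = s * np.sqrt(1 / n + (x0 - x.mean()) ** 2 / sxx)
    t_stats[i] = (b0_hat + b1_hat * x0 - mu_yx0) / se

print(np.quantile(t_stats, 0.975))       # matches t_{n-2}, not N(0,1)
print(stats.t(n - 2).ppf(0.975))
print(stats.norm.ppf(0.975))
```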

I am wondering if the second equation can be normally distributed if its sample size is "large". In other words, if we have a large sample size, then can we still use the central limit theorem and show that:

$$\frac{\hat{\mu}_{y|x} - \mu_{y|x}}{s\sqrt{\frac{1}{n}+\frac{(x-\overline{x})^2}{\sum_i(x_i-\overline{x})^2}}} \sim N(0,1).$$

The course notes make it seem as though, when working with $\hat{\mu}_{y|x}$ (an estimated mean), we cannot use the central limit theorem the way we can for $\overline{x}$ (a sample mean). In the case of linear regression, it seems that only the choice between $\sigma$ and $s$ determines whether we have a normal distribution or a t-distribution, respectively.

Is this correct, and if so, why can't we apply the central limit theorem in the linear regression case?

  • Note that there is a nuance in your first two formulas. In the case of a large sample size you have **equality** $$\frac{\bar{X}-\mu}{\frac{\sigma}{\sqrt{n}}} \sim N(0,1)$$ and an **approximation** $$\frac{\bar{X}-\mu}{\frac{s}{\sqrt{n}}} \sim t_{n-1} \underset{n \to \infty}{\sim} N(0,1)$$ You can do the same for the linear regression. – Sextus Empiricus Oct 04 '17 at 11:21

1 Answer


The difference is 'type of test' and not 'sample size'

The difference between the two formulas, $\sigma$ vs. $s$, is not a matter of sample size.

The difference is whether $\sigma$ is known or estimated. The first formula normalizes with a "known" standard deviation; the second normalizes with the sample estimate of the standard deviation. The first, $\sigma$, is a constant; the second, $s$, is a random variable (with $(n-1)s^2/\sigma^2$ following a chi-square distribution).

So the difference is:

  • you use the t-distribution $\mathcal{N}(0,1)\big/\sqrt{\chi^2_{n-1}/(n-1)}$ to describe the distribution of the difference between a 'sample mean' and 'the population mean' when this difference is normalized with the sample estimate of the standard deviation (the identity below makes the ratio explicit),
  • and you use the normal distribution $\mathcal{N}(0,1)$ to describe that distribution when the difference is normalized with the standard deviation of the population.
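
To see where that ratio comes from (a standard identity, assuming normally distributed data):

$$\frac{\overline{X}-\mu}{s/\sqrt{n}} = \frac{\left(\overline{X}-\mu\right)\big/\left(\sigma/\sqrt{n}\right)}{\sqrt{\dfrac{(n-1)s^2/\sigma^2}{n-1}}} \sim \frac{\mathcal{N}(0,1)}{\sqrt{\chi^2_{n-1}/(n-1)}} = t_{n-1},$$

since $(n-1)s^2/\sigma^2 \sim \chi^2_{n-1}$, independently of $\overline{X}$.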

Note:

For large sample sizes the distribution of this chi-square denominator concentrates in a peak around 1: the variable $\frac{\chi^2_{n-1}}{n-1}$ has $$\mu_{\left(\frac{\chi^2_{n-1}}{n-1}\right)} = 1 \qquad \mathrm{and} \qquad \sigma_{\left(\frac{\chi^2_{n-1}}{n-1}\right)} = \sqrt{\frac{2}{n-1}} \underset{n \to \infty}{\longrightarrow} 0.$$ In other words, the sample estimate of the standard deviation becomes less and less variable, $$s \overset{p}{\longrightarrow} \sigma \quad (n \to \infty),$$ and the t-distribution converges to a normal distribution, $$t_{n} \overset{d}{\longrightarrow} \mathcal{N}(0,1) \quad (n \to \infty).$$
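
A quick numerical check of this convergence (my own sketch with scipy; the 97.5% level is an arbitrary choice):

```python
# Sketch: the t quantile approaches the normal quantile as n grows, because
# sqrt(2/(n-1)) -- the sd of chi2_{n-1}/(n-1) -- shrinks toward zero.
import numpy as np
from scipy import stats

for n in (5, 10, 30, 100, 1000):
    print(n,
          round(np.sqrt(2 / (n - 1)), 4),       # sd of chi2_{n-1}/(n-1)
          round(stats.t(n - 1).ppf(0.975), 4),  # t_{n-1} quantile
          round(stats.norm.ppf(0.975), 4))      # N(0,1) quantile, ~1.96
```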

So you could say that, for large sample sizes, the formulas used with the sample-estimated standard deviation approximate the formulas used with a known standard deviation.

This is a different thing from the central limit theorem, in which the mean of a sample of variables from a non-normal distribution becomes approximately normally distributed for large $n$.
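
For contrast, a sketch of the central limit theorem itself (again my own illustration; the exponential population is an arbitrary choice of non-normal distribution):

```python
# Sketch of the CLT: means of *exponential* (non-normal) samples become
# approximately normal as n grows; the skewness of the standardized mean
# shrinks toward 0, the value for a normal distribution.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)           # arbitrary seed
for n in (2, 10, 100):
    xbar = rng.exponential(1.0, size=(100_000, n)).mean(axis=1)
    z = (xbar - 1.0) / (1.0 / np.sqrt(n))  # exponential(1): mu = sigma = 1
    print(n, round(stats.skew(z), 3))      # decreases toward 0
```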


Note:

The standard deviation is often not 'really' known, but it can be 'hypothetically' known. For instance, in testing a hypothesis or in Bayesian inference you 'assume' a certain standard deviation. (In the same way, $\mu$ is not known, but you can still use it in the formula hypothetically, for instance when determining confidence intervals.)

– Sextus Empiricus