
I'm currently learning statistics. I'm afraid of confusing myself and taking some unjustified shortcuts.

In my mind, there are a few beliefs:

  1. As soon as a sample of 30 or more values appears, the Central Limit Theorem can be applied.

  2. Once the CLT is invoked, the distribution is taken to be Gaussian.

But is that really true every time?

  1. Does the fact that two variables each have a Gaussian distribution have something to do with your ability to find a linear regression line, if you do a bivariate analysis with these two variables?

2 Answers


As soon as a sample of 30 or more values appears, the Central Limit Theorem can be applied.

This is flat out untrue. You may have read something vaguely like this in a book but it's (demonstrably) not the case.

Here's an example where we look at the distribution of sample means, where the observations are drawn from a distribution to which the central limit theorem applies:

[Figure: histogram and jittered stripchart of 10,000 sample means, each from a sample of size n = 100.]

Here the sample size is 100. Pick any sample size you like; it's easy to find cases where the distribution of sample means looks even worse.

If you start with a very skewed distribution, sample means will also be somewhat skewed, and it may take extremely large samples before that skewness is small enough not to matter much.
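For a concrete sense of how this can happen, here is a minimal simulation sketch in Python. The parent distribution behind the figure above isn't stated in the answer; a lognormal with $\sigma = 3$ is just one assumed choice of a heavily skewed distribution with finite variance, so the CLT does apply to it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw 10,000 sample means, each from a sample of size n = 100,
# taken from a heavily skewed lognormal parent (finite variance,
# so the central limit theorem applies to it).
n, n_means = 100, 10_000
means = rng.lognormal(mean=0.0, sigma=3.0, size=(n_means, n)).mean(axis=1)

# The distribution of the means is still strongly right-skewed:
# its sample skewness is far from the 0 of a Gaussian.
z = (means - means.mean()) / means.std()
print(f"sample skewness of the means: {(z**3).mean():.1f}")
```

At $n = 100$ the skewness of the means remains far from zero; increasing $n$ shrinks it only slowly.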

Once the CLT is invoked, the distribution is taken to be Gaussian.

The distribution of the variable you have values on does not become Gaussian if you get larger samples from it. It's whatever it was when you started.

The central limit theorem relates to the distribution of standardized sample means (or sums), in the limit as $n$ goes to infinity, as long as certain conditions hold.

Even when those conditions do hold, there's no finite sample size at which you can say that the distribution of standardized sample means will be Gaussian (though it may well be approximately Gaussian at large sample sizes).
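As a hedged illustration of that slow approach, here is a small sketch using an exponential parent (an assumed choice; for it the skewness of the standardized sample mean is exactly $2/\sqrt{n}$, nonzero at every finite $n$):

```python
import numpy as np

rng = np.random.default_rng(1)

# Skewness of the standardized sample mean for an exponential(1)
# parent (mean 1, sd 1). It shrinks like 2/sqrt(n) as n grows,
# but is nonzero at every finite n.
for n in (30, 100, 1_000):
    means = rng.exponential(1.0, size=(10_000, n)).mean(axis=1)
    z = (means - 1.0) * np.sqrt(n)  # standardize the means
    skew = ((z - z.mean()) ** 3).mean() / z.std() ** 3
    print(f"n = {n:>5}: skewness ~ {skew:+.3f} (theory: {2/np.sqrt(n):.3f})")
```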

You can read statements of several of the central limit theorems on Wikipedia.

Does the fact that two variables each have a Gaussian distribution have something to do with your ability to find a linear regression line, if you do a bivariate analysis with these two variables?

I don't follow the question, sorry.

However, merely having two Gaussian variables does not imply that a linear regression relationship exists between them. A regression relationship would exist (specifically, the conditional mean of either variable would have a linear relationship with the other, i.e. $E(Y|X=x) = \alpha+\beta x$) if they were jointly Gaussian -- though possibly with slope $0$ -- not simply if they were individually Gaussian. (Again, that this doesn't hold without joint normality can readily be demonstrated; there are a number of examples already on site.)
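One such counterexample can be sketched directly (this particular construction is an assumed choice, not necessarily one of the on-site examples): take $X \sim N(0,1)$ and reflect it beyond a threshold $c$. Both variables are then marginally standard normal, but $E(Y|X=x)$ equals $x$ for $|x| \le c$ and $-x$ otherwise, which is not of the form $\alpha + \beta x$:

```python
import numpy as np

rng = np.random.default_rng(2)

# X ~ N(0,1); Y = X when |X| <= c, else -X.  The reflection preserves
# the symmetric normal density, so Y is also marginally N(0,1),
# but (X, Y) is not jointly Gaussian and E[Y | X = x] is not linear.
c = 1.0
x = rng.standard_normal(200_000)
y = np.where(np.abs(x) <= c, x, -x)

# Binned conditional means of Y given X reveal the kink at |x| = c:
for lo, hi in [(-3, -2), (-1, -0.5), (0.5, 1), (2, 3)]:
    sel = (x >= lo) & (x < hi)
    print(f"E[Y | {lo} <= X < {hi}] ~ {y[sel].mean():+.2f}")
```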

Glen_b

Typically a sum of many random variables tends to a normal/Gaussian distribution (more precisely, a standardized sum, subject to certain mathematical restrictions). In many statistical situations 30 can be considered a big number, justifying the use of the CLT.

As for linear regression: a normal distribution is implicit in the OLS (ordinary least squares) approach that you seem to be referring to. There are, however, other ways to do regression that explicitly avoid relying on the normal/Gaussian assumption.
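As a minimal sketch of one such alternative (least absolute deviations, fitted here by direct minimization; the data are made up for illustration, and this is not necessarily the method the answer had in mind):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)

# Made-up data with heavy-tailed (Student-t, df = 2) noise.
x = rng.uniform(0, 10, 200)
y = 1.0 + 2.0 * x + rng.standard_t(df=2, size=200)
X = np.column_stack([np.ones_like(x), x])

# OLS: minimizes the sum of squared residuals (closed form).
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# LAD: minimizes the sum of absolute residuals -- no Gaussian
# error model is invoked.
def lad_loss(b):
    return np.abs(y - X @ b).sum()

beta_lad = minimize(lad_loss, x0=beta_ols, method="Nelder-Mead").x

print("OLS intercept, slope:", beta_ols)
print("LAD intercept, slope:", beta_lad)
```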

Roger Vadim
  • Given there are so many misunderstandings surrounding the CLT, it pays to be careful. In particular, the sum of many random variables does not tend to any distribution at all (and *never* does when those random variables are iid and nonconstant); the *standardized* sum will tend to Normality only under strong restrictions (such as iid with finite variance); and OLS uses Normality assumptions only for testing hypotheses--but not for estimation. – whuber Nov 22 '19 at 14:49
  • The normality assumption is implicit in using squared deviations. That is, one either assumes that the data are normally distributed or neglects deviations from normality. – Roger Vadim Nov 22 '19 at 15:17
  • That's true only when one is making probabilistic inferences. The squared differences also have an interpretation in terms of a loss function--indeed, that was the original use of least-squares methods historically--and that requires no assumptions of normality. – whuber Nov 22 '19 at 16:10
  • Again, in any practical situation, what justifies the use of a quadratic loss function? One either assumes normality or it is an expansion around an extremum. – Roger Vadim Nov 22 '19 at 18:56
  • Please see any textbook on statistics based on decision theory, such as Lehmann or Kiefer. Many loss functions can be reduced to a quadratic loss, which is why quadratic loss is so commonly encountered and is so eminently practical. – whuber Nov 22 '19 at 20:29