What if we start with a simpler example? The best place to start is a binary distribution.
CLT for Binary Variables
Suppose that $X$ takes two values, $\{\mu-\sigma,\mu+\sigma\}$. Think of this as a coin toss between a low and a high value:
$$ \mathbb{P}(X = \mu - \sigma) = \frac{1}{2} $$
$$ \mathbb{P}(X = \mu + \sigma) = \frac{1}{2} $$
It is straightforward to check that $X$ has mean $\mu$ and variance $\sigma^2$.
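For completeness, the check can be written out directly from the two probabilities above, using the same symbols $\mu$ and $\sigma$:

$$ \mathbb{E}[X] = \tfrac{1}{2}(\mu - \sigma) + \tfrac{1}{2}(\mu + \sigma) = \mu $$
$$ \mathrm{Var}(X) = \mathbb{E}\left[(X - \mu)^2\right] = \tfrac{1}{2}(-\sigma)^2 + \tfrac{1}{2}(\sigma)^2 = \sigma^2 $$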
The simplest version of the central limit theorem says that if $\{X_1,\ldots,X_n\}$ are i.i.d. with mean $\mu$ and variance $\sigma^2 < \infty$, then the sample average $\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$ satisfies the following property as $n \to \infty$:
$$ \frac{\sqrt{n}(\bar{X} - \mu)}{\sigma} \xrightarrow{d} \mathcal{N}(0,1) $$
An intuitive explanation
Why don't we break down the left-hand side in our example?
$$ \frac{\sqrt{n}(\bar{X} - \mu)}{\sigma} = \frac{1}{\sqrt{n}} \sum_{i=1}^n \frac{X_i - \mu}{\sigma} $$
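The normalized variable $Z_i = \frac{X_i - \mu}{\sigma}$ is itself a binary variable that takes the values $-1$ and $1$ with probability $\frac{1}{2}$ each. If $B$ counts the draws with $Z_i = 1$, then the remaining $n - B$ draws equal $-1$, so the sum is $B - (n - B) = 2B - n$ with $B \sim \mathrm{Binomial}(n, p=0.5)$. It follows that
$$ \sum_{i=1}^n Z_i \sim 2 \cdot \mathrm{Binomial}(n, p=0.5) - n $$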
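The binomial already has a bell-shaped curve. This comes from the fact that a long run of draws on the same side of the mean (say, many high draws of $Z_i$ in a row) is very unlikely, just as a long run of heads is in a coin toss. As $n$ gets larger, the distribution of the rescaled sum approaches the normal distribution. Dividing by $\sqrt{n}$ is important: the sum has variance $n$, so without rescaling its spread would grow without bound, whereas $\frac{1}{\sqrt{n}}\sum_{i=1}^n Z_i$ has variance $1$ for every $n$.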
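The convergence described above can be checked numerically without any simulation, since the distribution of the sum is an exact binomial. The following is a minimal sketch using only the Python standard library; the function names `standardized_sum_cdf` and `max_cdf_gap` and the grid of evaluation points are illustrative choices, not part of the original argument. It compares the exact CDF of $\frac{1}{\sqrt{n}}\sum_{i=1}^n Z_i$ with the standard normal CDF.

```python
from math import comb, sqrt
from statistics import NormalDist


def standardized_sum_cdf(n: int, z: float) -> float:
    """Exact P(S_n / sqrt(n) <= z), where S_n = 2*B - n and B ~ Binomial(n, 0.5)."""
    favourable = 0
    for k in range(n + 1):
        # k draws equal to +1 and n - k draws equal to -1 give S_n = 2k - n.
        if (2 * k - n) / sqrt(n) <= z:
            favourable += comb(n, k)
    # Each of the 2**n sign sequences is equally likely.
    return favourable / 2**n


def max_cdf_gap(n: int) -> float:
    """Largest gap to the standard normal CDF on the grid z = -3.0, -2.9, ..., 3.0."""
    normal = NormalDist()  # mean 0, standard deviation 1
    grid = [i / 10 for i in range(-30, 31)]
    return max(abs(standardized_sum_cdf(n, z) - normal.cdf(z)) for z in grid)


for n in (10, 100, 1000):
    print(n, round(max_cdf_gap(n), 4))
```

The printed gap should shrink as $n$ grows. Because the sum only takes integer values, the exact CDF is a step function, so the gap cannot vanish at any finite $n$; it only gets smaller as the steps get finer.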
Main takeaways for general case
- The central limit theorem is closely related to combinations of independent events.
- Is it likely to observe a series of extreme events if observations are i.i.d.? It isn't.
- The general formulation extends this logic.
- Even if $X$ has a continuous distribution, it is still very unlikely to see a sequence of high values.
- Intuitively, counting sequences of events above and below the mean turns out to be more important than the actual values they take. To take into account the dispersion in values, you just need to center $\bar{X}$ at $\mu$ and scale by the standard deviation $\sigma$.
- Hence the most important assumption is that the events are independent (and that $\sigma^2$ is finite).
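The takeaways above can be illustrated with a small simulation. This is a sketch under stated assumptions: the helper names `standardized_mean` and `coverage`, the choice of an Exponential(1) distribution as the continuous example, and the values of `n`, `reps`, and `seed` are all illustrative. It draws i.i.d. samples, forms $\frac{\sqrt{n}(\bar{X} - \mu)}{\sigma}$, and records how often the result falls in $[-1.96, 1.96]$, an interval that holds about 95% of the standard normal distribution.

```python
import random
from math import sqrt


def standardized_mean(draw, mu, sigma, n, rng):
    """Return sqrt(n) * (sample mean - mu) / sigma for n i.i.d. draws."""
    xbar = sum(draw(rng) for _ in range(n)) / n
    return sqrt(n) * (xbar - mu) / sigma


def coverage(draw, mu, sigma, n=200, reps=2000, seed=0):
    """Fraction of standardized means that fall in [-1.96, 1.96]."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    hits = sum(
        abs(standardized_mean(draw, mu, sigma, n, rng)) <= 1.96
        for _ in range(reps)
    )
    return hits / reps


# Continuous, skewed example: Exponential(1) has mean 1 and standard deviation 1.
exp_cov = coverage(lambda rng: rng.expovariate(1.0), mu=1.0, sigma=1.0)

# Binary example with mu = 0 and sigma = 1: values -1 and +1, probability 1/2 each.
bin_cov = coverage(lambda rng: rng.choice((-1.0, 1.0)), mu=0.0, sigma=1.0)

print("exponential:", exp_cov)
print("binary:", bin_cov)
```

Both fractions should land close to 0.95 even though one distribution is discrete and symmetric and the other is continuous and skewed, which is the point of the general formulation: independence and a finite variance do the work, not the shape of the individual draws.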