Intuitive explanation of convergence in distribution and convergence in probability

Question

What is the intuitive difference between a random variable converging in probability versus a random variable converging in distribution?

I've read numerous definitions and mathematical equations, but that does not really help. (Please keep in mind, I am undergraduate student studying econometrics.)

How can a random variable converge to a single number, but also converge to a distribution?

"How can *a* random variable converge to *a* single number but *also* converge to *a* distribution?" - I think you'd benefit from clarifying whether your confusion is that RVs in general can converge to either single numbers or to a whole distribution (less of a mystery once you realise that the "single number" is essentially a special type of distribution) or whether your confusion is how a single RV might converge to a constant according to one mode of convergence, but to a distribution according to another mode of convergence? — Silverfish, Jan 23 '15 at 22:11
Like @CloseToC I wonder if you've come across regressions where on the one hand you've been told $\hat \beta$ is "asymptotically normal" but on the other hand you have been told it converges to the true $\beta$. — Silverfish, Jan 23 '15 at 22:13

Silverfish · Answer 1 · 2015-01-26T23:03:22.053

It's not clear how much intuition a reader of this question might have about convergence of anything, let alone of random variables, so I will write as if the answer is "very little". Something that might help: rather than thinking "how can a random variable converge", ask how a sequence of random variables can converge. In other words, it's not just a single variable, but an (infinitely long!) list of variables, and ones later in the list are getting closer and closer to ... something. Perhaps a single number, perhaps an entire distribution. To develop an intuition, we need to work out what "closer and closer" means. The reason there are so many modes of convergence for random variables is that there are several types of "closeness" I might measure.

First let's recap convergence of sequences of real numbers. In $\mathbb{R}$ we can use Euclidean distance $|x-y|$ to measure how close $x$ is to $y$. Consider $x_n = \frac{n+1}{n} = 1 + \frac{1}{n}$. Then the sequence $x_1, \, x_2, \, x_3, \dots$ starts $2, \frac{3}{2}, \frac{4}{3}, \frac{5}{4}, \frac{6}{5}, \dots$ and I claim that $x_n$ converges to $1$. Clearly $x_n$ is getting closer to $1$, but it's also true that $x_n$ is getting closer to $0.9$. For instance, from the third term onwards, the terms in the sequence are a distance of $0.5$ or less from $0.9$. What matters is that they are getting arbitrarily close to $1$, but not to $0.9$. No terms in the sequence ever come within $0.05$ of $0.9$, let alone stay that close for subsequent terms. In contrast $x_{20}=1.05$ so is $0.05$ from $1$, and all subsequent terms are within $0.05$ of $1$, as shown below.

Convergence of (n+1)/n to 1

I could be stricter and demand terms get and stay within $0.001$ of $1$, and in this example I find this is true for the terms $N=1000$ and onwards. Moreover I could choose any fixed threshold of closeness $\epsilon$, no matter how strict (except for $\epsilon = 0$, i.e. the term actually being $1$), and eventually the condition $|x_n - x| \lt \epsilon$ will be satisfied for all terms beyond a certain term (symbolically: for $n \gt N$, where the value of $N$ depends on how strict an $\epsilon$ I chose). For more sophisticated examples, note that I'm not necessarily interested in the first time that the condition is met - the next term might not obey the condition, and that's fine, so long as I can find a term further along the sequence for which the condition is met and stays met for all later terms. I illustrate this for $x_n = 1 + \frac{\sin(n)}{n}$, which also converges to $1$, with $\epsilon=0.05$ shaded again.

Convergence of 1 + sin(n)/n to 1

Now consider $X \sim U(0,1)$ and the sequence of random variables $X_n = \left(1 + \frac{1}{n}\right) X$. This is a sequence of RVs with $X_1 = 2X$, $X_2 = \frac{3}{2} X$, $X_3 = \frac{4}{3} X$ and so on. In what senses can we say this is getting closer to $X$ itself?

Since $X_n$ and $X$ are distributions, not just single numbers, the condition $|X_n - X| \lt \epsilon$ is now an event: even for a fixed $n$ and $\epsilon$ this might or might not occur. Considering the probability of it being met gives rise to convergence in probability. For $X_n \overset{p}{\to} X$ we want the complementary probability $P(|X_n - X| \ge \epsilon)$ - intuitively, the probability that $X_n$ is somewhat different (by at least $\epsilon$) to $X$ - to become arbitrarily small, for sufficiently large $n$. For a fixed $\epsilon$ this gives rise to a whole sequence of probabilities, $P(|X_1 - X| \ge \epsilon)$, $P(|X_2 - X| \ge \epsilon)$, $P(|X_3 - X| \ge \epsilon)$, $\dots$ and if this sequence of probabilities converges to zero (as happens in our example) then we say $X_n$ converges in probability to $X$. Note that probability limits are often constants: for instance in regressions in econometrics, we see $\text{plim}(\hat \beta) = \beta$ as we increase the sample size $n$. But here $\text{plim}(X_n) = X \sim U(0,1)$. Effectively, convergence in probability means that it's unlikely that $X_n$ and $X$ will differ by much on a particular realisation - and I can make the probability of $X_n$ and $X$ being further than $\epsilon$ apart as small as I like, so long as I pick a sufficiently large $n$.

A different sense in which $X_n$ becomes closer to $X$ is that their distributions look more and more alike. I can measure this by comparing their CDFs. In particular, pick some $x$ at which $F_X(x) = P(X \leq x)$ is continuous (in our example $X \sim U(0,1)$ so its CDF is continuous everywhere and any $x$ will do) and evaluate the CDFs of the sequence of $X_n$s there. This produces another sequence of probabilities, $P(X_1 \leq x)$, $P(X_2 \leq x)$, $P(X_3 \leq x)$, $\dots$ and this sequence converges to $P(X \leq x)$. The CDFs evaluated at $x$ for each of the $X_n$ become arbitrarily close to the CDF of $X$ evaluated at $x$. If this result holds true regardless of which $x$ we picked, then $X_n$ converges to $X$ in distribution. It turns out this happens here, and we should not be surprised since convergence in probability to $X$ implies convergence in distribution to $X$. Note that it can't be the case that $X_n$ converges in probability to a particular non-degenerate distribution, but converges in distribution to a constant. (Which was possibly the point of confusion in the original question? But note a clarification later.)

For a different example, let $Y_n \sim U(1, \frac{n+1}{n})$. We now have a sequence of RVs, $Y_1 \sim U(1,2)$, $Y_2 \sim U(1,\frac{3}{2})$, $Y_3 \sim U(1,\frac{4}{3})$, $\dots$ and it is clear that the probability distribution is degenerating to a spike at $y=1$. Now consider the degenerate distribution $Y=1$, by which I mean $P(Y=1)=1$. It is easy to see that for any $\epsilon \gt 0$, the sequence $P(|Y_n - Y| \ge \epsilon)$ converges to zero so that $Y_n$ converges to $Y$ in probability. As a consequence, $Y_n$ must also converge to $Y$ in distribution, which we can confirm by considering the CDFs. Since the CDF $F_Y(y)$ of $Y$ is discontinuous at $y=1$ we need not consider the CDFs evaluated at that value, but for the CDFs evaluated at any other $y$ we can see that the sequence $P(Y_1 \leq y)$, $P(Y_2 \leq y)$, $P(Y_3 \leq y)$, $\dots$ converges to $P(Y \leq y)$ which is zero for $y \lt 1$ and one for $y \gt 1$. This time, because the sequence of RVs converged in probability to a constant, it converged in distribution to a constant also.

Some final clarifications:

Although convergence in probability implies convergence in distribution, the converse is false in general. Just because two variables have the same distribution, doesn't mean they have to be likely to be to close to each other. For a trivial example, take $X\sim\text{Bernouilli}(0.5)$ and $Y=1-X$. Then $X$ and $Y$ both have exactly the same distribution (a 50% chance each of being zero or one) and the sequence $X_n=X$ i.e. the sequence going $X,X,X,X,\dots$ trivially converges in distribution to $Y$ (the CDF at any position in the sequence is the same as the CDF of $Y$). But $Y$ and $X$ are always one apart, so $P(|X_n - Y| \ge 0.5)=1$ so does not tend to zero, so $X_n$ does not converge to $Y$ in probability. However, if there is convergence in distribution to a constant, then that implies convergence in probability to that constant (intuitively, further in the sequence it will become unlikely to be far from that constant).
As my examples make clear, convergence in probability can be to a constant but doesn't have to be; convergence in distribution might also be to a constant. It isn't possible to converge in probability to a constant but converge in distribution to a particular non-degenerate distribution, or vice versa.
Is it possible you've seen an example where, for instance, you were told a sequence $X_n$ converged another sequence $Y_n$? You may not have realised it was a sequence, but the give-away would be if it was a distribution that also depended on $n$. It might be that both sequences converge to a constant (i.e. degenerate distribution). Your question suggests you're wondering how a particular sequence of RVs could converge both to a constant and to a distribution; I wonder if this is the scenario you're describing.
My current explanation is not very "intuitive" - I was intending to make the intuition graphical, but haven't had time to add the graphs for the RVs yet.

An incredibly comprehensive and lucid explanation. Thank you. — ColorStatistics, Apr 03 '20 at 16:17
After reading this topic from dozens of sources, I finally understood it. Thank you very much, sir. — FCardelle, Aug 16 '21 at 18:16

score 34 · Accepted Answer · edited Mar 15 '18 at 02:25

How can a random number converge to a constant?

Let's say you have $N$ balls in the box. You can pick them one by one. After you picked $k$ balls, I ask you: what's the mean weight of the balls in the box? Your best answer would be $\bar x_k=\frac{1}{k}\sum_{i=1}^kx_i$. You realize that $\bar x_k$ itself is the random value? It depends on which $k$ balls you picked first.

Now, if you keep pulling the balls, at some point there'll be no balls left in the box, and you'll get $\bar x_N\equiv\mu$.

So, what we've got is the random sequence $$\bar x_1,\dots,\bar x_k, \dots, \bar x_N ,\bar x_N, \bar x_N, \dots $$ which converges to the constant $\bar x_N = \mu$. So, the key to understanding your issue with convergence in probability is realizing that we're talking about a sequence of random variables, constructed in a certain way.

Next, let's get uniform random numbers $e_1,e_2,\dots$, where $e_i\in [0,1]$. Let's look at the random sequence $\xi_1,\xi_2,\dots$, where $\xi_k=\frac{1}{\sqrt{\frac{k}{12}}}\sum_{i=1}^k \left(e_i- \frac{1}{2} \right)$. The $\xi_k$ is a random value, because all its terms are random values. We can't predict what is $\xi_k$ going to be. However, it turns out that we can claim that the probability distributions of $\xi_k$ will look more and more like the standard normal $\mathcal{N}(0,1)$. That's how the distributions converge.

What's the sequence of random variables in your first example after you reach N? How is the limit evaluated? — ekvall, Jan 24 '15 at 03:05
It's just an intuition. Imagine the infinite box, so, your estimator $\bar x_\infty$ converges to the population mean $\mu$. — Aksakal, Jan 24 '15 at 03:11

ekvall · Answer 3 · 2018-03-15T14:40:44.327

In my mind, the existing answers all convey useful points, but they do not make an important distinction clear between the two modes of convergence.

Let $X_n$, $n=1,2,\dots$, and $Y$ be random variables. For intuition, imagine $X_n$ are assigned their values by some random experiment that changes a little bit for each $n$, giving an infinite sequence of random variables, and suppose $Y$ gets its value assigned by some other random experiment.

If $X_n\overset{p}{\to}Y$, we have, by definition, that the probability of $Y$ and $X_n$ differing from each other by some arbitrarily small amount approaches zero as $n\to\infty$, for as small amount as you like. Loosely speaking, far out in the sequence of $X_n$, we are confident $X_n$ and $Y$ will take values very close to each other.

On the other hand, if we only have convergence in distribution and not convergence in probability, then we know that for large $n$, $P(X_n\leq x)$ is almost the same as $P(Y\leq x)$, for almost any $x$. Note that this does not say anything about how close the values of $X_n$ and $Y$ are to each other. For example, if $Y\sim N(0, 10^{10})$, and thus $X_n$ is also distributed pretty much like this for large $n$, then it seems intuitively likely that the values of $X_n$ and $Y$ will differ by quite a lot in any given observation. After all, if there is no restriction on them other than convergence in distribution, they may very well for all practical reasons be independent $N(0,10^{10})$ variables.

(In some cases it may not even make sense to compare $X_n$ and $Y$, maybe they're not even defined on the same probability space. This is a more technical note, though.)

(+1) You don't even need the $X_n$ to vary - I was going to add some detail on this to my answer but decided against it on length grounds. But I think it is a point worth making. — Silverfish, Jan 25 '15 at 16:40

score 13 · Answer 4 · answered Jan 23 '15 at 20:45

What I don't understand is how can a random variable converge to a single number but also converge to a distribution?

If you're learning econometrics, you're probably wondering about this in the context of a regression model. It converges to a degenerate distribution, to a constant. But something else does have a non-degenerate limiting distribution.

$\hat{\beta}_n$ converges in probability to $\beta$ if the necessary assumptions are met. This means that by choosing a large enough sample size $N$, the estimator will be as close as we want to the true parameter, with the probability of it being farther away as small as we want. If you think of plotting the histogram of $\hat{\beta}_n$ for various $n$, it will eventually be just a spike centered on $\beta$.

In what sense does $\hat{\beta}_n$ converge in distribution? It also converges to a constant. Not to a normally distributed random variable. If you compute the variance of $\hat{\beta}_n$ you see that it shrinks with $n$. So eventually it will go to zero in large enough $n$, which is why the estimator goes to a constant. What does converge to a normally distributed random variable is

$\sqrt{n}(\hat{\beta}_n - \beta)$. If you take the variance of that you'll see that it does not shrink (nor grow) with $n$. In very large samples, this will be approximately $N(0, \sigma^2)$ under standard assumptions. We can then use this approximation to approximate the distribution of $\hat{\beta}_n$ in that large sample.

But you are right that the limiting distribution of $\hat{\beta}_n$ is also a constant.

Look upon this as "looking at $\hat{\beta_n}$ with a magnifying glass", with magnification increasing with $n$ at the rate $\sqrt{n}$. — kjetil b halvorsen, May 18 '15 at 13:12

score 8 · Answer 5 · edited Mar 15 '18 at 02:23

8

Let me try to give a very short answer, using some very simple examples.

Convergence in distribution

Let $X_n \sim N\left(\frac{1}{n}, 1 \right)$, for all n, then $X_n$ converges to $X \sim N(0, 1)$ in distribution. However, the randomness in the realization of $X_n$ does not change over time. If we have to predict the value of $X_n$, the expectation of our error does not change over time.

Convergence in probability

Now, consider the random variable $Y_n$ that takes value $0$ with probability $1-\frac{1}{n}$ and $1$ otherwise. As $n$ goes to infinity, we are more and more sure that $Y_n$ will equal $0$. Hence, we say $Y_n$ converges in probability to $0$. Note that this also implies $Y_n$ converges in distribution to $0$.

edited Mar 15 '18 at 02:23

answered Jan 24 '15 at 03:02

Sven

1,021
8
10

When you say "If we have to predict the value of Xn, the expectation of our error does not change over time.", I sense you want to imply P(|X_n -X| > eps) does not converge to 0 as n goes to infinity. Is it possible to prove it does not? To test experimentally, for a large n (say 10^10), we conduct the experiment of sampling from X_n and X for a considerably large number of times and count the instances (X_n-X) > eps. The naive probability is calculated as number of times the event happens and we see it is non zero. Does that make sense ? – whisperer Feb 16 '20 at 21:50
It seems that the definition of convergence in probability is so strict, that it can be achieved by sequences of Random variables towards a constant only. The X to which X1, X2, ... Xn may converge in probability can never be a non-constant distribution. Could there be a any exceptions? – whisperer Feb 16 '20 at 21:54
Regarding your first comment: Yes P(|X_n - X| > eps) does not converge to 0, which isn't too difficult to prove (some algebra while conditioning on the value of X), though my remark is a bit stronger; no estimator independent of X_n converges to 0 error. Regarding your second comment: no that's not necessary at all. Suppose instead of taking on the (fixed) value 0 with probability 1-1/n, Y_n takes on the value X ~ N(0, 1) with probability 1-1/n. Then Y_n converges in probability to X, which is a random variable itself, so Y_n does not converge to a constant. – Sven Feb 17 '20 at 22:49

Antonello · Answer 6 · 2021-05-19T12:26:54.583

Convergence in probability to a constant: a larger and larger share of the PDF get restrained to a band around a fixed value as the sequence progress (note that nothing is said on the values whose probabilities are not yet in the band as the sequence progress)

Convergence in distribution: consider the PDF as a function whose outputs (the values of the PDF) change with the progression of the sequence, so that the function itself, let's say after 10 iteration of the sequence is quite a different beast than the function after 9 iterations, but that eventually stabilise around a given function as the sequence progress. Note that the output is still a function (PDF), not necessarily a single value. The convergence is about the function itself, not about one particular value.

If a sequence of random variable converges in probabilities to a constant, then it also converges in probabilities to a "degenerated" PDF.

Intuitive explanation of convergence in distribution and convergence in probability

6 Answers6

Convergence in distribution

Convergence in probability

Linked

Related