I will try my best to use only simple formulas and stay intuitive. Consider a sequence of random variables $X_1, X_2, \ldots $, and another random variable $X$.
Convergence in distribution. For each $n$, let us take realizations (generate many samples) from $X_n$, and make a histogram. So we now have a sequence of histograms, one for each $X_n$. If the histograms in the sequence look more and more like the histogram of $X$ as $n$ progresses, we say that $X_n$ converges to $X$ in distribution (denoted by $X_n \stackrel{d}{\rightarrow} X$).
Convergence in probability. To explain convergence in probability, consider the following procedure.
For each $n$:
- Repeat many times:
  - Jointly sample $(x_n, x)$ from $(X_n, X)$.
  - Compute the absolute difference $d_n = |x_n - x|$.
- We now have many values of $d_n$. Make a histogram out of these. Call this histogram $H_n$.
If the histograms $H_1, H_2, \ldots $ become skinnier, and concentrate more and more around 0, as $n$ progresses, we say that $X_n$ converges to $X$ in probability (denoted by $X_n \stackrel{p}{\rightarrow} X$).
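The procedure above can be sketched in a few lines of numpy. As a minimal, hypothetical illustration, I couple $X = Z$, a standard normal, with $X_n = Z + 1/n$, so that every joint draw gives $d_n = 1/n$:

```python
import numpy as np

rng = np.random.default_rng(0)

def dn_mass_near_zero(n, reps=100_000, eps=0.1):
    """One round of the procedure: jointly sample (x_n, x) many times,
    compute d_n = |x_n - x|, and report how much of H_n's mass lies within eps of 0."""
    x = rng.standard_normal(reps)   # draws from X = Z
    xn = x + 1.0 / n                # hypothetical coupling: X_n = Z + 1/n
    d = np.abs(xn - x)              # d_n = 1/n for every draw
    return np.mean(d < eps)

for n in (1, 5, 50):
    print(n, dn_mass_near_zero(n))  # prints 0.0, 0.0, 1.0: H_n concentrates at 0
```

In this toy coupling each $H_n$ is a point mass at $1/n$, so "getting skinnier around 0" reduces to $1/n$ eventually dropping below the tolerance $\varepsilon$.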
Discussion. Convergence in distribution only cares about the histogram (or the distribution) of $X_n$ relative to that of $X$. As long as the histograms look more and more like the histogram of $X$, you can claim convergence in distribution. By contrast, convergence in probability also cares about how the realizations of $X_n$ and $X$ relate to each other (hence the distance $d_n$ in the procedure above).
$(\dagger)$ Convergence in probability requires that the probability that the values drawn from $X_n$ and $X$ approximately match (i.e., that $d_n = |x_n - x|$ is small) gets higher and higher as $n$ progresses.
This is a stronger condition than convergence in distribution: if the values drawn match, then in particular the histograms match. This is why convergence in probability implies convergence in distribution. The converse is not necessarily true, as can be seen in Example 1.
Example 1. Let $Z \sim \mathcal{N}(0,1)$ (the standard normal random variable). Define $X_n := Z$ and $X := -Z$. It follows that $X_n \stackrel{d}{\rightarrow} X$ because we get the same histogram (i.e., the bell-shaped histogram of the standard normal) for each $X_n$ and for $X$. However, $X_n$ does not converge in probability to $X$, because $(\dagger)$ does not hold: a joint draw gives $(x_n, x) = (z, -z)$, so $d_n = |z - (-z)| = 2|z|$, whose histogram is the same non-degenerate one for every $n$ and never concentrates around 0. The realizations of $X_n$ and $X$ essentially never match, no matter how large $n$ is. Yet, they share the same histogram!
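Example 1 is easy to check numerically. In the sketch below (numpy, seeded for reproducibility), the mass of $d_n$ near 0 is the same small number regardless of $n$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Example 1: X_n = Z and X = -Z share the same histogram,
# but d_n = |x_n - x| = 2|z| does not depend on n at all.
reps = 100_000
z = rng.standard_normal(reps)
xn, x = z, -z                 # a joint draw is the pair (z, -z)
d = np.abs(xn - x)            # = 2|z|, identical for every n
print(np.mean(d < 0.1))       # stays around 0.04 no matter how large n is
```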
Example 2. Let $Y, Z \sim \mathcal{N}(0,1)$ be independent. Define $X_n := Z + Y/n$, and $X := Z$. It turns out that $X_n \stackrel{p}{\rightarrow} X$ (and hence $X_n \stackrel{d}{\rightarrow} X$). Read the text at $(\dagger)$ again. In this case, the values drawn from $X_n$ and $X$ are almost the same: $x_n$ is just $x$ corrupted by noise drawn from $Y/n$. Since this noise gets weaker and weaker as $n$ progresses, $d_n = |y|/n$ concentrates more and more around 0, and we get convergence in probability.
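Simulating Example 2 shows the histograms $H_n$ concentrating: the fraction of $d_n$ values within $0.1$ of zero climbs toward 1 as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Example 2: X_n = Z + Y/n, X = Z, so d_n = |y|/n shrinks as n grows.
reps = 100_000
z = rng.standard_normal(reps)
y = rng.standard_normal(reps)
for n in (1, 10, 100):
    d = np.abs((z + y / n) - z)   # = |y| / n
    print(n, np.mean(d < 0.1))    # roughly 0.08, 0.68, 1.0
```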
Example 3. Let $Y, Z \sim \mathcal{N}(0,1)$ be independent. Define $X_n := Z + n Y$, and $X := Z$. Here the variance of $X_n$ is $1 + n^2$, so the histogram of $X_n$ keeps spreading out as $n$ progresses instead of settling toward the histogram of $X$. Hence $X_n$ does not converge in distribution, and therefore does not converge in probability either.
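Simulating Example 3 shows why even convergence in distribution fails: the spread of $X_n$'s histogram grows without bound rather than matching the spread of $X = Z$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Example 3: X_n = Z + n*Y has standard deviation sqrt(1 + n^2),
# so its histogram keeps widening and never looks like that of X = Z.
reps = 100_000
z = rng.standard_normal(reps)
y = rng.standard_normal(reps)
for n in (1, 10, 100):
    print(n, np.std(z + n * y))   # roughly 1.4, 10.0, 100.0
```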