I will try my best to use only simple formulas and stay intuitive. Consider a sequence of random variables $X_1, X_2, \ldots $, and another random variable $X$.
Convergence in distribution. For each $n$, let us take realizations (generate many samples) from $X_n$, and make a histogram. So we now have a sequence of histograms, one for each $X_n$. If the histograms in the sequence look more and more like the histogram of $X$ as $n$ progresses, we say that $X_n$ converges to $X$ in distribution (denoted by $X_n \stackrel{d}{\rightarrow} X$).
Convergence in probability. To explain convergence in probability, consider the following procedure.
For each $n$:
- Repeat many times:
  - Jointly sample $(x_n, x)$ from $(X_n, X)$.
  - Compute the absolute difference $d_n = |x_n - x|$.
- We now have many values of $d_n$. Make a histogram out of these. Call this histogram $H_n$.
If the histograms $H_1, H_2, \ldots $ become skinnier, and concentrate more and more around 0, as $n$ progresses, we say that $X_n$ converges to $X$ in probability (denoted by $X_n \stackrel{p}{\rightarrow} X$).
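The procedure above can be sketched in a few lines of numpy. As a minimal, hypothetical illustration, I couple $X = Z$, a standard normal, with $X_n = Z + 1/n$, so that every joint draw gives $d_n = 1/n$:

```python
import numpy as np

rng = np.random.default_rng(0)

def dn_mass_near_zero(n, reps=100_000, eps=0.1):
    """One round of the procedure: jointly sample (x_n, x) many times,
    compute d_n = |x_n - x|, and report how much of H_n's mass lies within eps of 0."""
    x = rng.standard_normal(reps)   # draws from X = Z
    xn = x + 1.0 / n                # hypothetical coupling: X_n = Z + 1/n
    d = np.abs(xn - x)              # d_n = 1/n for every draw
    return np.mean(d < eps)

for n in (1, 5, 50):
    print(n, dn_mass_near_zero(n))  # prints 0.0, 0.0, 1.0: H_n concentrates at 0
```

In this toy coupling each $H_n$ is a point mass at $1/n$, so "getting skinnier around 0" reduces to $1/n$ eventually dropping below the tolerance $\varepsilon$.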
Discussion. Convergence in distribution only cares about the histogram (or the distribution) of $X_n$ relative to that of $X$. As long as the histograms look more and more like the histogram of $X$, you can claim convergence in distribution. By contrast, convergence in probability also cares about how the realizations of $X_n$ and $X$ relate to each other (hence the distance $d_n$ in the procedure above).
$(\dagger)$ Convergence in probability requires that the probability that the values drawn from $X_n$ and $X$ approximately match (i.e., that $d_n = |x_n - x|$ is small) gets higher and higher as $n$ progresses.
This is a stronger condition than convergence in distribution: if the values drawn match, then in particular the histograms match. This is why convergence in probability implies convergence in distribution. The converse is not necessarily true, as can be seen in Example 1.
Example 1. Let $Z \sim \mathcal{N}(0,1)$ (the standard normal random variable). Define $X_n := Z$ and $X := -Z$. It follows that $X_n \stackrel{d}{\rightarrow} X$ because we get the same histogram (i.e., the bell-shaped histogram of the standard normal) for each $X_n$ and for $X$. However, $X_n$ does not converge in probability to $X$, because $(\dagger)$ does not hold: a joint draw gives $(x_n, x) = (z, -z)$, so $d_n = |z - (-z)| = 2|z|$, whose histogram is the same non-degenerate one for every $n$ and never concentrates around 0. The realizations of $X_n$ and $X$ essentially never match, no matter how large $n$ is. Yet, they share the same histogram!
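Example 1 is easy to check numerically. In the sketch below (numpy, seeded for reproducibility), the mass of $d_n$ near 0 is the same small number regardless of $n$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Example 1: X_n = Z and X = -Z share the same histogram,
# but d_n = |x_n - x| = 2|z| does not depend on n at all.
reps = 100_000
z = rng.standard_normal(reps)
xn, x = z, -z                 # a joint draw is the pair (z, -z)
d = np.abs(xn - x)            # = 2|z|, identical for every n
print(np.mean(d < 0.1))       # stays around 0.04 no matter how large n is
```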
Example 2. Let $Y, Z \sim \mathcal{N}(0,1)$ be independent. Define $X_n := Z + Y/n$, and $X := Z$. It turns out that $X_n \stackrel{p}{\rightarrow} X$ (and hence $X_n \stackrel{d}{\rightarrow} X$). Read the text at $(\dagger)$ again. In this case, the values drawn from $X_n$ and $X$ are almost the same: $x_n$ is just $x$ corrupted by noise drawn from $Y/n$. Since this noise gets weaker and weaker as $n$ progresses, $d_n = |y|/n$ concentrates more and more around 0, and we get convergence in probability.
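Simulating Example 2 shows the histograms $H_n$ concentrating: the fraction of $d_n$ values within $0.1$ of zero climbs toward 1 as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Example 2: X_n = Z + Y/n, X = Z, so d_n = |y|/n shrinks as n grows.
reps = 100_000
z = rng.standard_normal(reps)
y = rng.standard_normal(reps)
for n in (1, 10, 100):
    d = np.abs((z + y / n) - z)   # = |y| / n
    print(n, np.mean(d < 0.1))    # roughly 0.08, 0.68, 1.0
```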
Example 3. Let $Y, Z \sim \mathcal{N}(0,1)$ be independent. Define $X_n := Z + n Y$, and $X := Z$. Here the variance of $X_n$ is $1 + n^2$, so the histogram of $X_n$ keeps spreading out as $n$ progresses instead of settling toward the histogram of $X$. Hence $X_n$ does not converge in distribution, and therefore does not converge in probability either.
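Simulating Example 3 shows why even convergence in distribution fails: the spread of $X_n$'s histogram grows without bound rather than matching the spread of $X = Z$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Example 3: X_n = Z + n*Y has standard deviation sqrt(1 + n^2),
# so its histogram keeps widening and never looks like that of X = Z.
reps = 100_000
z = rng.standard_normal(reps)
y = rng.standard_normal(reps)
for n in (1, 10, 100):
    print(n, np.std(z + n * y))   # roughly 1.4, 10.0, 100.0
```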