ADDENDUM 30-3-2017
An important clarification: none of the derivations below guarantees that $\mu$ is the "true value" we are attempting to estimate. All we show is that if $\theta_n$ converges in $L^2$ to some constant, then this constant is also its probability limit.
Whether that probability limit is the true value, and hence whether the estimator is consistent, is not proven here. So the whole derivation below presupposes that $\mu$ is, after all, the "true value".
$\newcommand{\E}{\mathbb{E}}$
Assume that we do not know whether $\mu$ is the mean, the probability limit, etc., of our estimator $\theta_n$.
We can write
$$\E[(\theta_n-\mu)^2] = \mu^2 - 2\E(\theta_n)\mu + \E(\theta_n^2)$$
which we can view as a quadratic polynomial in $\mu$. For convergence in $L^2$, a necessary condition is that this quadratic not be bounded away from zero. Being a quadratic, we can easily examine its roots.
Its discriminant is $$\Delta_{\mu} = 4[\E(\theta_n)]^2 - 4\E(\theta_n^2) = -4\text{Var}(\theta_n)$$
We want the discriminant to be greater than or equal to zero, otherwise the polynomial has no real roots. Since the variance is non-negative, this requires, at least asymptotically, that $\text{Var}(\theta_n) \to 0$. Given this, we then have asymptotically the double root
$$\mu = \lim \E(\theta_n)$$
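(For the algebra-inclined, here is a small symbolic check of the discriminant and the double root, using sympy. The symbols `m1` and `m2` are just my own shorthand for $\E(\theta_n)$ and $\E(\theta_n^2)$, not notation used above.)

```python
# A symbolic sketch: verify that the quadratic in mu has discriminant -4*Var(theta_n),
# and that it has the double root mu = E(theta_n) when the variance vanishes.
import sympy as sp

mu, m1, m2 = sp.symbols('mu m1 m2', real=True)   # m1 = E(theta_n), m2 = E(theta_n^2)

quadratic = mu**2 - 2*m1*mu + m2      # E[(theta_n - mu)^2] viewed as a quadratic in mu
a, b, c = 1, -2*m1, m2                # its coefficients
discriminant = b**2 - 4*a*c
variance = m2 - m1**2                 # Var(theta_n) = E(theta_n^2) - [E(theta_n)]^2

print(sp.simplify(discriminant + 4*variance))             # 0, i.e. discriminant = -4*Var(theta_n)
print(sp.solve(quadratic.subs(m2, m1**2), mu))            # [m1]: the double root mu = E(theta_n)
```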
So if $\E[(\theta_n-\mu)^2] \to 0$, it means that $\text{Var}(\theta_n) \to 0$ and $\lim \E(\theta_n) = \mu$.
These are sufficient conditions for consistency (sufficient but not necessary, either because the variance may not even exist, or because of situations like this one). [And again, they are sufficient for consistency if we assume from the start that $\mu$ is the true value we are trying to estimate.]
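To make the two conditions concrete, here is a toy simulation sketch (the example and all numbers are mine, not part of the argument above): the "divide by $n$" variance estimator for standard normal data is biased for finite $n$ but asymptotically unbiased, and its variance vanishes, so both conditions hold with $\mu = \sigma^2 = 1$.

```python
# theta_n = (1/n) * sum (X_i - Xbar)^2 for X_i ~ N(0,1), estimating sigma^2 = 1.
# Across Monte Carlo replications, E(theta_n) approaches 1 and Var(theta_n) shrinks.
import numpy as np

rng = np.random.default_rng(0)
reps = 5_000                              # Monte Carlo replications of the estimator

for n in (10, 100, 1000):
    x = rng.normal(size=(reps, n))
    theta_n = x.var(axis=1)               # ddof=0 by default: the biased "divide by n" estimator
    print(f"n={n:>4}  E(theta_n)~{theta_n.mean():.4f}  Var(theta_n)~{theta_n.var():.5f}")
```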
And why should these conditions be sufficient for consistency? What do they have to do with the probability statement
$$\Pr(|\theta_n -\mu| > \varepsilon) \to 0$$
Well, as another answer mentioned, this probability is tied to the variance of the distribution by Chebyshev's Inequality, so if $\mu$ is the asymptotic expected value of $\theta_n$ then
$$\Pr(|\theta_n -\mu| > \varepsilon) \leq \frac{\text{Var}(\theta_n)}{\varepsilon^2} $$
So if $\lim \E(\theta_n) = \mu$, Chebyshev's Inequality becomes applicable, and then, if $\text{Var}(\theta_n) \to 0$, the probability goes to zero.
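Here is a quick Monte Carlo sketch of this bound (again a toy example with choices of my own: $X_i \sim$ Exponential(1), $\theta_n$ the sample mean, so $\mu = 1$, and $\varepsilon = 0.1$): both the empirical tail probability and the Chebyshev bound shrink as $n$ grows, and in this run the former stays below the latter.

```python
# Compare the empirical tail probability Pr(|theta_n - mu| > eps) with the
# Chebyshev bound Var(theta_n)/eps^2 for the sample mean of Exponential(1) data.
import numpy as np

rng = np.random.default_rng(1)
mu, eps, reps = 1.0, 0.1, 10_000

for n in (10, 100, 1000):
    theta_n = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
    tail  = np.mean(np.abs(theta_n - mu) > eps)   # estimate of Pr(|theta_n - mu| > eps)
    bound = theta_n.var() / eps**2                # Var(theta_n) / eps^2 (can exceed 1 for small n)
    print(f"n={n:>4}  Pr(|theta_n - mu| > eps)~{tail:.4f}  Var(theta_n)/eps^2~{bound:.4f}")
```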
So the intuition for $L^2$-convergence being sufficient for consistency is, in my view, shifted to whether we intuitively understand Chebyshev's Inequality...
...because here too the OP's intellectual objection appears: a non-squared difference appears to be bounded by a squared difference, which "for small deviations" (smaller than unity) "is smaller". Well, the "intervening" operators (probability, expected value) have a lot to do with it, since (with $I\{\cdot\}$ denoting the indicator function),
$$\Pr(|\theta_n -\mu| > \varepsilon) =\E\left(I\{|\theta_n -\mu| > \varepsilon\}\right) $$
$$= \E\left(I\left \{\frac{(\theta_n -\mu)^2}{\varepsilon^2} >1 \right\}\right) \leq \E\left(\frac{(\theta_n -\mu)^2}{\varepsilon^2} \right)$$
...and this last inequality holds because
$$I\left \{\frac{(\theta_n -\mu)^2}{\varepsilon^2} >1 \right\} \leq \frac{(\theta_n -\mu)^2}{\varepsilon^2} $$
and it was when I saw the above and realized why this last inequality holds that I gained some intuition about Chebyshev's Inequality.
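For completeness, a tiny numerical check of that pointwise inequality (with an arbitrary $\varepsilon = 0.5$): where the deviation is small, the squared ratio may indeed fall below one, but there the indicator is zero, so the bound is never violated.

```python
# Pointwise check: the indicator I{d^2/eps^2 > 1} never exceeds d^2/eps^2,
# whatever the size of the deviation d = theta_n - mu.
import numpy as np

eps = 0.5
d = np.linspace(-2, 2, 10_001)            # candidate deviations theta_n - mu
ratio = d**2 / eps**2
indicator = (ratio > 1).astype(float)     # I{(theta_n - mu)^2 / eps^2 > 1}

# Where ratio <= 1 the indicator is 0, and where ratio > 1 it is 1 <= ratio.
print(bool(np.all(indicator <= ratio)))   # True
```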