53

Stein's Example shows that the maximum likelihood estimator of the means $\mu_1,\ldots,\mu_n$ of $n$ independent normally distributed variables with variance $1$ is inadmissible (under squared-error loss) if and only if $n\ge 3$. For a neat proof, see the first chapter of *Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction* by Bradley Efron.

This was highly surprising to me at first, but there is some intuition for why one might expect the standard estimate to be inadmissible. Most notably, if $x \sim \mathcal N(\mu, I_n)$, then $\mathbb{E}\|x\|^2 = \|\mu\|^2+n$, so the raw observation systematically overestimates the length of the mean vector, as outlined in Stein's original paper (linked below).
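For what it's worth, both facts are easy to check numerically. Below is a minimal Monte Carlo sketch of my own (not part of the original discussion); the mean vector $\mu=(1,\ldots,1)$ and the sample sizes are arbitrary choices.

```python
# Monte Carlo sketch: E||x||^2 = ||mu||^2 + n, and the James-Stein estimator
# having smaller risk than the MLE once n >= 3.  All numbers are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def risks(mu, n_sims=200_000):
    n = mu.size
    x = rng.normal(loc=mu, size=(n_sims, n))          # x ~ N(mu, I_n), one row per draw
    s = np.sum(x**2, axis=1)                          # S = ||x||^2
    js = (1 - (n - 2) / s)[:, None] * x               # James-Stein estimate of mu
    mle_risk = np.mean(np.sum((x - mu)**2, axis=1))   # risk of the MLE (should be ~ n)
    js_risk = np.mean(np.sum((js - mu)**2, axis=1))   # risk of James-Stein
    return np.mean(s), mle_risk, js_risk

for n in (3, 5, 10):
    mu = np.ones(n)                                   # arbitrary mean vector, ||mu||^2 = n
    e_norm2, mle_risk, js_risk = risks(mu)
    print(f"n={n:2d}  E||x||^2 ~ {e_norm2:6.2f}  (||mu||^2 + n = {2*n})  "
          f"MLE risk ~ {mle_risk:5.2f}  JS risk ~ {js_risk:5.2f}")
```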

My question is rather: What property of $n$-dimensional space (for $n\ge 3$) does $\mathbb{R}^2$ lack which facilitates Stein's example? Possible answers could be about the curvature of the $n$-sphere, or something completely different.

In other words, why is the MLE admissible in $\mathbb{R}^2$?


Edit 1: In response to @mpiktas's concern about whether 1.31 follows from 1.30:

$$E_\mu\left(\|z-\hat{\mu}\|^2\right)=E_\mu\left(S\left(\frac{N-2}{S}\right)^2\right)=E_\mu\left(\frac{(N-2)^2}{S}\right).$$

$$\hat{\mu}_i = \left(1-\frac{N-2}{S}\right)z_i,$$ so $$E_\mu\left(\frac{\partial\hat{\mu}_i}{\partial z_i} \right)=E_\mu\left( 1-\frac{N-2}{S}+\frac{2(N-2)z_i^2}{S^2}\right).$$ Therefore we have:

$$2\sum_{i=1}^N E_\mu\left(\frac{\partial\hat{\mu}_i}{\partial z_i} \right)=2N-2E_\mu\left(\frac{N(N-2)}{S}\right)+4E_\mu\left(\frac{N-2}{S}\right)=2N-2E_\mu\left(\frac{(N-2)^2}{S}\right).$$
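Combining the two displays above via Stein's unbiased risk estimate gives $E_\mu\|\hat{\mu}-\mu\|^2 = N - E_\mu\left(\frac{(N-2)^2}{S}\right)$, which is the claimed 1.31. A quick numerical check of this identity (my own sketch; the choices of $N$ and $\mu$ are arbitrary):

```python
# Sanity check of E_mu ||mu_hat - mu||^2 = N - E_mu[(N-2)^2 / S] by simulation.
import numpy as np

rng = np.random.default_rng(1)
N, n_sims = 5, 500_000
mu = np.linspace(-1.0, 2.0, N)                     # arbitrary true mean vector

z = rng.normal(loc=mu, size=(n_sims, N))           # z ~ N(mu, I_N)
S = np.sum(z**2, axis=1)                           # S = ||z||^2
mu_hat = (1 - (N - 2) / S)[:, None] * z            # James-Stein estimate

lhs = np.mean(np.sum((mu_hat - mu)**2, axis=1))    # E_mu ||mu_hat - mu||^2
rhs = N - np.mean((N - 2)**2 / S)                  # N - E_mu[(N-2)^2 / S]
print(f"LHS ~ {lhs:.4f}   RHS ~ {rhs:.4f}")        # agree up to Monte Carlo error
```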

Edit 2: In this paper (linked in the comments below), Stein proves that the MLE is admissible for $N=2$.

amoeba
Har
  • I am not familiar with this, but I read the pdf, and some things are quite bizarre. First, we observe one $N$-vector and we want to make inference on $N$ parameters, which is a tad optimistic. Second, the MLE is simply the observation itself, so no wonder it does not have the required properties. Third, equation 1.31 does not follow from 1.30. In fact I get $E_\mu(N-2)^2/S+2E_\mu(N-2)/S-N$, which is not $N-E_\mu(N-2)^2/S$. And fourth, the last term in the last expression is positive for $N=1$ too, so the last statement after equation 1.31 in the book is ... – mpiktas Jul 26 '11 at 10:48
  • ... not strictly true. Probably I am missing something trivial, but it would be nice if you added some more context into the question, since it seems that it matters a lot. – mpiktas Jul 26 '11 at 10:50
  • It is probably related to the same subtle point: OLS is only the best **linear unbiased** estimator (BLUE), and there exist better biased or nonlinear estimators, such as the James-Stein estimator or the LASSO, that reduce the MSE further. – Dmitrij Celov Jul 26 '11 at 10:52
  • @Har, thanks, I thought I was missing something obvious; I had missed a term in the differentiation. Still, this leaves my other 3 points. – mpiktas Jul 26 '11 at 11:56
  • @Har, I think the footnote on the 5th page might give some clue. According to it, the MLE can be improved everywhere (whatever that means), and this specific example was developed later. So it seems that there is no mystery concerning dimension; it is just this specific example. Probably it is possible to think of another estimator which is specifically better than the MLE in dimension 2. – mpiktas Jul 26 '11 at 12:02
  • @mpiktas 1) Yes, the setup is purely theoretical, but still really important (for example, we can let the variables be means of i.i.d. samples). 2) Yes, although I don't know what you mean by the required properties. 3) See above. 4) True, but the expected value of $1/S$ is not defined for $N=1$ (the inverse chi-square distribution doesn't have a mean for $N=1$). – Har Jul 26 '11 at 12:03
  • Sorry, I should have explicitly mentioned that Stein proved that for $N=2$, the MLE is admissible! See http://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.bsmsp/1200501656 – Har Jul 26 '11 at 12:04
  • 4
    @mpiktas It isn't as inapplicable as it looks. The situation is similar to an ANOVA after we apply a sufficiency reduction. This hints that the usual ANOVA estimates of the group means are inadmissible provided we are trying to estimate the means of 3 or more groups (which turns out to be true). I would recommend looking at proofs that the MLE is admissible for $N = 1, 2$ and seeing where they fail when trying to extend to $N = 3$, rather than just looking at proofs that Stein's estimator does what it claims to do, which is easy once you actually have the estimator in mind. – guy Jul 26 '11 at 15:29
  • 2
    ...and know to use Stein's Lemma. I guess it's actually a little less straightforward than I thought 6 minutes ago. – guy Jul 26 '11 at 15:35
  • 2
    I agree. Do you have any good references for that (aside from the original paper)? I found Stein's original paper overly computational and was hoping that someone had developed a different method in the last fifty years. – Har Jul 26 '11 at 15:45
  • 2
    The proof that I was taught was that of Brown and Hwang from 1983 which uses a method suggested by Blyth from the early 1950's I believe. It is pretty general (more general than Stein's result in that it works for the exponential family) and, I believe, quite different from Stein. But it isn't trivial. – guy Jul 26 '11 at 17:57
  • Just to add that $\mathbb{E}\|x\|^2\ge \|\mu\|^2$ holds independently of the normal distribution, because $\|x\|^{2}=\sum_i x_i^2$ and we have $E(x_i^2)=\mathrm{var}(x_i)+\mu_i^2\geq\mu_i^2$. The result follows, as long as we assume the first and second moments exist and are finite (and real). – probabilityislogic Jul 27 '11 at 17:18
  • @probabilityislogic: One only needs the first moment to exist for this result to hold since $\mathbb{E} X^2 \geq (\mathbb{E}X)^2$ by Jensen's inequality. – cardinal Jul 27 '11 at 19:55
  • @probabilityislogic and cardinal Would you guys mind elaborating? :) I'm curious but I don't see exactly what you mean. – Har Jul 27 '11 at 21:20
  • cardinal - of course. @Har - I was just pointing out that you don't require many assumptions to have $\mathbb{E}\|x\|^2\ge \|\mu\|^2$ satisfied. So if this condition is the main condition for Stein's phenomenon/paradox to apply (which the wording appears to be hinting at), then it is much more generally true than when $x$ is normally distributed. So the condition $x\sim N(\mu,1)$ could be replaced with the condition $E(|x_i|)\dots$ – probabilityislogic Jul 27 '11 at 23:17
  • As you point out, it is of course not the main condition, only one of the reasons one might suspect that shrinking the estimate is a good idea. In Stein's original paper, he takes this as the starting point for the intuitive discussion and shows that the problem gets even worse in higher dimensions. I'll update the text accordingly. – Har Jul 28 '11 at 13:19
  • 1
    @prob: It's a very minor point, but, just to clarify, as I originally stated: The mean need only *exist*, it need not be finite. :) – cardinal Jul 28 '11 at 13:34
  • 2
    @Har great question! (+1) – suncoolsu Jul 28 '11 at 18:46
  • I will try to post an answer a little bit later, though I suspect it's not exactly what you are looking for. – cardinal Jul 28 '11 at 21:36
  • That'd be great. I'll try to share my thoughts as well, which are not exactly enlightened, but at least somewhat clearer than a few days ago. – Har Jul 29 '11 at 07:36

2 Answers

49

The dichotomy between the cases $d < 3$ and $d \geq 3$ for the admissibility of the MLE of the mean of a $d$-dimensional multivariate normal random variable is certainly shocking.

There is another very famous example in probability and statistics in which there is a dichotomy between the $d < 3$ and $d \geq 3$ cases. This is the recurrence of a simple random walk on the lattice $\mathbb{Z}^d$. That is, the $d$-dimensional simple random walk is recurrent in 1 or 2 dimensions, but is transient in $d \geq 3$ dimensions. The same dichotomy holds for the continuous-time analogue, Brownian motion.
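To make the dichotomy tangible, here is a crude finite-horizon simulation of my own (not part of the original answer). Recurrence is of course an infinite-time statement, so the 5000-step horizon and the walk count are arbitrary, but the gap between $d \le 2$ and $d = 3$ is already visible.

```python
# Estimate the probability that a simple random walk on Z^d revisits the origin
# within a fixed number of steps, for d = 1, 2, 3.
import numpy as np

rng = np.random.default_rng(2)

def return_probability(d, n_steps=5_000, n_walks=2_000):
    returned = 0
    for _ in range(n_walks):
        axes = rng.integers(0, d, size=n_steps)        # which coordinate moves at each step
        signs = rng.choice([-1, 1], size=n_steps)      # direction of each move
        steps = np.zeros((n_steps, d), dtype=int)
        steps[np.arange(n_steps), axes] = signs
        path = np.cumsum(steps, axis=0)                # position after each step
        if np.any(np.all(path == 0, axis=1)):          # ever back at the origin?
            returned += 1
    return returned / n_walks

for d in (1, 2, 3):
    print(f"d={d}: estimated return probability within 5000 steps ~ {return_probability(d):.2f}")
```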

It turns out that the two are closely related.

Larry Brown proved that the two questions are essentially equivalent. That is, the best invariant estimator $\hat{\mu} \equiv \hat{\mu}(X) = X$ of a $d$-dimensional multivariate normal mean vector is admissible if and only if the $d$-dimensional Brownian motion is recurrent.

In fact, his results go much further. For any sensible (i.e., generalized Bayes) estimator $\tilde{\mu} \equiv \tilde{\mu}(X)$ with bounded (generalized) $L_2$ risk, there is an explicit(!) corresponding $d$-dimensional diffusion such that the estimator $\tilde{\mu}$ is admissible if and only if its corresponding diffusion is recurrent.

The local mean of this diffusion is essentially the discrepancy between the two estimators, i.e., $\tilde{\mu} - \hat{\mu}$, and the covariance of the diffusion is $2 I$. From this, it is easy to see that in the case of the MLE, $\tilde{\mu} = \hat{\mu} = X$, we recover (rescaled) Brownian motion.

So, in some sense, we can view the question of admissibility through the lens of stochastic processes and use well-studied properties of diffusions to arrive at the desired conclusions.

References

  1. L. Brown (1971). Admissible estimators, recurrent diffusions, and insoluble boundary value problems. Ann. Math. Stat., vol. 42, no. 3, pp. 855–903.
  2. R. N. Bhattacharya (1978). Criteria for recurrence and existence of invariant measures for multidimensional diffusions. Ann. Prob., vol. 6, no. 4, 541–553.
cardinal
  • 24,973
  • 8
  • 94
  • 128
  • 3
    Actually, something like this is what I was hoping for: a connection to another field of mathematics (be it differential geometry or stochastic processes) which shows that the admissibility for $n=2$ was not just a fluke. Great answer! – Har Jul 30 '11 at 15:34
  • Inspired by your answer, I provided some details and also added a geometric explanation in response to this problem on MO: http://mathoverflow.net/questions/93745/the-james-stein-estimator-counterintuitive-estimation-of-the-mean-what-means – Henry.L Apr 11 '17 at 20:08
30

@cardinal gave a great answer (+1), but the whole issue remains mysterious unless one is familiar with the proofs (and I am not). So I think the question remains: what is an intuitive reason that Stein's paradox does not appear in $\mathbb R$ and $\mathbb R^2$?

I find the regression perspective offered in Stephen Stigler, 1990, *A Galtonian Perspective on Shrinkage Estimators*, very helpful. Consider independent measurements $X_i$, each measuring some underlying (unobserved) $\theta_i$ and sampled from $\mathcal N(\theta_i, 1)$. If we somehow knew the $\theta_i$, we could make a scatter plot of $(X_i, \theta_i)$ pairs:

[Figure: Stein's paradox from a regression perspective: scatter plot of $(X_i, \theta_i)$ pairs with the diagonal $\theta = X$ and the dashed regression line of $\theta$ on $X$]

The diagonal line $\theta = X$ corresponds to zero noise and perfect estimation; in reality the noise is non-zero, so the points are displaced from the diagonal in the horizontal direction. Correspondingly, $\theta = X$ can be seen as the regression line of $X$ on $\theta$. We, however, know $X$ and want to estimate $\theta$, so we should instead consider the regression line of $\theta$ on $X$, which has a different slope, tilted toward the horizontal, as shown in the figure (dashed line).

Quoting from Stigler's paper:

This Galtonian perspective on the Stein paradox renders it nearly transparent. The "ordinary" estimators $\hat \theta_i^0 = X_i$ are derived from the theoretical regression line of $X$ on $\theta$. That line would be useful if our goal were to predict $X$ from $\theta$, but our problem is the reverse, namely to predict $\theta$ from $X$ using the sum of squared errors $\sum (\theta_i - \hat \theta_i)^2$ as a criterion. For that criterion, the optimum linear estimators are given by the least squares regression line of $\theta$ on $X$, and the James-Stein and Efron-Morris estimators are themselves estimators of that optimum linear estimator. The "ordinary" estimators are derived from the wrong regression line, the James-Stein and Efron-Morris estimators are derived from approximations to the right regression line.

And now comes the crucial bit (emphasis added):

We can even see why $k\ge 3$ is necessary: if $k=1$ or $2$, *the least squares line of $\theta$ on $X$ must pass through the points $(X_i, \theta_i)$*, and hence for $k=1$ or $2$, the two regression lines (of $X$ on $\theta$ and of $\theta$ on $X$) must agree at each $X_i$.

I think this makes it very clear what is special about $k=1$ and $k=2$.
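As a small illustration of this regression picture (my own sketch, not Stigler's), one can simulate a single draw and compare the least-squares slope of $\theta$ on $X$ (for a line through the origin, with $\theta$ treated as known) with the James-Stein shrinkage factor, which estimates that slope from $X$ alone. With $k=10$ the two are typically close, while the "ordinary" estimator corresponds to slope $1$.

```python
# Compare the (theta known) least-squares slope of theta on X with the
# James-Stein shrinkage factor computed from X alone.  All values are one
# random draw; theta here is an arbitrary hypothetical set of true means.
import numpy as np

rng = np.random.default_rng(3)
k = 10
theta = rng.normal(0, 2, size=k)                 # hypothetical "true" means
X = rng.normal(theta, 1)                         # one noisy observation per theta_i

ls_slope = np.sum(theta * X) / np.sum(X**2)      # LS slope of theta on X, line through origin
js_factor = 1 - (k - 2) / np.sum(X**2)           # James-Stein shrinkage factor

print(f"least-squares slope of theta on X : {ls_slope:.3f}")
print(f"James-Stein shrinkage factor      : {js_factor:.3f}")
print("the 'ordinary' estimator uses slope 1.0 (the line theta = X)")
```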

amoeba