
By the (weak/strong) law of large numbers, given iid sample points $\{x_i \in \mathbb{R}^n, i=1,\ldots,N\}$ from a distribution, their sample mean $f^*(\{x_i, i=1,\ldots,N\}):=\frac{1}{N} \sum_{i=1}^N x_i$ converges to the distribution mean, in probability and almost surely respectively, as the sample size $N$ goes to infinity.
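
A quick simulation makes the convergence concrete (a minimal sketch in Python/NumPy; the standard normal distribution and the particular sample sizes are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample-mean estimates for increasing sample sizes N: by the LLN they
# should approach the true distribution mean (0 for a standard normal).
for N in [10, 100, 10_000, 1_000_000]:
    x = rng.standard_normal(N)   # iid draws from N(0, 1)
    print(N, x.mean())           # sample mean drifts toward 0 as N grows
```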

When the sample size $N$ is fixed, I wonder whether the LLN estimator $f^*$ is best in some sense. For example:

  1. its expectation is the distribution mean, so it is an unbiased estimator, and its variance is $\frac{\sigma^2}{N}$, where $\sigma^2$ is the distribution variance (a short derivation is spelled out right after this list). But is it UMVU?
  2. is there some function $l_0: \mathbb{R}^n \times \mathbb{R}^n \rightarrow [0,\infty)$ such that $f^*(\{x_i, i=1,\ldots,N\})$ solves the minimization problem $$ f^*(\{x_i, i=1,\ldots,N\}) = \operatorname{argmin}_{u \in \mathbb{R}^n} \quad \sum_{i=1}^N l_0(x_i, u)? $$

    In other words, $f^*$ is the best wrt some contrast function $l_0$ in the minimum contrast framework (cf. Section 2.1 "Basic Heuristics of Estimation" in "Mathematical statistics: basic ideas and selected topics, Volume 1" by Bickel and Doksum).

    For example, if the distribution is known/restricted to be from the family of Gaussian distributions, then the sample mean is the MLE of the distribution mean; the MLE belongs to the minimum contrast framework, with contrast function $l_0$ equal to the negative log-likelihood.

  3. is there some function $l: \mathbb{R}^n \times F \rightarrow [0,\infty)$ such that $f^*$ solves the minimization problem $$ f^* = \operatorname{argmin}_{f} \quad \operatorname{E}_{\text{iid }\{x_i, i=1,\ldots,N\} \text{ each with distribution }P } \; l(f(\{x_i, i=1,\ldots,N\}), P) $$ simultaneously for every distribution $P$ of the $x_i$ within some family $F$ of distributions?

    In other words, $f^*$ is the best wrt some loss function $l$ and some family $F$ of distributions in the decision theoretic framework (cf. Section 1.3 "The Decision Theoretic Framework" in "Mathematical statistics: basic ideas and selected topics, Volume 1" by Bickel and Doksum).

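For completeness, the two facts quoted in item 1 follow from linearity of expectation and independence of the $x_i$:

$$ \operatorname{E}[f^*] = \frac{1}{N} \sum_{i=1}^N \operatorname{E}[x_i] = \mu, \qquad \operatorname{Var}(f^*) = \frac{1}{N^2} \sum_{i=1}^N \operatorname{Var}(x_i) = \frac{\sigma^2}{N}, $$

where $\mu$ and $\sigma^2$ are the distribution mean and variance (read componentwise, or as a covariance matrix, in $\mathbb{R}^n$).
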
Note that the above are three different interpretations of a "best" estimator that I have encountered so far. If you know of other interpretations that may apply to the LLN estimator, please don't hesitate to mention them as well.

Tim
    The sample mean has many nice and interesting properties but sometimes they are not the best that one can have in a particular situation. One example is cases where the support of the distribution depends on the value of the parameter. Consider $X_1, X_2, \ldots, X_n \sim \mathcal{U}(0,\theta)$, then $\frac{1}{n} \sum_{i=1}^{n} X_i$ is an unbiased estimator of the distribution mean $\theta$ but it is not the UMVUE, for example, unbiased estimates based on the largest order statistic $\frac{n+1}{n}X_{(n)}$ will have smaller variance than the sample mean. – VitalStatistix Oct 05 '11 at 01:44
  • Thanks! But how is its variance computed? – Tim Oct 05 '11 at 01:46
  • The pdf of $Y=X_{(n)}$, the largest order statistic, is $$f(y)= \frac{ny^{n-1}}{\theta^n}, \quad y\in (0,\theta),$$ so the variance of the unbiased estimator $\frac{n+1}{n}Y$ is $Var(\frac{n+1}{n}Y)=\frac{1}{n(n+2)}\theta^2$, i.e. the variance is of order $\frac{1}{n^2}$, compared to the variance of the sample mean, which is of order $\frac{1}{n}$ (the full calculation is spelled out below these comments). – VitalStatistix Oct 05 '11 at 01:54
  • @VitalStatistix, am I completely missing something here? If the variables are uniform on $[0, \theta]$ their sample mean has expectation $\theta/2$, so don't you want to multiply by 2 to get an unbiased estimator of $\theta$? – NRH Oct 05 '11 at 06:48
  • @NRH: Yes, thanks for pointing that out. The distribution mean is $\frac{\theta}{2}$, so you have to multiply by 2 to get a UE for $\theta$; if we take the parameter to be $\frac{\theta}{2}$, then the sample mean is still a UE but the other estimator needs to be adjusted for this factor. Sorry for the error, but the variance of the sample mean will still be $O(\frac{1}{n})$ and the variance of the maximum order statistic will be $O(\frac{1}{n^2})$. – VitalStatistix Oct 05 '11 at 11:44
  • Another way to characterize an estimator: please read about consistent estimators [here](http://en.wikipedia.org/wiki/Consistent_estimator). The sample mean is consistent by the LLN. – Rohit Banga Oct 05 '11 at 01:05
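
To spell out the variance calculation referenced in the comments: with $Y = X_{(n)}$ and pdf $f(y)=ny^{n-1}/\theta^n$ on $(0,\theta)$,

$$ \operatorname{E}[Y] = \int_0^\theta y \,\frac{ny^{n-1}}{\theta^n}\,dy = \frac{n}{n+1}\theta, \qquad \operatorname{E}[Y^2] = \int_0^\theta y^2 \,\frac{ny^{n-1}}{\theta^n}\,dy = \frac{n}{n+2}\theta^2, $$

so $\operatorname{Var}(Y) = \frac{n\theta^2}{(n+1)^2(n+2)}$ and $\operatorname{Var}\!\left(\frac{n+1}{n}Y\right) = \frac{\theta^2}{n(n+2)}$, the $O(\frac{1}{n^2})$ rate quoted above.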

1 Answer


The answer to your second question is yes: the sample mean is a minimum contrast estimator with contrast function $l_0(x,u) = (x-u)^2$ when $x$ and $u$ are real numbers, or $l_0(x,u) = (x-u)'(x-u)$ when $x$ and $u$ are column vectors. This follows from least-squares theory or a short differential calculus argument, sketched below.
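
Concretely, in the vector case, setting the gradient of the summed contrast to zero gives

$$ \nabla_u \sum_{i=1}^N (x_i - u)'(x_i - u) = -2 \sum_{i=1}^N (x_i - u) = 0 \quad\Longleftrightarrow\quad u = \frac{1}{N}\sum_{i=1}^N x_i, $$

and the criterion is strictly convex in $u$, so this stationary point is the unique minimizer.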

A minimum contrast estimator is, under certain technical conditions, both consistent and asymptotically normal. For the sample mean, this already follows from the LLN and the central limit theorem. I don't know that minimum contrast estimators are "optimal" in any way. What's nice about minimum contrast estimators is that many robust estimators (e.g. the median, Huber estimators, sample quantiles) fall into this family, and we can conclude that they are consistent and asymptotically normal just by applying the general theorem for minimum contrast estimators, so long as we check some technical conditions (though often this is much more difficult than it sounds).
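
As a numerical illustration of the minimum contrast idea (a minimal sketch assuming NumPy/SciPy; the data and the two contrast functions are arbitrary choices), minimizing the summed squared contrast recovers the sample mean, while the absolute-value contrast recovers the sample median:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.exponential(size=200)  # some skewed iid sample

def min_contrast(l0, x):
    """Numerically minimize the summed contrast u -> sum_i l0(x_i, u)."""
    return minimize_scalar(lambda u: np.sum(l0(x, u))).x

mean_hat = min_contrast(lambda xi, u: (xi - u) ** 2, x)     # squared contrast
median_hat = min_contrast(lambda xi, u: np.abs(xi - u), x)  # absolute contrast

print(mean_hat, x.mean())        # essentially equal
print(median_hat, np.median(x))  # essentially equal
```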

One notion of optimality that you don't mention in your question is efficiency, which, roughly speaking, is about how large a sample you need to get an estimate of a certain quality. See http://en.wikipedia.org/wiki/Efficiency_(statistics)#Asymptotic_efficiency for a comparison of the efficiency of the mean and the median (for normal data the mean is more efficient, but the median is more robust to outliers).
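
That comparison is easy to check by simulation (a minimal sketch assuming NumPy; for normal data the asymptotic variance ratio of median to mean is $\pi/2 \approx 1.57$):

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 100, 20_000

samples = rng.standard_normal((reps, n))        # reps independent normal samples of size n
var_mean = samples.mean(axis=1).var()           # sampling variance of the mean, ~ 1/n
var_median = np.median(samples, axis=1).var()   # sampling variance of the median, ~ (pi/2)/n

print(var_mean, var_median, var_median / var_mean)  # ratio close to pi/2
```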

For the third question, without some restriction on the set of functions $f$ over which you are taking the argmin, I don't think the sample mean will be optimal. For any fixed distribution $P$, you can take $f$ to be a constant that ignores the $x_i$'s and minimizes the loss for that particular $P$; the sample mean can't beat that.
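
For instance, under squared-error loss $l(a, P) = \|a - \mu_P\|^2$ (an illustrative choice, with $\mu_P$ and $\Sigma_P$ the mean and covariance of $P$), the constant rule $f \equiv \mu_P$ has risk $0$ for that particular $P$, while the sample mean has risk

$$ \operatorname{E}_P \big\| \bar{x} - \mu_P \big\|^2 = \frac{\operatorname{tr} \Sigma_P}{N} > 0 \quad (\text{whenever } \Sigma_P \neq 0), $$

so no single $f$ can be simultaneously optimal for every $P \in F$ without further restrictions.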

Minimax optimality is a weaker condition than the one you give: instead of asking that $f^*$ be the best function for every $P$ in a class, you can ask that $f^*$ have the best worst-case performance. That is, between the argmin and the expectation, insert a $\max_{P\in F}$. Bayesian optimality is another approach: put a prior distribution on $P\in F$, and take the expectation over $P$ as well as over the sample from $P$.
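
In symbols, writing $R(f, P)$ for the expected loss of $f$ under $P$, the two criteria are

$$ f^*_{\text{minimax}} = \operatorname{argmin}_{f} \; \max_{P \in F} R(f, P), \qquad f^*_{\text{Bayes}} = \operatorname{argmin}_{f} \; \operatorname{E}_{P \sim \pi} \big[ R(f, P) \big], $$

where $\pi$ is the prior distribution placed on the family $F$.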

DavidR
  • Thanks! Are there some good references on properties of minimum contrast estimators, such as consistency and asymptotic normality, as well as examples such as the median, Huber estimators, and sample quantiles? – Tim Oct 05 '11 at 04:06
  • Section 5.2.2 of the Bickel & Doksum book you cite has a theorem on the consistency of minimum contrast estimators. Section 5.4.2 discusses asymptotic normality. Another source that I recommend, and which discusses the other estimators I mention, is van der Vaart's _Asymptotic Statistics_ book. Chapter 5 is on M-estimators, which is his name for minimum contrast estimators. – DavidR Oct 05 '11 at 04:16
  • Thanks! Is the norm in your first paragraph an arbitrary one on $\mathbb{R}^n$ or must it be $l_2$ norm? – Tim Oct 05 '11 at 04:31
  • I mean the standard Euclidean norm -- I've changed it to vector notation to clarify. – DavidR Oct 05 '11 at 04:35
  • DavidR, thanks! (1) Regarding part 3 in my post, I wonder if the sample mean, i.e. the LLN estimator, can fit into the decision theoretic framework for some loss function $l$? (2) I have the impression that all the estimators, such as the MLE and the least squares estimator, fit into the minimum contrast framework but not the decision theoretic framework. So is the decision theoretic framework not used for constructing estimators, but only for evaluating them? – Tim Oct 10 '11 at 16:49
  • @Tim: Added in some more to the answer. I think the optimality condition in your decision theoretic framework is too strong to be useful... but there are other decision theoretic frameworks that would be useful for constructing estimators. Aren't there examples in the Bickel & Doksum book? – DavidR Oct 11 '11 at 15:58
  • DavidR, thanks! To your question, are the minimax and Bayesian optimalities "examples"? – Tim Oct 11 '11 at 17:08