I have an elementary question that I'm sure must be answered in textbooks somewhere, but I haven't found it yet. Why is the average the right way to estimate a parameter observed under Gaussian noise?
Let's flesh this out a bit. Here's my model. There is an unknown parameter $\theta$. We have iid random variables $X_1,\dots,X_n$, defined as $X_i = \theta + Y_i$ where $Y_i \sim \mathcal{N}(0, \sigma^2)$. In other words, the $Y_i$'s are Gaussian noise, and each $X_i$ is a noisy observation of the underlying parameter. Now, given observations of $X_1,\dots,X_n$, we want to estimate the underlying parameter $\theta$.
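To make the setup concrete, here is a minimal simulation of this model in Python with NumPy; the particular values of $\theta$, $\sigma$, and $n$ are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

theta = 2.5   # hypothetical true parameter (made up for illustration)
sigma = 1.0   # noise standard deviation
n = 100       # number of observations

# X_i = theta + Y_i, with Y_i ~ N(0, sigma^2) i.i.d.
X = theta + rng.normal(0.0, sigma, size=n)
```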
I think I remember reading that the optimal way to estimate $\theta$ is to use the average: i.e., $\hat{\theta} = \frac{1}{n} (X_1 + \dots + X_n)$. Is this true? In what sense is it optimal, and why?
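As a quick sanity check of the averaging estimator (again with made-up values), the sketch below suggests the error of $\hat{\theta}$ shrinks as $n$ grows, which is the behavior I'm asking about:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, sigma = 2.5, 1.0   # made-up values for illustration

for n in (10, 100, 10_000):
    X = theta + rng.normal(0.0, sigma, size=n)
    theta_hat = X.mean()  # the averaging estimator
    print(f"n={n:6d}  theta_hat={theta_hat:.4f}  error={abs(theta_hat - theta):.4f}")
```

But an empirical check isn't an explanation, hence the question.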