Just "take the average" they say. It's not that straightforward, right?

Question

I have an acquaintance who does not study statistics and doesn't understand that summing data and dividing by the number of data is a summary statistic, i.e. that information is lost.

For example, say that there are data which are measurements of some sort: $x_1, ..., x_{100}$. The most common measure of centrality is $$\hat{\mu}_1 = \bar{x} = \frac{1}{100}(x_1 + ... + x_{100})$$ However, if the data are skewed then $$\hat{\mu}_2 = \tilde{x} = \frac{1}{2}(x_{(50)} + x_{(51)})$$ is a better estimator. Then, of course there is the midrange $$\hat{\mu}_3 = \frac{1}{2}(x_{(1)} + x_{(100)})$$ that is an option.

My question is: defining some loss function— for simplicity's sake L2-loss— how to judge which $\hat{\mu}$ is best? Obviously the answer is specific to the data, but what is the MSE of the midrange, for example?

Good question, but I believe once you've said L2-loss is what you care about, $\hat \mu_1$ is right by definition. I'm hard pressed to say that I would ever use the midrange, because min & max are not very stable; I might consider the midhinge (mean of 1st & 3rd quartile), though. You might be interested in reading this: [Which “mean” to use and when?](http://stats.stackexchange.com/q/23117/7290) — gung - Reinstate Monica, Jul 13 '15 at 21:53
$\hat \mu_2$ is a strange statistic. If you intend it to be the median, then change "$49$" to "$51$". @Gung The midrange leads to low expected losses for many loss functions when the underlying distribution is close to uniform or symmetrically u-shaped. — whuber, Jul 13 '15 at 22:28
Your question in the last paragraph "*defining some loss function— for simplicity's sake L2-loss— how to judge which $\hat{μ}$ is best?*" was unclear to me. It seems to be missing some words. However, if you define a loss function, then that will define "what's best" (whatever minimizes the loss function). The MSE of the midrange depends on the distribution (but whether it's best depends on the loss function, since that defines what "best" means - why use MSE to compare, rather than the loss function?) — Glen_b, Jul 13 '15 at 22:34
@ScouserInTrousers what condition are you using to define the median as a better estimator? The median is less efficient than the mean. The halfpoint of the range even more so. — AdamO, Jul 13 '15 at 22:45

score 13 · Accepted Answer · answered Jul 13 '15 at 22:47

This is not a direct answer to your question about loss functions, but I am a Statistician, and I use the jargon of my domain, not the jargon of machine learning. I will attempt to answer the question: "which statistic is the best estimator of the population mean?"

It's incorrect to say generally that the arithmetic mean results in a loss of information. In fact, in some circumstances it can be proven that the arithmetic mean, or some function of it, contains as much information (Fisher information) about the parameter as the data themselves. This is the concept of a sufficient statistic, i.e. a summary of the data that retains all the information the data carry about the parameter.

For example, if you know that your data follow a Poisson distribution, then the sufficient statistic is $T(X) = X_1 + ... + X_n$, which is simply the sum of the data. For a Normal distribution where you know the variance, the arithmetic sample mean is the sufficient statistic for the population mean. That is, it contains all of the information and no other statistic will do better. Now granted, we are never in the situation where we know our data are normally distributed and happen to know the variance exactly. But that is why we have the central limit theorem. Even for skewed data, if what you really care about is the population mean, then the arithmetic mean is a pretty good bet, especially if you have a lot of observations. So to that end, I would say that in a lot of circumstances, especially when you have a lot of observations, the arithmetic mean is best if what you care about is the population mean.
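As a rough illustration, here is a minimal simulation sketch that estimates the MSE of the three estimators from the question (mean, median, midrange) by Monte Carlo. The sample size of 100 matches the question; the two test distributions (standard Normal and Uniform(0, 1)), the number of replications, the seed, and the helper names `midrange` and `simulate_mse` are all assumptions made for the example.

```python
import random
import statistics


def midrange(xs):
    """Average of the smallest and largest observation."""
    return (min(xs) + max(xs)) / 2


def simulate_mse(sampler, true_mean, n=100, reps=2000, seed=0):
    """Monte Carlo estimate of the MSE of each estimator of the population mean.

    `sampler` is a hypothetical callable that draws one observation
    from the assumed distribution using the supplied random generator.
    """
    rng = random.Random(seed)
    estimators = {
        "mean": statistics.fmean,
        "median": statistics.median,
        "midrange": midrange,
    }
    squared_error = {name: 0.0 for name in estimators}
    for _ in range(reps):
        xs = [sampler(rng) for _ in range(n)]
        for name, estimator in estimators.items():
            squared_error[name] += (estimator(xs) - true_mean) ** 2
    return {name: total / reps for name, total in squared_error.items()}


# Assumed scenarios: a Normal(0, 1) population and a Uniform(0, 1) population.
normal_mse = simulate_mse(lambda rng: rng.gauss(0.0, 1.0), true_mean=0.0)
uniform_mse = simulate_mse(lambda rng: rng.uniform(0.0, 1.0), true_mean=0.5)

for label, result in [("normal", normal_mse), ("uniform", uniform_mse)]:
    print(label, {name: round(mse, 6) for name, mse in result.items()})
```

Under these assumptions the mean should have the smallest simulated MSE for Normal data, while the midrange should come out ahead for Uniform data, in line with the comment that the midrange does well when the distribution is close to uniform. The ranking is a property of the assumed distribution, not of the estimators in general.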

Now, if you happen to be in the privileged position of knowing your data come from some other distribution, perhaps some pathological negative exponential distribution, then you're correct that there may be a better statistic. In that circumstance the sufficient statistic for $\mu$ is the minimum observation. This is a favorite example of Mukhopadhyay in Probability and Statistical Inference, and you will find all the exercises you can stomach in there to demonstrate it.
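To see how much the minimum can help in that setting, here is a small sketch under explicit assumptions: the data are $\mu$ plus an Exponential(1) draw, so $\mu$ is the location (threshold) parameter and the scale is known to be 1. The value $\mu = 2$, the replication count, and the function name `simulate_shifted_exponential` are hypothetical choices for illustration. Both estimators are bias-corrected using $E[\min] = \mu + 1/n$ and $E[\bar{x}] = \mu + 1$.

```python
import random
import statistics


def simulate_shifted_exponential(mu=2.0, n=100, reps=2000, seed=0):
    """Compare two estimators of the location mu of a shifted Exponential(1)."""
    rng = random.Random(seed)
    se_min = 0.0
    se_mean = 0.0
    for _ in range(reps):
        xs = [mu + rng.expovariate(1.0) for _ in range(n)]
        # The minimum overshoots mu by 1/n on average (scale assumed known).
        est_from_min = min(xs) - 1.0 / n
        # The sample mean overshoots mu by the known scale, 1.
        est_from_mean = statistics.fmean(xs) - 1.0
        se_min += (est_from_min - mu) ** 2
        se_mean += (est_from_mean - mu) ** 2
    return {"min_based": se_min / reps, "mean_based": se_mean / reps}


shifted_mse = simulate_shifted_exponential()
print(shifted_mse)
```

With the scale known, theory gives a variance of $1/n^2$ for the minimum-based estimator and $1/n$ for the mean-based one, so the simulated MSEs should differ by roughly a factor of $n$.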

To answer your question more generally, about how to choose the best statistic: plot your data. Look at it. Think about where it came from and how it was collected. Think about what it is you are actually trying to make inference on, and whether the way these data were collected is actually appropriate for that. Think about the form your data take: Are they strictly integer data? Proportions with a known denominator? Are they skewed, and if so, would a log-normal make for a good approximation? Choose a parametric family that seems to fit, and caveat if you must.

"I will attempt to answer the question: 'which statistic is the best estimator of the population mean?'"--that restatement really is key here, otherwise the OP's question is ill-posed. — Andrew M, Jul 14 '15 at 05:58

score 2 · Answer 2 · answered Jul 14 '15 at 23:24

Piggy-backing / building off of Dalton's answer:

Your question, as posed, is incomplete. "Information" as a statistical concept is only defined with reference to an unknown parameter as well as some statistic (a function of the full data), including the degenerate case of the full data itself. No statistic of the data has more information about any parameter than the full data, but sometimes a statistic has just as much information as the full data about a specific parameter.

Your intuition seems to be that summary statistics reduce the informational content of the full data, but again, there is no informational content to even the full data unless it is with reference to some parameter. It is true that, if you are estimating the variance of a Normal population, the statistic $S^2$ (sample variance) has no informational content about $\mu$ (the population mean). But as pointed out above, $\bar{X}$ contains equal information about $\mu$ as does the full data.

Your question is incomplete because the intuitive, casual definition of "information" is very different from its mathematical definition. For any specific real-world scenario, you will of course have to judge what is an appropriate assumption for the distribution (accounting for skewness, support, etc) and, consequently, what specific statistics preserve the information you need.

As an aside, $\bar{X}$ gives you no information about the variance of a Normal($\mu$, $\sigma^2$) distribution, either, but knowing the pair $(\bar{X}, S^2)$ provides the same information as the full data about both parameters.
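One way to check this numerically is to note that the Normal log-likelihood depends on the data only through $n$, $\bar{X}$ and $S^2$, because $\sum_i (x_i - \mu)^2 = (n-1)S^2 + n(\bar{X} - \mu)^2$. The sketch below is a minimal illustration of that identity; the simulated sample, the seed, the parameter values tried, and the function names `loglik_full` and `loglik_summary` are assumptions made for the example.

```python
import math
import random
import statistics


def loglik_full(xs, mu, sigma2):
    """Normal log-likelihood computed from every observation."""
    n = len(xs)
    sum_sq = sum((x - mu) ** 2 for x in xs)
    return -0.5 * n * math.log(2 * math.pi * sigma2) - sum_sq / (2 * sigma2)


def loglik_summary(n, xbar, s2, mu, sigma2):
    """Same log-likelihood computed from (n, sample mean, sample variance) only."""
    sum_sq = (n - 1) * s2 + n * (xbar - mu) ** 2
    return -0.5 * n * math.log(2 * math.pi * sigma2) - sum_sq / (2 * sigma2)


# Assumed sample: 100 draws from Normal(mean=3, sd=2).
rng = random.Random(1)
xs = [rng.gauss(3.0, 2.0) for _ in range(100)]
n = len(xs)
xbar = statistics.fmean(xs)
s2 = statistics.variance(xs)  # sample variance with the n - 1 denominator

# Arbitrary candidate parameter values (mu, sigma^2) at which to compare.
candidates = [(0.0, 1.0), (3.0, 4.0), (5.0, 0.5)]
for mu, sigma2 in candidates:
    print(mu, sigma2, loglik_full(xs, mu, sigma2), loglik_summary(n, xbar, s2, mu, sigma2))
```

The two columns should agree up to floating-point rounding for every candidate $(\mu, \sigma^2)$, which is what it means for the pair to carry the same information as the full data under the Normal model. The agreement would not hold for a model outside the Normal family.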
