
Let's say we have a game with two players. Both of them know that five samples are drawn from some distribution (not normal). Neither of them knows the parameters of the distribution used to generate the data. The goal of the game is to estimate the mean of the distribution. The player who comes closer to the true mean wins \$1 (the absolute difference between the estimated value and the actual value is the objective function). If the distribution has a mean that blows up to $\infty$, the player guessing the larger number wins; for $-\infty$, the one guessing the smaller number wins.

While the first player is given all five samples, the second is given only their sum (and knows that there were five of them).

What are some examples of distributions where this isn't a fair game and the first player has an advantage? I guess the normal distribution isn't one of them, since the sample mean is a sufficient statistic for the true mean.

Note: I asked a similar question about the normal distribution here: "Mean is not a sufficient statistic for the normal distribution when variance is not known?", and it was suggested I ask a new one for non-normal distributions.


EDIT: Two answers so far use a uniform distribution. I would love to hear about more examples if people know of any.
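Here is a minimal sketch of one version of the game, just to make the rules concrete (it assumes, for illustration, a Uniform$(0, 2\mu)$ population and the two strategies discussed in the answers below):

    import numpy as np

    rng = np.random.default_rng(0)
    mu, rounds = 0.5, 10**6
    xs = rng.uniform(0, 2 * mu, (rounds, 5))  # five samples per round
    p1 = 0.6 * xs.max(axis=1)                 # Player 1 sees all five samples
    p2 = xs.sum(axis=1) / 5                   # Player 2 sees only their sum
    # fraction of rounds in which Player 1's guess is closer to the true mean
    print("Player 1 win rate:", (np.abs(p1 - mu) < np.abs(p2 - mu)).mean())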

ryu576
  • The question can be phrased as determining some distributions where the sample mean is _Pitman-farther_ from the true mean than some other estimator – user257566 Jul 21 '21 at 01:41
  • Lemma 2.1 here gives a family of dominating estimators for any nonnegative distribution, simply by transforming the sample mean: https://lib.dr.iastate.edu/cgi/viewcontent.cgi?article=11204&context=rtd – user257566 Jul 21 '21 at 01:48
  • This article by Kubokawa gives an example for a particular normal distribution. ism.ac.jp/editsec/aism/pdf/041_3_0477.pdf – user257566 Jul 21 '21 at 01:56
  • @user257566 - can't open the link by pasting. Can you hyper-link? – ryu576 Jul 21 '21 at 07:15

3 Answers


For a uniform distribution between $0$ and $2 \mu$, the player who guesses the sample mean will do worse than one who guesses $\frac{3}{5} \max(x_i)$ (the sample maximum is a sufficient statistic for the mean of a uniform distribution with lower bound $0$).

In this particular case, this can be verified numerically. Without loss of generality, we set $\mu = 0.5$ in the simulation. It turns out that about two-thirds of the time, the $\frac{3}{5}\max$ estimator does better.

Here is a Python simulation demonstrating this.

    import numpy as np

    np.random.seed(0)  # fix the seed so the run is reproducible
    Ntrials = 1000000
    xs = np.random.random((5, Ntrials))  # 5 UNIF(0,1) samples per trial, so mu = 0.5
    sample_mean_error = np.abs(xs.mean(axis=0) - 0.5)
    better_estimator_error = np.abs(0.6 * xs.max(axis=0) - 0.5)
    # fraction of trials in which (3/5)*max is closer to mu than the sample mean
    print((sample_mean_error > better_estimator_error).mean())
shimao
  • Could you expand your answer to explain why this happens, mathematically? And what is the optimal $c$ when guessing $c\max(x_i)$ from $n$ samples rather than just 5? – orlp Jul 19 '21 at 10:42
  • Do you think $0.5743492$ (half the reciprocal of the median of a Beta$(5,1)$ distribution) might do better than $0.6$? – Henry Jul 19 '21 at 14:18
  • For $n$ samples and $\mu=0.5,$ the optimal value is $c= \frac{n+1}{2n}.$ This is because the expected value of the maximum is $\frac{n}{n+1}$ and you want to estimate $\mu$ rather than $2 \mu.$ – soakley Jul 19 '21 at 16:36
  • @soakley only if you insist on an unbiased estimator. But there can be another estimator which is more likely to be closer (and indeed there is in this example) despite being biased – Henry Jul 19 '21 at 17:12
  • Supposing there is a better estimator (based on the absolute value criterion), can you know it based on the information given? It is not clear to me that the players would know, for example, that the population is uniform. – soakley Jul 19 '21 at 18:17
  • @Henry Based on my calculations, the value of $c$ that minimizes the mean absolute deviation of estimators of the form $W(\mu) = c x_{(n)}$ is $$c = 2^{-n/(n+1)}.$$ So in the case of $n = 5$, this gives $c = 2^{-5/6} \approx 0.561231$. This of course is biased for $\mu$. We get the lower bound $$\operatorname{E}[|W(\mu) - \mu|] \ge \mu (-1 + 2^{1/(n+1)}).$$ By contrast, for $n = 5$, the sample mean $\bar X$ has expected mean absolute deviation $\frac{1199}{5760}\mu$, which is strictly greater. – heropup Jul 20 '21 at 09:21
  • @heropup My simulations suggest that $0.5743492=2^{-4/5}$ does better than $0.561231=2^{-5/6}$ in about $52.9\%$ of cases – Henry Jul 20 '21 at 09:30
  • @Henry My simulations agree with yours, so I must have done something wrong. I will check my calculations tomorrow. – heropup Jul 20 '21 at 09:43
  • Is this only true for 5 data points? What if you have more? Does it tend to equality when tending to infinity? – NotThatGuy Jul 20 '21 at 11:53
  • @NotThatGuy There is nothing special about $5$ points here - the same kind of thing would happen with any number greater than $1$ though with different fractions. As the number of points increases, any consistent estimator will tend to the true value, but in this example those based on the sample maximum will usually perform better than those based on the sample mean – Henry Jul 20 '21 at 16:17
  • Would your example still apply if we we facing e.g. square loss in place of absolute loss? – Richard Hardy Jul 20 '21 at 17:19
  • @RichardHardy square loss of one value, and we only care about whose is larger? No difference. $|a-\mu|>|b-\mu|$ implies $(a-\mu)^2>(b-\mu)^2$ and vice versa – user253751 Jul 20 '21 at 17:32
  • @user253751, for a single instance your argument makes sense. Meanwhile, I am thinking about expected values of losses defined as absolute errors vs. square errors. Generally, this makes a difference. (The answers given so far illustrate nothing else but expected values using simulations.) – Richard Hardy Jul 20 '21 at 18:34
  • Suppose for a moment Player 2 also had access to the sample maximum. As the pivotal quantity $Q=\frac{\max_{i=1}^5(X_i)}{2\mu}$ follows a beta distribution $Q\sim\mathrm B(5,1)$, its median is rather higher than its mean: so if she correctly guessed that Player 1 would use the mean-unbiased estimator of $\mu$, she could beat him $1-F_\mathrm{B}\left(\frac{5}{6}; 5, 1\right) \approx 60\%$ of the time with the estimator $(1-\epsilon)\frac{3}{5}\cdot\max_{i=1}^5(X_i)$ (where $\epsilon$ is very small). Of course if Player 1's wise to that tactic, he'll use the estimator ... – Scortchi - Reinstate Monica Jul 26 '21 at 10:49
  • ... $(1-\epsilon)^2\frac{3}{5}\cdot\max_{i=1}^5(X_i)$, & gain almost the advantage Player 2 had hoped for ... in fact the play-it-safe strategy for both players would be to use the median-unbiased estimator $\frac{1}{2F_\mathrm{B}^{-1}\left(\frac{1}{2}; 5, 1\right)}\cdot\max_{i=1}^5(X_i)$. The above is something of a digression, but returning to the case where Player 2 has only the sample mean to work with: is it enough to show that one particular estimator that's a function of the maximum beats one particular estimator that's a function of the mean, ... – Scortchi - Reinstate Monica Jul 26 '21 at 10:49
  • ... & to conclude that the game's unfair to Player 2? – Scortchi - Reinstate Monica Jul 26 '21 at 11:01
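A quick numerical check of the constants proposed in the comments above; the values $3/5$ (unbiased), $2^{-5/6}$ (heropup's MAE-minimizing value), and $2^{-4/5}$ (Henry's median-unbiased value) are taken from the thread, and the simulation itself is only a sketch:

    import numpy as np

    rng = np.random.default_rng(0)
    mx = rng.random((10**6, 5)).max(axis=1)  # max of 5 UNIF(0,1) samples; mu = 0.5
    for label, c in [("3/5", 3/5), ("2^(-5/6)", 2**(-5/6)), ("2^(-4/5)", 2**(-4/5))]:
        print(label, "mean abs error:", np.abs(c * mx - 0.5).mean())
    # Head-to-head: how often is the median-unbiased constant closer than the
    # MAE-minimizing one? (Henry reports about 52.9%.)
    closer = np.abs(2**(-4/5) * mx - 0.5) < np.abs(2**(-5/6) * mx - 0.5)
    print("2^(-4/5) closer:", closer.mean())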

The sum of the observations is not sufficient for estimating the mean of a uniform population: the midrange has a smaller expected absolute error than the sample mean.

Approximation by simulation in R:

    set.seed(2021)
    a  = replicate(10^6, mean(runif(5)))         # sample mean of 5 UNIF(0,1)
    mr = replicate(10^6, mean(range(runif(5))))  # midrange of 5 UNIF(0,1)
    mean(a);  mean(mr)                  # both are nearly unbiased for 0.5
    [1] 0.5000905
    [1] 0.5000926
    mean(abs(a-.5)); mean(abs(mr-.5))   # midrange has smaller mean absolute error
    [1] 0.1040754
    [1] 0.0833201

[Figure: histograms of the simulated distributions of the sample mean and the midrange of 5 UNIF(0,1) observations, produced by the code below.]

    par(mfrow=c(2,1))
    hdr1 = "UNIF(0,1): Simulated Dist'n of Mean of 5"
    hist(a, prob=T, xlim=0:1, br=30, col="skyblue2", main=hdr1)
    hdr2 = "UNIF(0,1): Sim. Dist'n of Midrange of 5"
    hist(mr, prob=T, xlim=0:1, br=30, col="skyblue2", main=hdr2)
    par(mfrow=c(1,1))

Note, per a comment: the same comparison using mean squared error instead of absolute error, along with RMSE so the units are comparable.

    mean((a-.5)^2); mean((mr-.5)^2)
    [1] 0.01665874
    [1] 0.01190478

    sqrt(mean((a-.5)^2)); sqrt(mean((mr-.5)^2))
    [1] 0.1290687
    [1] 0.109109
BruceET

It might be worth adding that, while you can often do better for low-dimensional parametric families, you can't do better if the distribution is completely unknown (or unknown apart from having a finite mean). The sample mean is the only estimator of the population mean that works over all such distributions.

Thomas Lumley
  • Is there no distribution where the mean blows up and looking at the samples might give you a hint? Even if the $5$ samples became a million? – ryu576 Jul 19 '21 at 06:36
  • Yes, plenty of them. But not if you don't know the distribution (or at least have some sort of prior over it). If you have some function that isn't the sample average, you can find some other distribution where that function doesn't estimate the mean. – Thomas Lumley Jul 19 '21 at 07:15
  • "The mean is the only estimator of the mean that works over all distributions". I don't know what "works" means in this case, but I assume the Cauchy distribution is a counterexample. – Cliff AB Jul 19 '21 at 18:59
  • You still can't do *better* than the sample average (i.e., infinitely wrong in the worst case) if you allow for distributions with no mean, but I think it's reasonable to leave them out. – Thomas Lumley Jul 19 '21 at 19:45
  • @ryu576 If you have a million samples, then you can probably make a reasonable guess about the distribution (if it follows some common form). But if you can figure out the distribution, then this answer no longer applies. – NotThatGuy Jul 20 '21 at 11:44
  • Does your point apply only under absolute loss as the evaluation criterion or more generally (e.g. under square loss)? – Richard Hardy Jul 20 '21 at 17:20
  • @RichardHardy because we only care about who was closer, it makes no difference. – user253751 Jul 20 '21 at 17:34
  • @user253751, thinking about expected values of losses defined as absolute errors vs. square errors generally can make a difference. My question is about this particular case. – Richard Hardy Jul 20 '21 at 18:32
  • @RichardHardy: The Lehmann-Scheffé theorem can be extended beyond squared-error loss: an unbiased estimator of some distributional property that's a function of a complete sufficient statistic for a given family of distributions is a uniformly minimum-risk unbiased estimator of that property, provided the loss function is convex. So in big non-parametric families for which the order statistic of an i.i.d. sample is sufficient & complete, the sample mean is UMRUE for the distribution mean. – Scortchi - Reinstate Monica Jul 20 '21 at 22:15
  • @NotThatGuy: If you can assume the distribution follows some common form, that's basically the same as having a prior over the distribution. – psmears Jul 21 '21 at 12:44
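A sketch illustrating the answer's point with one assumed pair of distributions (my example, not from the answer): the midrange estimates the mean of a uniform population well, but fails for an exponential one, whereas the sample mean works for both.

    import numpy as np

    rng = np.random.default_rng(0)
    trials, n = 200000, 5
    cases = [
        ("uniform",     rng.random((trials, n)),           0.5),  # true mean 0.5
        ("exponential", rng.exponential(1.0, (trials, n)), 1.0),  # true mean 1.0
    ]
    for name, xs, mu in cases:
        midrange = (xs.min(axis=1) + xs.max(axis=1)) / 2
        print(f"{name}: sample-mean MAE = {np.abs(xs.mean(axis=1) - mu).mean():.4f}, "
              f"midrange MAE = {np.abs(midrange - mu).mean():.4f}")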