
Can you provide an example of an MLE estimator of the mean that is biased?

I am not looking for an example that breaks MLE estimators in general by violating regularity conditions.

All examples I can see on the internet refer to the variance, and I can't seem to find anything related to the mean.

EDIT

@MichaelHardy provided an example where we get a biased estimate of the mean of a uniform distribution using MLE under a certain proposed model.

However

https://en.wikipedia.org/wiki/Uniform_distribution_(continuous)#Estimation_of_midpoint

suggests that the MLE is a uniformly minimum variance unbiased estimator of the mean, clearly under another proposed model.

At this point it is still not very clear to me what is meant by MLE estimation if it is so heavily dependent on the hypothesized model, as opposed to, say, the sample mean estimator, which is model neutral. In the end I am interested in estimating something about the population and don't really care about estimating a parameter of a hypothesized model.

EDIT 2

As @ChristophHanck showed, the model with additional information introduced bias but did not manage to reduce the MSE.

We also have additional results:

http://www.maths.manchester.ac.uk/~peterf/CSI_ch4_part1.pdf (p. 61)
http://www.cs.tut.fi/~hehu/SSP/lecture6.pdf (slide 2)
http://www.stats.ox.ac.uk/~marchini/bs2a/lecture4_4up.pdf (slide 5)

"If a most efficient unbiased estimator ˆθ of θ exists (i.e. ˆθ is unbiased and its variance is equal to the CRLB) then the maximum likelihood method of estimation will produce it."

"Moreover, if an efficient estimator exists, it is the ML estimator."

Since the MLE with free model parameters is unbiased and efficient, is it by definition "the" maximum likelihood estimator?

EDIT 3

@AlecosPapadopoulos has an example with the Half Normal distribution on the math forum:

https://math.stackexchange.com/questions/799954/can-the-maximum-likelihood-estimator-be-unbiased-and-fail-to-achieve-cramer-rao

Unlike the uniform case, it does not anchor any of its parameters. I would say that settles it, though he has not demonstrated the bias of the mean estimator there.

Cagdas Ozgenc
  • The mean of a uniform on zero and theta. – Christoph Hanck Dec 17 '16 at 20:02
  • I cannot follow your distinction between "estimating something about the population" and "a parameter of a hypothesized model". In all of parametric statistics, we parameterize a population by some parameters. Of course, we may thus run into issues of misspecification, but that does not seem to be relevant to the issue at hand. – Christoph Hanck Dec 18 '16 at 09:30
  • @ChristophHanck What do you mean by "parameterize a population"? In this forum everybody goes quite pedantic on me all the time. But that phrase sounds more like mass manipulation :). – Cagdas Ozgenc Dec 18 '16 at 09:35
  • For example, that a population can be characterized by its parameters/moments, like the mean and variance (which would be sufficient for a normal population, for example). And: I do not think that people are any more or less pedantic with you than with anybody else on this forum. – Christoph Hanck Dec 18 '16 at 09:38
  • If you are feeling unhappy about the apparent sleight of hand of switching between "parameter" and "mean", let me define a certain non-negative distribution in terms of its mean $\mu$, with density $\frac{1}{2\mu}$ on its support of $[0, 2\mu]$... – Silverfish Dec 18 '16 at 12:50
  • Regarding your edit 2, many of these results are derived under regularity conditions which are not satisfied for the uniform example discussed in this thread, for which the sample space depends on the parameter. – Christoph Hanck Dec 18 '16 at 16:56

6 Answers


Christoph Hanck has not posted the details of his proposed example. I take it he means the uniform distribution on the interval $[0,\theta],$ based on an i.i.d. sample $X_1,\ldots,X_n$ of size $n>1.$

The mean is $\theta/2$.

The MLE of the mean is $\max\{X_1,\ldots,X_n\}/2.$

That is biased since $\Pr(\max < \theta) = 1,$ so $\operatorname{E}({\max}/2)<\theta/2.$

PS: Perhaps we should note that the best unbiased estimator of the mean $\theta/2$ is not the sample mean, but rather is $$\frac{n+1} {2n} \cdot \max\{X_1,\ldots,X_n\}.$$ The sample mean is a lousy estimator of $\theta/2$ because for some samples, the sample mean is less than $\dfrac 1 2 \max\{X_1,\ldots,X_n\},$ and it is clearly impossible for $\theta/2$ to be less than ${\max}/2.$
end of PS
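A quick numerical check of both claims — a minimal R sketch in the spirit of the simulation further down this page; the choices of $\theta$, $n$ and the number of replications are arbitrary:

# Bias of the MLE max/2 for the mean theta/2 of U[0, theta],
# versus the unbiased correction (n + 1) / (2 n) * max.
set.seed(1)
theta <- 1; n <- 5; reps <- 100000
mx <- replicate(reps, max(runif(n, 0, theta)))
mean(mx / 2)                   # noticeably below theta / 2 = 0.5
mean((n + 1) / (2 * n) * mx)   # approximately 0.5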


I suspect the Pareto distribution is another such case. Here's the probability measure: $$ \alpha\left( \frac \kappa x \right)^\alpha\ \frac{dx} x \text{ for } x >\kappa. $$ The expected value is $\dfrac \alpha {\alpha -1 } \kappa.$ The MLE of the expected value is $$ \frac n {n - \sum_{i=1}^n \big((\log X_i) - \log(\min)\big)} \cdot \min $$ where $\min = \min\{X_1,\ldots,X_n\}.$

I haven't worked out the expected value of the MLE for the mean, so I don't know what its bias is.
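A simulation along the following lines could be used to probe it numerically. This is only a sketch with arbitrary values of $\alpha$, $\kappa$ and $n$, and it claims no result; note also that for small $\alpha$ or $n$ the denominator of the MLE can occasionally be non-positive, which destabilizes the average:

# Sketch: Monte Carlo average of the MLE of the Pareto mean given above.
# Pareto(alpha, kappa) draws via the inverse CDF: X = kappa * U^(-1/alpha).
set.seed(1)
alpha <- 3; kappa <- 1; n <- 20; reps <- 100000
mle.mean <- replicate(reps, {
  x <- kappa * runif(n)^(-1 / alpha)
  m <- min(x)
  n * m / (n - sum(log(x) - log(m)))   # MLE of the expected value from above
})
mean(mle.mean)                 # compare with the true mean below
alpha * kappa / (alpha - 1)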

Michael Hardy
  • I think you switched your inequality in the last line. max(X1, ... Xn) < θ always. – Noah Dec 17 '16 at 21:20
  • What if I propose a uniform model [a,b] and do the estimation accordingly? I mean, why would I start with [0,theta] in the first place? I define my own hypothesis. – Cagdas Ozgenc Dec 17 '16 at 21:33
  • Cagdas, It's not legitimate to ask for a counterexample and then deny that you would propose something else! It's like asking for an example of a fruit that is not red, being shown a blueberry, and then saying it doesn't count because you don't like blueberries. – whuber Dec 17 '16 at 21:45
  • @CagdasOzgenc : You asked for an example. – Michael Hardy Dec 17 '16 at 21:48
  • @whuber I am envisioning the scenario when I am given the data set, not a density to start with. – Cagdas Ozgenc Dec 17 '16 at 21:48
  • That's not relevant to the question you asked. – whuber Dec 17 '16 at 21:49
  • Without a density / distribution, you will be hard-pressed to come up with an MLE... – jbowman Dec 17 '16 at 21:49
  • @jbowman It is up to me to propose the hypothesis. Basically, given a suitable hypothesis, is MLE still biased? – Cagdas Ozgenc Dec 17 '16 at 21:51
  • @CagdasOzgenc : Whether the MLE is biased or not depends on the model. There's no such thing as an MLE without a model. And if you alter the model, you alter the MLE. – Michael Hardy Dec 17 '16 at 21:52
  • Repeating the comments of others, you asked for an example. You shouldn't subsequently say "given a suitable hypothesis" and then use that as an excuse to reject the examples. Either define "suitable hypothesis" precisely, preferably in an edit to the question, or accept the answer. – jbowman Dec 17 '16 at 21:54
  • @MichaelHardy I am not disagreeing with the provided answer. If some people would stop bullying me around, I would get my head around this. Basically, what you are saying is that it is incorrect to say that MLE is biased; the correct way to say it is that MLE with a certain proposed model is biased. This is a little different from saying, for example, that the sample mean is an unbiased estimator. – Cagdas Ozgenc Dec 17 '16 at 22:02
  • @CagdasOzgenc Here's a Socratic question: the sample mean is an unbiased estimator of what? You need a model to have a parameter to be estimating. – Matthew Drury Dec 17 '16 at 22:32
  • @MatthewDrury The mean doesn't necessarily have to be a parameter. It has a clear definition. The sample mean is an unbiased estimator of the mean irrespective of what the true DGP is. What model to propose for MLE, on the other hand, is open ended. That's why I am a little confused here. – Cagdas Ozgenc Dec 17 '16 at 22:36
  • Sure, but you did say "sample mean is an unbiased estimator"; to me, that is an incomplete statement. An estimator is an estimator of something, and that something is most generally a parameter of interest (terminology from Wasserman) in a statistical model. The same is true of an MLE: the model must exist before the MLE of some parameter. I don't think it makes sense to posit the existence of an MLE without a model. – Matthew Drury Dec 17 '16 at 23:02
  • The mean of an i.i.d. sample is an unbiased estimator of the population mean, but one cannot speak of a maximum-likelihood estimator of anything without more structure than what is needed to speak of an unbiased estimator of something. – Michael Hardy Dec 17 '16 at 23:14
  • I'm not sure if I understand this answer correctly. For the first part, are you talking about a uniform distribution on $[0,\theta]$ where $\theta$ is unknown, and the only information we have available to attempt to determine it is the set of samples $X_i$ drawn from that distribution? – David Z Dec 18 '16 at 05:39
  • @DavidZ: Right. In maximum likelihood estimation, you start with a sample of independent observations from a distribution that you know (or assume) to have a given form, that is, from a distribution that is fully characterized by one or more unknown parameters. MLE then provides a way to estimate the values of those parameters. In Michael Hardy's example, the distribution has the form "uniform distribution on the interval $[0, \theta]$", where $\theta$ is the parameter. (This could also be written as "uniform distribution on the interval $[0,2\mu]$".) – ruakh Dec 18 '16 at 05:57
  • @ruakh Thanks. Now it makes sense. I'm familiar with the basic idea of MLE as you've described; there's just something about the way this answer is written that made it hard for me to wrap my head around at first. – David Z Dec 18 '16 at 06:00
  • @DavidZ : That is correct, except that I would refer to the sample $X_1,\ldots,X_n$ rather than calling it a "set of samples". It's just one sample, consisting of $n$ observations. – Michael Hardy Dec 18 '16 at 16:27
  • My answer here contains a calculation relevant to the Pareto: http://stats.stackexchange.com/questions/94402/what-is-the-difference-between-finite-and-infinite-variance/100161#100161 – kjetil b halvorsen Dec 18 '16 at 17:48
  • I am confused by your added "PS": the formula yields something larger than max(X_i) because (n+1)/n > 1. Is this intended? – amoeba Jan 09 '17 at 23:35

Here's an example that I think some may find surprising:

In logistic regression, for any finite sample size with non-deterministic outcomes (i.e. $0 < p_{i} < 1$), any estimated regression coefficient is not only biased; its mean is actually undefined.

This is because for any finite sample size, there is a positive probability (albeit very small if the number of samples is large compared with the number of regression parameters) of getting perfect separation of outcomes. When this happens, estimated regression coefficients will be either $-\infty$ or $\infty$. Having positive probability of being either $-\infty$ or $\infty$ implies the expected value is undefined.

For more on this particular issue, see the Hauck-Donner effect.
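A minimal R sketch of the separation phenomenon (the four data points are a made-up toy example): a covariate whose sign perfectly separates the outcomes makes the likelihood increase without bound in the slope, so glm() stops at a very large value and warns.

# Perfect separation: y is fully determined by the sign of x.
x <- c(-2, -1, 1, 2)
y <- c(0, 0, 1, 1)
fit <- glm(y ~ x, family = binomial)   # expect warnings: no finite maximizer exists
coef(fit)                              # enormous slope, "trying" to reach +Inf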

Cliff AB
  • This is quite clever. I wonder if the MLE of logistic regression coefficients is unbiased conditional on the nonoccurrence of the Hauck-Donner effect? – gung - Reinstate Monica Dec 28 '16 at 01:48
  • @gung: Short answer: ignoring the Hauck-Donner effect, there is still upward bias in absolute regression coefficients (i.e. negative coefficients have downward bias, positive have upward bias). Interestingly, there appears to be a bias toward 0.5 in estimated probabilities. I've started writing up about it on this [post](http://cliffstats.weebly.com/almost-surely-degenerate-ramblings/exploring-the-undefined-bias-of-logistic-regression), but haven't put up my results on the biases of the estimated probabilities. – Cliff AB Dec 28 '16 at 18:36
  • David Firth has some related papers on using Jeffreys priors as a penalty for logistic regression. – nan hu Dec 16 '21 at 03:10

Although @MichaelHardy has made the point, here is a more detailed argument as to why the MLE of the maximum (and hence, that of the mean $\theta/2$, by invariance) is not unbiased, although it is in a different model (see the edit below).

We estimate the upper bound of the uniform distribution $U[0,\theta]$. Here, $y_{(n)}$ is the MLE, for a random sample $y$. We show that $y_{(n)}$ is not unbiased. Its cdf is \begin{eqnarray*} F_{y_{(n)}}(x)&=&\Pr\{Y_1\leqslant x,\ldots,Y_n\leqslant x\}\\ &=&\Pr\{Y_1\leqslant x\}^n\\ &=&\begin{cases} 0&\qquad\text{for}\quad x<0\\ \left(\frac{x}{\theta}\right)^n&\qquad\text{for}\quad 0\leqslant x\leqslant\theta\\ 1&\qquad\text{for}\quad x>\theta \end{cases} \end{eqnarray*} Thus, its density is $$f_{y_{(n)}}(x)= \begin{cases} \frac{n}{\theta}\left(\frac{x}{\theta}\right)^{n-1}&\qquad\text{for}\quad 0\leqslant x\leqslant\theta\\ 0&\qquad\text{else} \end{cases} $$ Hence, \begin{eqnarray*} E[Y_{(n)}]&=&\int_0^\theta x\frac{n}{\theta}\left(\frac{x}{\theta}\right)^{n-1}dx\\ &=&\int_0^\theta n\left(\frac{x}{\theta}\right)^{n}dx\\ &=&\frac{n}{n+1}\theta \end{eqnarray*}

EDIT: It is indeed the case that (see the discussion in the comments) the MLE is unbiased for the mean in the case in which both the lower bound $a$ and upper bound $b$ are unknown. Then, the minimum $Y_{(1)}$ is the MLE for $a$, with (details omitted) expected value $$ E(Y_{(1)})=\frac{na+b}{n+1} $$ while $$ E(Y_{(n)})=\frac{nb+a}{n+1} $$ so that the MLE for $(a+b)/2$ is $$ \frac{Y_{(1)}+Y_{(n)}}{2} $$ with expected value $$ E\left(\frac{Y_{(1)}+Y_{(n)}}{2}\right)=\frac{na+b+nb+a}{2(n+1)}=\frac{a+b}{2} $$

EDIT 2: To elaborate on Henry's point, here is a little simulation of the MSE of the two estimators of the mean. It shows that while the MLE that does not assume a known lower bound of zero is unbiased, the MSEs of the two variants are virtually identical, suggesting that the estimator which incorporates knowledge of the lower bound makes up for its bias through reduced variability.

theta <- 1
mean <- theta/2                                  # true mean of U[0, theta]
reps <- 500000                                   # Monte Carlo replications
n <- 5                                           # sample size
mse <- bias <- matrix(NA, nrow = reps, ncol = 2)

for (i in 1:reps){
  x <- runif(n, min = 0, max = theta)
  mle.knownlowerbound <- max(x)/2                # MLE when the lower bound 0 is known
  mle.unknownlowerbound <- (max(x)+min(x))/2     # MLE when both bounds are unknown
  mse[i,1] <- (mle.knownlowerbound-mean)^2
  mse[i,2] <- (mle.unknownlowerbound-mean)^2
  bias[i,1] <- mle.knownlowerbound-mean
  bias[i,2] <- mle.unknownlowerbound-mean
}

> colMeans(mse)
[1] 0.01194837 0.01194413

> colMeans(bias)
[1] -0.083464968 -0.000121968
Christoph Hanck
  • Because Wikipedia is proposing a different model to begin with. That's where my confusion lies. – Cagdas Ozgenc Dec 18 '16 at 09:31
  • Yes, but once we adjust to the special case discussed here, namely $a=0$, we are back at square 1. In that case, we do not need the sample minimum for estimation anymore, as we *know* that the lower bound is zero, so that the MLE of the midpoint (=median=mean) simply becomes $(max+0)/2$ again. – Christoph Hanck Dec 18 '16 at 09:34
  • That's quite unfortunate, because additional external information should have put us in a better position in terms of estimation, not worse. – Cagdas Ozgenc Dec 18 '16 at 09:37
  • I have not worked out the details, but the MLE in that model could be unbiased if the minimum overestimates the lower bound by the same amount as the maximum underestimates the maximum, so that the midpoint is being estimated without bias. – Christoph Hanck Dec 18 '16 at 09:41
  • @CagdasOzgenc: unbiasedness is not the only or even the most important measure of *better*. By knowing one end of the support precisely, you may lose the balance between errors in estimating the mean, but you end up with (for example) a better estimate of the range. – Henry Dec 18 '16 at 10:12
  • Why do you think reducing variability at the expense of increasing bias is justified when there is no reduction in MSE? That's not how we play this game of biased estimation. – Cagdas Ozgenc Dec 18 '16 at 14:40
  • Maximum likelihood estimators are not always "best" across all criteria for small sample sizes. So what? They don't pretend to be, either. If you want to use a different estimator for your problem that has better properties according to some criterion for sample sizes that are in the neighborhood of your actual sample size, you're free to do so. I do so, and so do other people. No one is claiming that using MLE is justified in all situations just because it's MLE. – jbowman Dec 18 '16 at 15:10
  • @CagdasOzgenc, I was not claiming to provide justification for using either variant of the MLE, just numerically illustrating the point made by Henry that an estimator with lower bias need not improve upon biased ones according to other criteria. (That said, I did indeed expect the MSE of the MLE with known lower bound to be smaller, which does not seem to be the case.) – Christoph Hanck Dec 18 '16 at 16:52

Completing here the omission in my answer over at math.se referenced by the OP,

assume that we have an i.i.d. sample of size $n$ of random variables following the Half Normal distribution. The density and moments of this distribution are

$$f_H(x) = \sqrt{2/\pi}\cdot \frac 1{v^{1/2}}\cdot \exp\big\{-\frac {x^2}{2v} \big\} \\ E(X) = \sqrt{2/\pi}\cdot v^{1/2}\equiv \mu,\;\; \operatorname{Var}(X) = \left(1-\frac 2 \pi \right)v$$

The log-likelihood of the sample is

$$L(v\mid \mathbf x) = n\ln\sqrt{2/\pi}-\frac n2\ln v -\frac 1 {2v} \sum_{i=1}^n x_i^2$$

The first derivative with respect to $v$ is

$$\frac {\partial}{\partial v}L(v\mid\mathbf x) = -\frac n{2v} + \frac 1 {2v^2} \sum_{i=1}^n x_i^2,\implies \hat v_\text{MLE} = \frac 1n \sum_{i=1}^nx_i^2$$

so it is a method of moments estimator. It is unbiased since,

$$E(\hat v_\text{MLE}) = E(X^2) = \operatorname{Var}(X) + [E(X)]^2 = \left(1-\frac 2 \pi \right)v + \frac 2 \pi v = v$$

But the resulting estimator of the mean is biased downward, due to Jensen's inequality:

\begin{align} \hat \mu_\text{MLE} = \sqrt{2/\pi}\cdot \sqrt {\hat v_\text{MLE}} \implies & E\left(\hat \mu_\text{MLE}\right) = \sqrt{2/\pi}\cdot E\left(\sqrt {\hat v_\text{MLE}}\,\right) \\[6pt] & < \sqrt{2/\pi}\cdot \left[\sqrt {E(\hat v_\text{MLE})}\,\right] = \sqrt{2/\pi}\cdot \sqrt v = \mu \end{align}
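A small simulation is consistent with this; here is a minimal R sketch (the values of $v$, $n$ and the number of replications are arbitrary), using the fact that $\sqrt v\,|Z|$ with $Z\sim N(0,1)$ has the Half Normal density above:

# The MLE of v is unbiased; the implied MLE of the mean mu = sqrt(2/pi)*sqrt(v)
# is biased downward by Jensen's inequality.
set.seed(1)
v <- 2; n <- 10; reps <- 100000
v.hat <- replicate(reps, mean(abs(rnorm(n, sd = sqrt(v)))^2))
mu.hat <- sqrt(2 / pi) * sqrt(v.hat)
c(mean(v.hat), v)                          # essentially equal
c(mean(mu.hat), sqrt(2 / pi) * sqrt(v))    # first entry falls a bit short of mu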

Alecos Papadopoulos

The famous Neyman-Scott problem has an inconsistent MLE, in the sense that it never even converges to the right thing. It motivates the use of conditional likelihood.

Take $(X_i, Y_i) \sim \mathcal{N}\left(\mu_i, \sigma^2 \right)$. The MLE of $\mu_i$ is $(X_i + Y_i)/2$ and that of $\sigma^2$ is $\hat{\sigma}^2 = \sum_{i=1}^n \frac{1}{n} s_i^2$ with $s_i^2 = (X_i - \hat{\mu}_i)^2/2 + (Y_i - \hat{\mu}_i)^2/2 = (X_i - Y_i)^2 / 4$, which has expected value $\sigma^2/2$ and so is biased by a factor of 2.
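A minimal R sketch of this point (the parameter values are arbitrary): even with a very large number of pairs, the MLE of $\sigma^2$ settles near $\sigma^2/2$ rather than $\sigma^2$.

# Neyman-Scott: the MLE of sigma^2 converges to sigma^2 / 2, not sigma^2.
set.seed(1)
sigma2 <- 4; n <- 100000
mu <- rnorm(n, mean = 0, sd = 3)            # arbitrary pair-specific means
x  <- rnorm(n, mean = mu, sd = sqrt(sigma2))
y  <- rnorm(n, mean = mu, sd = sqrt(sigma2))
sigma2.hat <- mean((x - y)^2 / 4)           # the MLE derived above
c(sigma2.hat, sigma2 / 2, sigma2)           # first two agree, third does not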

AdamO
  • While this example holds true, this actually defies one of the basic regularity conditions for asymptotic results of MLEs: that $k / n \rightarrow 0$, where $k$ is the number of parameters estimated and $n$ is the sample size. – Cliff AB Dec 18 '16 at 18:38
  • @CliffAB the assumption violation is that the parametric dimension is not fixed. The dimension of $\Theta$ goes to $\infty$ as $n \rightarrow \infty$. I think that's what you're saying, but don't know what $k$ means. The practical illustration of this example of course is that these results would be biased even in small samples and you have to use conditional likelihood, like a mixed effects model, to estimate $\sigma$ in this case. – AdamO Dec 19 '16 at 15:06

There is an infinite range of examples for this phenomenon since

  1. the maximum likelihood estimator of a bijective transform $\Psi(\theta)$ of a parameter $\theta$ is the bijective transform of the maximum likelihood estimator of $\theta$, $\Psi(\hat{\theta}_\text{MLE})$;
  2. the expectation of the bijective transform of the maximum likelihood estimator of $\theta$, $\Psi(\hat{\theta}_\text{MLE})$, namely $\mathbb{E}[\Psi(\hat{\theta}_\text{MLE})]$, is not the bijective transform of the expectation of the maximum likelihood estimator, $\Psi(\mathbb{E}[\hat{\theta}_\text{MLE}])$ (points 1 and 2 are illustrated numerically in the sketch after this list);
  3. most transforms $\Psi(\theta)$ are expectations of some transform of the data, $\mathfrak{h}(X)$, at least for exponential families, provided an inverse Laplace transform can be applied to them.
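As a concrete illustration of points 1 and 2, here is a minimal R sketch using an exponential model (my own choice of example, not taken from the thread): the MLE of the exponential mean $\mu$ is the sample mean and is unbiased, while the MLE of the transformed parameter $1/\mu$ is $1/\bar x$, whose expectation is $\frac{n}{n-1}\cdot\frac 1 \mu > \frac 1 \mu$.

# Invariance plus Jensen: xbar is the (unbiased) MLE of the exponential mean mu,
# but 1/xbar, the MLE of the rate 1/mu, is biased upward.
set.seed(1)
mu <- 2; n <- 5; reps <- 100000
xbar <- replicate(reps, mean(rexp(n, rate = 1 / mu)))
c(mean(xbar), mu)             # essentially equal
c(mean(1 / xbar), 1 / mu)     # first entry exceeds 1/mu, by a factor of about n/(n-1)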
Xi'an