
I know that $\frac{1}{n-1}\sum_{i=1}^{n}(X_{i}-\bar{X}_{n})^{2}$ is an unbiased estimator of the variance.

I thought such an estimator would be useful when it is not known from which distribution the data at hand come.

Now suppose I have a data set and I do know the underlying distribution of the observations. Shouldn't I then prefer a maximum likelihood estimator for the variance, even if it is biased, since it makes use of the underlying distribution of the data, over the general variance estimator cited above, which is not tied to any specific distribution?

Thanks

Pugl
  • Note that the ML estimator has just a $\frac{1}{n}$ factor instead of $\frac{1}{n-1}$. For moderately large $n$'s, using one or the other will probably make no difference at all. – ocram Feb 08 '13 at 21:51
  • @ocram That's the ML estimator for a *Normal* distribution (and, possibly, a few others), but it's not universal. For instance, *no* multiple of a sum of squares of deviations from the mean is an ML estimator of the variance of a Poisson distribution. – whuber Feb 08 '13 at 22:50
  • Both forms use *exactly* the same information - the sums of squares of deviation from the mean. The only difference is that scaling factor. If you need the variance estimate to be unbiased you should use it, but it's not (say) minimum MSE for the variance, and it's not unbiased if you're taking the square root and using that for the standard deviation. At least the ML estimate (ML at the normal) is still ML for the s.d. In practice there's rarely much difference and I regularly use each in different circumstances with little worry. I'm usually not worried about an unbiased variance estimate. – Glen_b Feb 08 '13 at 22:53
  • Thanks for the answers. @Glen_b: Could you possibly elaborate on the statement that the ML estimate is ML for the s.d.? Best, Pegah – Pugl Feb 08 '13 at 23:07
  • Okay, but I'll have to take that to an answer. – Glen_b Feb 08 '13 at 23:15
  • @whuber: You are right! Thanks for pointing that out! – ocram Feb 09 '13 at 05:19
  • See also the last paragraph of Glen_b's answer in the thread ["Maximum Likelihood Estimation — why it is used despite being biased in many cases"](https://stats.stackexchange.com/questions/183006/). – Richard Hardy Apr 26 '18 at 09:00

2 Answers


I think the answer is generally yes. If you know more about the distribution, you should use that information. For some distributions this will make very little difference, but for others it could be considerable.

As an example, consider the Poisson distribution. In this case the mean and the variance are both equal to the parameter $\lambda$, and the ML estimate of $\lambda$ is the sample mean.
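
To see why the sample mean is the ML estimate here, set the derivative of the Poisson log-likelihood to zero:

$$\ell(\lambda)=\sum_{i=1}^{n}\left(X_i\log\lambda-\lambda-\log X_i!\right),\qquad \ell'(\lambda)=\frac{1}{\lambda}\sum_{i=1}^{n}X_i-n=0\;\Rightarrow\;\hat{\lambda}_{ML}=\bar{X}_n.$$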

The charts below show 100 simulations of estimating the variance by taking either the sample mean or the sample variance. The histogram labelled X1 uses the sample mean, and X2 uses the sample variance. As you can see, both are unbiased, but the sample mean is a much better estimate of $\lambda$ and hence a better estimate of the variance.

[Figure: histograms of the two estimates across the simulations; panel X1 uses the sample mean, panel X2 the sample variance.]

The R code for the above is here:

library(ggplot2)
library(reshape2)

# One simulation: draw 100 Poisson(4) observations and return both estimates
# of lambda: the sample mean (the ML estimate) and the sample variance.
testpois = function(){
  X = rpois(100, 4)
  mu = mean(X)
  v = var(X)
  return(c(mu, v))
}

# Repeat 100 times; after melting, variable X1 holds the sample means and X2 the sample variances.
P = data.frame(t(replicate(100, testpois())))
P = melt(P)

# Histogram of each estimate, with a dashed red line at the overall mean of the plotted values.
ggplot(P, aes(x=value)) + geom_histogram(binwidth=.1, colour="black", fill="white") +
  geom_vline(aes(xintercept=mean(value, na.rm=T)),   # Ignore NA values for mean
             color="red", linetype="dashed", size=1) + facet_grid(variable~.)

As to the question of bias, I wouldn't worry too much about your estimator being biased (in the example above it isn't, but that is just luck). If unbiasedness is important to you, you can always use the jackknife to try to remove the bias.
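
For instance, here is a minimal sketch of that idea in R (the helper names jackknife_correct and var_ml are purely illustrative); for the $\frac{1}{n}$ variance, the jackknife correction happens to recover the usual $\frac{1}{n-1}$ form:

# Jackknife bias correction for an arbitrary estimator (illustrative sketch).
jackknife_correct = function(x, estimator){
  n = length(x)
  theta_hat = estimator(x)
  # Leave-one-out estimates
  theta_loo = sapply(seq_len(n), function(i) estimator(x[-i]))
  bias_hat = (n - 1) * (mean(theta_loo) - theta_hat)
  theta_hat - bias_hat   # bias-corrected estimate
}

var_ml = function(x) mean((x - mean(x))^2)   # the 1/n form (ML at the normal)

x = rpois(100, 4)
jackknife_correct(x, var_ml)   # matches var(x), the 1/(n-1) form, up to numerical precision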

Corvus
  • That's a very good answer, thank you a lot. Indeed I was not mainly interested in bias, but thought of bias as an example of a criterion which could be used to prefer one approach over the other. So, I was generally interested in knowing if there are theoretical reasons to rather prefer the ML estimate, since as you say intuitively it makes more sense to use all the information available. – Pugl Feb 09 '13 at 14:46

I have moved my comment to an answer so I can expand on it as requested.

[If you mean the variance form $\frac{1}{n}\sum_{i=1}^{n}(X_{i}-\bar{X}_{n})^{2}$ as ML (which it is for the normal), then both forms use exactly the same information - the sums of squares of deviations from the mean. The only difference is that scaling factor.]
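
In symbols, the relationship between the two forms is just

$$\frac{1}{n}\sum_{i=1}^{n}(X_{i}-\bar{X}_{n})^{2}=\frac{n-1}{n}\cdot\frac{1}{n-1}\sum_{i=1}^{n}(X_{i}-\bar{X}_{n})^{2}.$$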

If you need the variance estimate to be unbiased, you could use it (note that in general you could take the MLE of the variance for a particular distribution and see whether you can at least approximately unbias it; that may be more efficient), but it's not (say) minimum MSE for the variance, and it's not unbiased if you're taking the square root and using that for the standard deviation.

At least the ML estimate for the variance is still ML for the s.d. (irrespective of which distribution you have an MLE of the variance for).

Here's why I say that:

MLEs have the property of being invariant to transformations of parameters - the MLE of $g(\theta)$ is $g(\hat{\theta})$ (or more concisely, $\widehat{g(\theta)}=g(\hat{\theta})$). See the brief discussion here, and the material under note 2 here.
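
Applied to the case in question, take $g(\cdot)=\sqrt{\cdot}$:

$$\hat{\sigma}_{ML}=g\!\left(\hat{\sigma}^{2}_{ML}\right)=\sqrt{\hat{\sigma}^{2}_{ML}},$$

so the square root of the ML variance estimate is automatically the ML estimate of the standard deviation.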

None of those prove it, but I'll give you a (somewhat handwavy) motivation/outline of an argument for the simple case of monotonic transformations. You can find a complete argument in many texts that discuss ML at more than a really elementary level.

In the case of monotonic transformations: Take a simple case - imagine I have some curve ($y$ vs $x$) with a single peak somewhere in the middle (both a global and local maximum). Now I transform $x$ to $\xi$ ($\xi=t(x)$) while $y$ is unchanged. The shape of the curve changes, but the corresponding $y$'s don't. The original maximum of $y$ is still the maximum, at the corresponding place in $\xi$ that it was under $x$ (that is, if the maximum was at $x^*$, it's now at $\xi^*=t(x^*)$). You should see how to extend that intuition to any monotonic transformation and any global maximum. [The more general case of non-monotonic transformations is less immediately obvious, but it is still true. Edit: it's true in the case of one-to-one functions by a similar argument to the above.]
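
As a quick numerical illustration of the invariance (a sketch with made-up variable names, assuming normal data), you can maximise the normal log-likelihood directly in $\sigma$ and compare it with the square root of the $\frac{1}{n}$ variance estimate:

# Numerical illustration of ML invariance at the normal.
set.seed(1)
x = rnorm(50, mean = 2, sd = 3)

# Maximise the log-likelihood directly as a function of sigma
# (mu fixed at its own MLE, the sample mean).
negloglik_sigma = function(s) -sum(dnorm(x, mean = mean(x), sd = s, log = TRUE))
sigma_direct = optimize(negloglik_sigma, interval = c(0.01, 20))$minimum

# Transform the MLE of the variance instead.
sigma_transformed = sqrt(mean((x - mean(x))^2))

c(sigma_direct, sigma_transformed)   # agree up to optimizer tolerance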

Returning to the original answer:

In practice (in the $n$ vs $n-1$ case) there's rarely much difference, and I regularly use each in different circumstances with little worry. I'm usually not worried about an unbiased variance estimate.

Glen_b
  • (-1) Nowhere in the question does it state that his data is normal. The OP is asking: if he knows the underlying family of distributions (Poisson, Normal, Gamma, or whatever), should he base his estimator on ML even if it is biased, or just use the generic (nonparametric) unbiased estimator anyway? For the Poisson distribution, there is a lot more than just a scaling factor separating the MLE of the variance and the unbiased estimator the OP gave. – guy Feb 08 '13 at 23:40
  • When the original question calls the $\frac{1}{n}$ form of variance "ML", *that* implies normality. That is, *I* introduced no assumption not made by the question. I merely pointed that assumption out. You're downvoting my answer because I mention that implied assumption explicitly rather than have it merely implied? What? – Glen_b Feb 08 '13 at 23:44
  • Nowhere in the OP do I see that scaling factor, and as far as I can tell it hasn't been edited... – guy Feb 08 '13 at 23:45
  • @guy Okay, that's true. I was conflating ocram's comment (which the OP didn't contradict as not being applicable) with the question. I have made some edits. – Glen_b Feb 08 '13 at 23:52
  • Although I do understand the invariance property of ML, I am not sure that the original question has been fully addressed (and as @guy has correctly pointed out, I did not assume any specific family of distributions). Pegah – Pugl Feb 09 '13 at 00:19
  • @Pegah The difficulty is with the word 'should'. Such things depend on your criteria. What do you want to achieve? [As I pointed out in my answer, you could consider trying to get the advantage of the information in an MLE and near-unbiasedness if you 'unbias' your ML estimator (at least to a first order approximation, say).] Perhaps you could edit your question to add more detail about what you most want to achieve. [Alternatively - why do you care about either ML or unbiasedness rather than some other criterion? What do you need to achieve?] – Glen_b Feb 09 '13 at 08:28
  • Surely the ML thing only works for monotonic functions. Suppose $\hat\theta_{ML} = 4$ and then $g(\theta)= 1$ if $\theta \in (3.99, 4.01)$ otherwise $g(\theta)=0$, then the maximum likelihood estimate of $g(\theta)$ is going to be 0 for some reasonable width likelihood function. – Corvus Feb 09 '13 at 10:27
  • Rather I think the ML estimator consistency only works for bijections. Which I suppose may not be monotonic in the case of discontinuous functions. – Corvus Feb 09 '13 at 12:22
  • @Corone no, the "ML thing" (I assume you mean the invariance property) works for any function, not just bijections. If $\hat \theta_{ML} = 4$ then the ML estimator of $g(\theta)$ is 1. This is essentially a matter of definition; if $\eta = g(\theta)$ where $g$ is not an injection, we define $L(\eta)$ to be $\sup_{\theta: g(\theta) = \eta} L(\theta)$. With this definition, it is trivial that the invariance property holds for any $g$. – guy Feb 09 '13 at 20:10
  • But why would you define it as such? I can write the likelihood function of $\eta$, so surely the maximum likelihood is most sensibly defined as the value that maximises the likelihood of $\eta$. What you describe is surely $\eta$ at the maximum likelihood, not the maximum likelihood of $\eta$. Maximum likelihood of $\eta$ being one goes against the sense of maximum likelihood. The value of $\eta$ that is most likely, i.e. has most probability density, is zero. – Corvus Feb 09 '13 at 22:04
  • @Corone look up a formal definition of maximum likelihood if you don't believe me. Or, you could try to write down the likelihood as a function of $\eta$ in a sensible way if (say) $\theta$ is the mean of the normal distribution and $\eta = g(\theta)$ for your function $g(\cdot)$ and realize that the likelihood you wrote down wouldn't be able to distinguish between normal distributions having different means. The likelihood isn't a density, so I don't know what you mean when you say "the value of $\eta$ that has the most probability density." – guy Feb 09 '13 at 22:45
  • Yes, I've just been looking it up. Likelihood induced by $\eta$ it is called. You live and learn. I have only ever considered the Bayesian interpretation, where it is just the joint pdf considered as a function of the quantity of interest. I'm not sure I understand your comment about distinguishing the means - I don't see any reason why it should be able to? – Corvus Feb 09 '13 at 23:20
  • Ah no, I get it. The frequentist has a major issue because they have no prior to set the weights between the different likelihoods that all map to the same quantity. Therefore it is simply defined as the largest. I have always considered ML to be MAP with uniform priors. Now I know I'm wrong! – Corvus Feb 10 '13 at 00:06