
As the title says: given that $\hat{\theta}$ is the maximum likelihood estimate of a parameter $\theta$, how can one prove that the maximum likelihood estimate of $g(\theta)$ is $g(\hat{\theta})$?

Additionally, does this property also apply to Bayesian estimates?

My attempt: given that $\hat{\theta}$ is the maximum likelihood estimate, it maximises the likelihood function $f(x_1, x_2, \dots, x_n; \theta)$, so $\frac{\partial f(x_1, x_2, \dots, x_n; \theta)}{\partial \theta}\big|_{\theta=\hat{\theta}}=0$. From there I would like to show that $\frac{\partial f(x_1, x_2, \dots, x_n; \theta)}{\partial g(\theta)}\big|_{g(\theta)=g(\hat{\theta})}=0$.

thnghh
  • If $g$ is a bijection, this is somewhat obvious. If $g$ is not a bijection, this is more of a convention. Note that the density of the sample parameterised by $g(\theta)$ is not the transform by $g(\cdot)$ of the density of the sample parameterised by $\theta$. – Xi'an Nov 04 '20 at 10:08
  • This is a property called Invariance of the MLE. – igorkf Nov 04 '20 at 16:34

1 Answer


Note: for the proof below to work you need to assume that the function $g$ is monotonic (and note that for non-monotonic transformations a proof along these lines might not always be possible).


Proof using chain rule

Let's consider for simplicity the likelihood function as a function of a single variable:

$$\mathcal{L}(\theta \vert x_1,x_2, \dots, x_n) = h(\theta)$$

If instead of $\theta$ we use a different parameter $\eta$, related to it by $\theta = g(\eta)$, then the new likelihood is

$$\mathcal{L}(\eta \vert x_1,x_2, \dots, x_n) = h(g(\eta)) = H(\eta)$$

Its derivative is found with the chain rule:

$$ H'(\eta) = h'(g(\eta)) \cdot g'(\eta)$$

This is zero either when $g'(\eta)$ is zero (a possibility we exclude by restricting ourselves to monotonic transformations $g$), or when $h'(g(\eta))$ is zero.

So if $\theta_{ML}$ is the parameter value such that $h'(\theta_{ML}) = 0$, then $h'(g(\eta))$ is zero when $g(\eta) = \theta_{ML}$, i.e. the maximum in the new parameterisation occurs at $\eta_{ML} = g^{-1}(\theta_{ML})$, so the estimate transforms along with the reparameterisation.
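
A minimal numerical sketch of this invariance (the exponential sample, the reparameterisation $\mu = 1/\lambda$, and the use of `scipy.optimize.minimize_scalar` are illustration choices, not part of the argument above): maximising the same likelihood once over the rate $\lambda$ and once over the mean $\mu = 1/\lambda$ gives maximisers that are related by the transformation.

```python
# Sketch: exponential sample, likelihood maximised once over the rate lambda
# and once over the reparameterised mean mu = 1/lambda.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=200)     # simulated data with true mean 2

def negloglik_rate(lam):
    # negative exponential log-likelihood in the rate parameterisation
    return -(len(x) * np.log(lam) - lam * x.sum())

def negloglik_mean(mu):
    # the same likelihood expressed in the mean parameterisation mu = 1/lambda
    return negloglik_rate(1.0 / mu)

lam_hat = minimize_scalar(negloglik_rate, bounds=(1e-6, 10.0), method="bounded").x
mu_hat = minimize_scalar(negloglik_mean, bounds=(1e-6, 10.0), method="bounded").x

print(lam_hat, 1.0 / mu_hat)   # both approximately 1/mean(x): mu_hat = 1/lam_hat
```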


Intuitive graph

The following graph may help.

When we express the function $f(x)$ in terms of a different parameter $t$ (in the example, $x = 0.1/t$), it is like stretching and reshaping the graph along the $x$-coordinate, but the peak remains at the same value.

The stretching changes the slope according to the chain rule used above, but at the peak the slope (which is equal to zero) remains the same.

[Figure: the same curve plotted against the original coordinate $x$ and against the reparameterised coordinate $t = 0.1/x$; the peak height is unchanged.]

This graph is inspired by this Q&A, which is about the transformation of a probability density function. A probability density does not transform like the likelihood function: it picks up an additional (Jacobian) factor, which means its peak can end up at a different location.
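
As a rough numerical illustration of this difference (the Gaussian-shaped curve and the map $\eta = e^{\theta}$ are arbitrary choices for this sketch): treating the same curve once as a likelihood and once as a density shows the likelihood's peak being carried over by the map, while the density's mode shifts because of the Jacobian factor $1/\eta$.

```python
# Sketch: a Gaussian-shaped curve h(theta) treated once as a likelihood and
# once as a density, under the change of variable eta = exp(theta).
import numpy as np

theta = np.linspace(-3.0, 3.0, 2001)
h = np.exp(-theta**2 / 2)          # peak at theta = 0

eta = np.exp(theta)                # reparameterisation eta = exp(theta)
H = np.exp(-np.log(eta)**2 / 2)    # likelihood in eta: H(eta) = h(log(eta))
p = H / eta                        # density of eta: extra Jacobian factor 1/eta

print(theta[np.argmax(h)])         # ~0.0
print(eta[np.argmax(H)])           # ~1.0 = exp(0): the peak is carried over by the map
print(eta[np.argmax(p)])           # ~exp(-1) ~ 0.37: the Jacobian shifts the mode
```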

Sextus Empiricus
  • This is when assuming the reparameterisation is bijective. (Note that the $g$ in your answer is not the $g$ in the question, which corresponds to $h^{-1}$ in your answer.) – Xi'an Nov 04 '20 at 10:52
  • @Xi'an aside from bijectivity we also need the derivative $h'(\eta)$ to exist. I just went for monotonic. – Sextus Empiricus Nov 04 '20 at 11:15
  • @SextusEmpiricus your proof seems clear to me. However, it's a little confusing that you let $g(\eta)=\theta$ instead of $g(\theta)=\eta$ as in the question. Can I just let $g$ become $g^{-1}$, as Xi'an said? Also, is there any way to prove the same property for Bayesian estimation? – thnghh Nov 04 '20 at 13:12
  • @thnghh my reason to use $g(\eta) = \theta$ instead of $g(\theta) = \eta$ is because it is prettier to write $$H'(\eta) = h^\prime(g(\eta)) \cdot g^\prime(\eta)$$ instead of $$H'(\eta) = h^\prime(g^{-1}(\eta)) \cdot {g^{-1}}^\prime (\eta)$$ – Sextus Empiricus Nov 04 '20 at 14:44
  • @SextusEmpiricus the first half, proving the property for ML estimation, is clear now. Thank you! I'm working on the other half, the Bayesian case, now. – thnghh Nov 04 '20 at 17:56
  • @thnghh If it's about something like the [maximum a posteriori (MAP) estimate](https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation), then the answer is: *'No, the MAP estimate does not transform like the maximum likelihood estimate, and the property that you wish to prove is false'*. This is because probability densities (and the posterior is a probability density) do not transform like the likelihood function. You can see [here](https://stats.stackexchange.com/a/445930) for an example. – Sextus Empiricus Nov 04 '20 at 18:01
  • For discrete probability distributions, it might still work. – Sextus Empiricus Nov 04 '20 at 18:05
  • I think Bayesian estimation is a little bit different from MAP estimation. Instead of finding the $\theta$ that maximizes the posterior probability, we need to find its mean (more info at https://en.m.wikipedia.org/wiki/Bayes_estimator). But if it's a false property, is there any condition on $g$ or $\theta$ that can make it true? – thnghh Nov 04 '20 at 18:23
  • @thnghh Ah, I see. When you used the term 'Bayesian estimate' I thought of the more general term. The 'Bayes estimate' that you refer to will indeed be invariant under the transformation in a way similar to the invariance of the maximum likelihood estimate. This is *if* a specific loss function is minimized and the transformation does not influence it. – Sextus Empiricus Nov 04 '20 at 20:13
  • Say you wish to minimize the mean squared error of the estimate $\hat{\theta}$; then the Bayes estimate is the mean of the posterior (see [here](https://en.m.wikipedia.org/wiki/Bayes_estimator#Minimum_mean_square_error_estimation)). Note that a function of the mean is generally not the same as the mean of a function, $E(g(\theta)) \neq g\left( E(\theta) \right)$, so it depends on whether the loss function remains the same when you perform the transformation. It matters whether your loss function still takes $\theta$ as input or $g(\theta)$. – Sextus Empiricus Nov 04 '20 at 20:24 (a numerical sketch of this point follows after the comments)
  • @SextusEmpiricus thank you a lot, it's proved now – thnghh Nov 05 '20 at 14:06
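
Following up on the comment thread, a quick Monte Carlo sketch (the Gamma "posterior" and the transformation $g(\theta) = 1/\theta$ are hypothetical choices for illustration) of why the posterior-mean Bayes estimate is not invariant in general:

```python
# Sketch: posterior mean of g(theta) versus g applied to the posterior mean.
import numpy as np

rng = np.random.default_rng(1)
theta = rng.gamma(shape=3.0, scale=1.0, size=1_000_000)  # draws from a hypothetical posterior

print(np.mean(1.0 / theta))   # E[1/theta] = 1/(shape - 1) = 0.5 for this Gamma
print(1.0 / np.mean(theta))   # 1/E[theta] = 1/3 -- a different estimate
```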