@gunes answered your question (+1), but it might be worth adding why you so often see people maximizing the likelihood $P(D|\theta)$ rather than the posterior $P(\theta|D)$. The likelihood is a probability distribution that describes your data, parametrized by some parameter $\theta$. You can try different values of the parameter and find the distribution that "fits best" to the data
$$
\hat\theta_\text{MLE} = \underset{\theta}{\operatorname{arg\,max}} \; P(D|\theta)
$$
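As a minimal sketch of this idea, here is a grid search for the MLE in a coin-flip (Bernoulli) model; the data `flips` and the grid of candidate $\theta$ values are made up purely for illustration.

```python
# Maximum likelihood by brute force for a Bernoulli model (illustrative only).
import numpy as np
from scipy import stats

flips = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])   # observed data D (1 = heads)

thetas = np.linspace(0.01, 0.99, 99)                # candidate values of theta
log_lik = [stats.bernoulli.logpmf(flips, p=t).sum() for t in thetas]

theta_mle = thetas[np.argmax(log_lik)]              # arg max of P(D|theta)
print(theta_mle)  # close to the sample mean, 7/10 = 0.7
```

(The log-likelihood is used only for numerical convenience; its arg max is the same as that of the likelihood.)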
You cannot do the same for $P(\theta|D)$, because you didn't observe any $\theta$, so you cannot really tell that some value of $\theta$ has greater probability than another. The data $D$ is fixed, so you cannot check "what would happen if the data were different", as you do when maximizing the likelihood. Moreover, what would the distribution $P$ be here? How would you choose the distribution of your parameter? How would you know that it fits $\theta$, given that you never observed any $\theta$? There isn't much that can be done here to estimate this distribution directly.
However, Thomas Bayes found one simple trick, Bayes' theorem, which shows how, given the likelihood and a prior $P(\theta)$, we can "flip" the sides of the conditional probability and obtain the posterior
$$
P(\theta|D) = \frac{P(D|\theta)\,P(\theta)}{P(D)} \propto P(D|\theta)\,P(\theta)
$$
which you can then maximize
$$
\hat\theta_\text{MAP} = \underset{\theta}{\operatorname{arg\,max}} \; P(D|\theta)\,P(\theta)
$$
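Continuing the coin-flip sketch above, MAP only changes the objective by adding the log of the prior; the Beta(2, 2) prior below is an assumption made here purely for illustration.

```python
# MAP estimate for the same coin-flip data, with an assumed Beta(2, 2) prior.
import numpy as np
from scipy import stats

flips = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
thetas = np.linspace(0.01, 0.99, 99)

log_lik   = np.array([stats.bernoulli.logpmf(flips, p=t).sum() for t in thetas])
log_prior = stats.beta.logpdf(thetas, a=2, b=2)     # log P(theta)

theta_map = thetas[np.argmax(log_lik + log_prior)]  # arg max of P(D|theta) P(theta)
print(theta_map)  # pulled slightly toward 0.5 compared with the MLE of 0.7
```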
There is only one catch: you don't know the prior $P(\theta)$ either! The solution is to assume some prior distribution, the one that is most reasonable given our best knowledge (or just a guess), and hope that the information in the data will overwhelm the prior. On the other hand, in cases where we do have reasonable prior information, the prior can compensate for not having enough data. For more details check other questions tagged as bayesian.
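To see the "data overwhelms the prior" claim numerically, you can compare the MLE and MAP estimates for small and large samples; the numbers below are simulated and purely illustrative.

```python
# The same assumed Beta(2, 2) prior matters less and less as the sample grows.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
thetas = np.linspace(0.01, 0.99, 99)
log_prior = stats.beta.logpdf(thetas, a=2, b=2)

for n in (10, 10_000):
    flips = rng.binomial(1, 0.7, size=n)            # simulated coin, true theta = 0.7
    log_lik = np.array([stats.bernoulli.logpmf(flips, p=t).sum() for t in thetas])
    print(n, thetas[np.argmax(log_lik)], thetas[np.argmax(log_lik + log_prior)])
# With n = 10 the MLE and MAP can differ visibly; with n = 10_000 they agree
# (both near 0.7), because the likelihood dominates the fixed prior.
```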