
Generally speaking, what are the differences between an MLE and a MAP estimator?

If I wanted to improve the performance of a model, how would these differences come into play? Are there specific assumptions about the model or the data that would cause one to be favored over the other?

MLNewbie
    It's no more complicated than that one is a likelihood-based procedure and the other is a Bayesian procedure. "Updating a model" does not require a Bayesian analysis, though quantifying the previous model state as a prior is convenient. For MLE, you simply condition on the previous data that informed the model state. For MAP, it's not as straightforward to come up with interval estimates, though several methods have been proposed and most are good. – AdamO Mar 16 '21 at 20:53

3 Answers


MLE is informed entirely by the likelihood, while MAP is informed by both the prior and the likelihood. Both methods return point estimates for parameters via calculus-based optimization. MLE comes from frequentist statistics, where practitioners let the likelihood "speak for itself," whereas MAP comes from Bayesian statistics, where prior beliefs (usually informed by domain knowledge about the parameters) effectively regularize the point estimate.
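To make the "prior as regularizer" point concrete, here is a minimal Python sketch (my own toy example, not from any particular library's API): the MLE minimizes the negative log-likelihood alone, while the MAP minimizes the negative log-likelihood plus a negative log-prior penalty.

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=20)  # small sample, true mean = 2

def neg_log_lik(mu):
    # MLE objective: likelihood only
    return -np.sum(stats.norm.logpdf(data, loc=mu, scale=1.0))

def neg_log_post(mu):
    # MAP objective: likelihood plus a N(0, 1) log-prior acting as a penalty
    return neg_log_lik(mu) - stats.norm.logpdf(mu, loc=0.0, scale=1.0)

mle = optimize.minimize_scalar(neg_log_lik).x
map_est = optimize.minimize_scalar(neg_log_post).x
print(mle, map_est)  # MAP is pulled from the sample mean toward the prior mean 0
```

With a Gaussian prior, the negative log-prior term is proportional to $\mu^2$, so the MAP here behaves exactly like an L2 (ridge) regularized version of the MLE.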

Note: MAP, while Bayesian, is atypical of Bayesian philosophy. Bayesian statistics generally treats parameters themselves as random variables with full distributions, as opposed to point estimates. Sampling techniques such as MCMC, or newer methods like variational inference, can help approximate that distribution.

jbuddy_13

By way of example, consider a binomial model with beta prior $\theta\sim Beta(\alpha,\beta)$ for which we have observed $k$ successes in $n$ attempts. The MLE is $\hat\theta=k/n$.

The posterior is well-known to be
$$
\pi(\theta\mid y) \propto \theta^{\alpha+k-1}\left(1-\theta\right)^{\beta+n-k-1}.
$$
Thus $\theta\mid y\sim Beta(\alpha+k,\beta+n-k)$.

Hence, by the properties of the beta distribution, the posterior mode (which is exactly the MAP) is
$$
\mathrm{MAP}=\operatorname{Mode}(\theta\mid y)=\frac{\alpha+k-1}{\alpha+\beta+n-2}.
$$
Also note that for a uniform prior, $\alpha=\beta=1$, the MAP and the MLE are identical.
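As a quick numerical sanity check of these formulas (a Python sketch with made-up values $\alpha=\beta=2$, $k=7$, $n=10$; not part of the original answer):

```python
import numpy as np
from scipy import stats

alpha, beta, k, n = 2.0, 2.0, 7, 10

mle = k / n                                         # theta_hat = k/n
map_est = (alpha + k - 1) / (alpha + beta + n - 2)  # closed-form posterior mode

# cross-check against the mode of the Beta(alpha + k, beta + n - k) posterior
posterior = stats.beta(alpha + k, beta + n - k)
grid = np.linspace(0.001, 0.999, 100_000)
numeric_mode = grid[np.argmax(posterior.pdf(grid))]

print(mle, map_est, numeric_mode)  # 0.7, 0.666..., ~0.667
```

Setting `alpha = beta = 1` in the same script makes `map_est` equal to `mle`, matching the uniform-prior remark above.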

Christoph Hanck

The difference lies in how frequentists and Bayesians treat statistics.

Generally speaking, Maximum Likelihood Estimation (MLE) finds the set of parameters that maximizes the likelihood function of a given probability density function (pdf) or probability mass function (pmf). When doing statistical inference, frequentists think of a model with a fixed number of parameters to fit the data. From their point of view, these parameters are unknown but fixed quantities. They are estimated using point-estimation methods (MLE, method of moments, EM).
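As an illustration of this view (my own sketch, using an exponential model whose MLE has the known closed form $\hat\lambda=1/\bar{x}$): the parameter is a fixed unknown, and MLE simply searches for the value that maximizes the likelihood of the observed data.

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(1)
data = rng.exponential(scale=2.0, size=100)  # true rate lambda = 0.5

def neg_log_lik(rate):
    # negative log-likelihood of an exponential model with the given rate
    return -np.sum(stats.expon.logpdf(data, scale=1.0 / rate))

res = optimize.minimize_scalar(neg_log_lik, bounds=(1e-6, 10.0), method="bounded")
print(res.x, 1.0 / data.mean())  # numerical MLE matches the closed form 1/mean
```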

Maximum A Posteriori (MAP) estimation comes from a Bayesian point of view. Instead of viewing the parameters as fixed points, Bayesians treat the parameters in the model as random variables that follow a prior distribution. Roughly speaking, we have a prior belief about which distribution these parameters come from (normal, beta, etc.). Once new data come in, we update our prior belief, leading to a posterior belief; in other words, we now have a better idea of which distribution the parameters come from. In statistics, this posterior belief is called the "posterior distribution." Once we have obtained the posterior distribution, the MAP is simply its mode. Intuitively, you can think of the MAP as the single most likely point estimate under the posterior distribution.

One caveat of the MAP method is that it only considers that most likely point, without taking into account the other parameter values in the posterior distribution, which discards a great deal of the useful information the posterior contains. A better, yet computationally expensive, approach is full Bayesian inference, which integrates over all parameter values using Bayes' formula. In cases where it is impossible to derive a closed-form expression for the integral, Markov chain Monte Carlo (MCMC) is often used to tackle the problem.
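Here is a small sketch of that caveat (a toy coin-flip posterior I made up for illustration): the MAP is a single point, while the full posterior also tells you about uncertainty, which summaries such as the posterior mean and a credible interval capture but the MAP discards.

```python
import numpy as np
from scipy import stats

# posterior for a coin's bias after 3 heads in 4 flips, with a Beta(1, 1) prior
posterior = stats.beta(1 + 3, 1 + 1)  # Beta(4, 2)

grid = np.linspace(0.001, 0.999, 10_000)
map_est = grid[np.argmax(posterior.pdf(grid))]  # the single most likely point
post_mean = posterior.mean()                    # uses the whole distribution
ci = posterior.ppf([0.025, 0.975])              # 95% credible interval

print(map_est, post_mean, ci)
# MAP ~= 0.75, but the posterior mean is ~0.67 and the interval is wide;
# the MAP alone conveys none of that spread
```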

Regarding your second question: if your assumed prior distribution closely resembles the "true" distribution the model parameters come from, MAP will give a better point estimate. If the prior is badly off, MLE will tend to perform better than MAP.