66

What is the main difference between maximum likelihood estimation (MLE) and least squares estimation (LSE)?

Why can't we use MLE for predicting $y$ values in linear regression and vice versa?

Any help on this topic will be greatly appreciated.

Richard Hardy
evros
  • 10
    You can use MLE in linear regression if you like. This can even make sense if the error distribution is non-normal and your goal is to obtain the "most likely" estimate rather than one which minimizes the sum of squares. – Richard Hardy Mar 27 '15 at 14:57
  • 24
    Under normal error assumption, as is typically assumed in linear regression, the MLE and the LSE are the same! – TrynnaDoStat Mar 27 '15 at 15:07
  • 1
    Search our site for the [Gauss-Markov theorem](http://stats.stackexchange.com/search?q=gauss+markov+theorem). – whuber Mar 27 '15 at 17:41
  • Thanks for all the replies. Now this makes sense. While searching for this topic on the net, I came across this article, which may also help: https://radfordneal.wordpress.com/2008/08/09/inconsistent-maximum-likelihood-estimation-an-ordinary-example/ – evros Mar 27 '15 at 19:17
  • @TrynnaDoStat: No, it's not the case. ML and LSE are not the same. In fact, ML is a different approach which can emulate LSE and other estimators as special cases. See my answer for further details. – nali Sep 13 '15 at 11:00
  • 1
    An answer is also provided at http://stats.stackexchange.com/questions/12562/equivalence-between-least-squares-and-mle-in-gaussian-model. – whuber Dec 12 '15 at 21:40
  • @nali You are wrong to say that ML is a subset of LSE. If you look at the Gauss-Markov theorem you will see that the estimators differ based on the distribution of the data. – Michael R. Chernick Mar 05 '17 at 02:02

3 Answers

33

I'd like to provide a straightforward answer.

What is the main difference between maximum likelihood estimation (MLE) vs. least squares estimation (LSE) ?

As @TrynnaDoStat commented, minimizing the squared error is equivalent to maximizing the likelihood in this case. As Wikipedia puts it,

In a linear model, if the errors belong to a normal distribution the least squares estimators are also the maximum likelihood estimators.

They can be viewed as essentially the same in your case, since the conditions of the least squares method are these four: 1) linearity; 2) normally distributed residuals; 3) constant variance/homoscedasticity; 4) independence.

Let me spell this out a bit. The response variable $y$,
$$y = w^T X + \epsilon \quad\text{where } \epsilon \thicksim N(0,\sigma^2),$$
follows a normal distribution (normal residuals):
$$P(y|w, X)=\mathcal{N}(y|w^TX, \sigma^2I).$$
The likelihood function (using independence) is then

\begin{align} L(y^{(1)},\dots,y^{(N)};w, X^{(1)},\dots,X^{(N)}) &= \prod_{i=1}^N \mathcal{N}(y^{(i)}|w^TX^{(i)}, \sigma^2I) \\ &= \frac{1}{(2\pi)^{\frac{N}{2}}\sigma^N}\exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^N\bigl(y^{(i)}-w^TX^{(i)}\bigr)^2\right). \end{align}

Maximizing $L$ is equivalent to minimizing (since everything else is a constant; this is where homoscedasticity comes in) $$\sum_{i=1}^N\bigl(y^{(i)}-w^TX^{(i)}\bigr)^2.$$ That is exactly the least-squares criterion: the sum of squared differences between the fitted $\hat{Y_i}$ and the observed $Y_i$.
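To make the equivalence concrete, here is a minimal numerical sketch (my own illustration on simulated data, assuming NumPy and SciPy are available): fitting the same data by closed-form least squares and by numerically maximizing the Gaussian log-likelihood recovers the same coefficients.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # design matrix with intercept
w_true = np.array([1.0, 2.0, -0.5, 0.3])
y = X @ w_true + rng.normal(scale=0.7, size=n)                # normally distributed errors

# Least squares: closed-form solution of min_w ||y - Xw||^2
w_lse, *_ = np.linalg.lstsq(X, y, rcond=None)

# MLE: numerically maximize the Gaussian log-likelihood over (w, log sigma)
def neg_log_lik(theta):
    w, log_sigma = theta[:-1], theta[-1]
    sigma2 = np.exp(2.0 * log_sigma)
    resid = y - X @ w
    return 0.5 * n * np.log(2.0 * np.pi * sigma2) + 0.5 * np.sum(resid ** 2) / sigma2

w_mle = minimize(neg_log_lik, np.zeros(X.shape[1] + 1), method="BFGS").x[:-1]

print(np.allclose(w_lse, w_mle, atol=1e-4))   # should print True: the two estimates coincide
```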

Why can't we use MLE for predicting $y$ values in linear regression and vice versa?

As explained above, we are actually (more precisely, equivalently) using MLE when predicting $y$ values in linear regression. If the response variable has some other distribution, such as the Bernoulli distribution or any member of the exponential family, we map the linear predictor to the response distribution using a link function (chosen according to the response distribution), and the likelihood function then becomes the product of the probabilities of all the outcomes after the transformation. In linear regression the link function is simply the identity, since the mean of the response is modeled directly by the linear predictor.
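For the non-normal case mentioned above, here is a hedged sketch (again my own example on simulated data, assuming NumPy/SciPy) of a Bernoulli response with a logit link; least squares is no longer the natural criterion, but MLE still applies directly.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + one covariate
w_true = np.array([-0.5, 1.5])
p_true = 1.0 / (1.0 + np.exp(-(X @ w_true)))             # logit link maps the linear predictor to (0, 1)
y = rng.binomial(1, p_true)                              # Bernoulli response

def neg_log_lik(w):
    eta = X @ w
    # Bernoulli log-likelihood: sum_i [ y_i * eta_i - log(1 + exp(eta_i)) ]
    return -np.sum(y * eta - np.logaddexp(0.0, eta))

w_mle = minimize(neg_log_lik, np.zeros(2), method="BFGS").x
print(w_mle)   # close to w_true; this is logistic regression fitted by MLE
```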

Lerner Zhang
  • 3
    You may want to define "this case" a bit more clearly since in general, maximum likelihood and least squares are not the same thing. – Matthew Gunn Mar 04 '17 at 18:48
  • 2
    @MatthewGunn Yeah, I used "equivalent to" rather than "the same". – Lerner Zhang Mar 05 '17 at 01:29
  • 1
    It would be great if you could give an example where the linear model has a non-normal error distribution, and show how MLE is used in such a case to estimate the best coefficients. If that is not possible, can you at least point us to a source which demonstrates this using models like Poisson regression? – VM_AI May 30 '19 at 12:20
  • @VM_AI I have added the four conditions for the linear model, and I think your request is not possible in this scenario. – Lerner Zhang Oct 10 '20 at 15:08
14

ML is a broader class of estimators which includes least absolute deviations ($L_1$-norm) and least squares ($L_2$-norm) as special cases. Under the hood of ML, the estimators share a wide range of common properties, like the (sadly) non-existent breakdown point. In fact, you can use the ML approach as a substitute to optimize a lot of things, including OLS, as long as you are aware of what you are doing.
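As an illustration of that point (my own sketch on simulated, outlier-contaminated data, assuming NumPy/SciPy): the ML objective under Gaussian errors reduces to the $L_2$-norm, while under Laplace errors it reduces to the $L_1$-norm (least absolute deviations), which is less sensitive to outliers.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.standard_t(df=2, size=n)   # heavy-tailed noise (outliers)

def l2_loss(w):   # negative Gaussian log-likelihood, up to constants
    return np.sum((y - X @ w) ** 2)

def l1_loss(w):   # negative Laplace log-likelihood, up to constants
    return np.sum(np.abs(y - X @ w))

w_l2 = minimize(l2_loss, np.zeros(2), method="Nelder-Mead").x
w_l1 = minimize(l1_loss, np.zeros(2), method="Nelder-Mead").x
print(w_l2, w_l1)   # the L1 (Laplace-ML) fit is typically less distorted by the outliers
```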

The $L_2$-norm goes back to C. F. Gauss and is around 200 years old, while the modern ML approach goes back to (IMHO) Huber (1964). Many scientists are used to the $L_2$-norm and its equations. The theory is well understood, and there are a lot of published papers which can be seen as useful extensions, like:

  • data snooping
  • stochastic parameters
  • weak constraints

Professional applications don't just fit data; they check:

  • whether the parameters are significant
  • whether your dataset contains outliers
  • which outliers can be tolerated because they do not cripple the performance
  • which measurements should be removed because they do not contribute to the degrees of freedom

Also, there is a huge number of specialized statistical hypothesis tests for the least-squares setting. These do not necessarily apply to all ML estimators, or should at least be stated with a proof.

Another mundane point is that the $L_2$-norm is very easy to implement and can be extended to Bayesian regularization or to other algorithms like Levenberg-Marquardt.

Not to forget: performance. Not all least-squares cases, like the Gauss-Markov model $\mathbf{X\beta}=\mathbf{L}+\mathbf{r}$, produce a symmetric positive definite normal-equation matrix $\mathbf{X}^{T}\mathbf{X}$ (whose inverse $(\mathbf{X}^{T}\mathbf{X})^{-1}$ appears in the solution). Therefore I use separate libraries for the $L_2$-norm case, since it is possible to perform special optimizations for this particular case.
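To illustrate the performance remark (a sketch under my own assumptions, using NumPy/SciPy): when $\mathbf{X}^{T}\mathbf{X}$ is symmetric positive definite, the normal equations can be solved with a Cholesky factorization instead of a general-purpose least-squares routine.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 5))                       # full-rank design matrix
L = X @ rng.normal(size=5) + rng.normal(size=1000)   # observations L in X*beta = L + r

XtX = X.T @ X                                        # symmetric positive definite here
beta_chol = cho_solve(cho_factor(XtX), X.T @ L)      # exploits the SPD structure
beta_gen, *_ = np.linalg.lstsq(X, L, rcond=None)     # general least-squares routine

print(np.allclose(beta_chol, beta_gen))              # True: same solution, different algorithm
```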

Feel free to ask for details.

nali
0

Let's derive the equivalence through the Bayesian/PGM approach.

Here is the Bayesian network of linear regression:

[Figure: Bayesian network of linear regression]

We can factorize the joint distribution according to the above graph $\mathcal{G'}$:

$$P(y, w, X) = P(y|w, X)P(w)P(X)$$ Since $P(X)$ is fixed, we obtain $$P(y, w, X) \propto P(y|w, X)P(w).$$

Maximum likelihood is a frequentist notion; from the perspective of Bayesian inference, it is a special case of maximum a posteriori (MAP) estimation that assumes a uniform prior distribution on the parameters. Under that uniform prior we can simply ignore $P(w)$.

Then we get $P(y, w, X) \propto P(y|w, X)$, and we assume $P(y|w, X)=\mathcal{N}(y|w^TX, \sigma^2I)$ because of the normal-residuals assumption.
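Filling in that step explicitly: with a flat prior $P(w) \propto 1$, the MAP estimate reduces to the ML estimate,
$$\hat{w}_{\text{MAP}} = \arg\max_w P(y|w, X)\,P(w) = \arg\max_w P(y|w, X) = \hat{w}_{\text{MLE}},$$
and with $P(y|w, X)=\mathcal{N}(y|w^TX, \sigma^2I)$, maximizing the log-likelihood again amounts to minimizing $\sum_{i=1}^N\bigl(y^{(i)}-w^TX^{(i)}\bigr)^2$.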

Along the same lines as in the answer above, we see that the least-squares method is equivalent to the maximum likelihood method in your case.

Lerner Zhang