There are two approaches by which the solution to multiple linear regression is obtained. One is maximum likelihood, in which it is assumed that the error $\epsilon$ is normally distributed; the other is to directly set the derivative of the residual sum of squares (RSS) to zero. Why do the two approaches give the same solution? The MLE approach is more restrictive than the RSS approach, yet the solution is still the same. What am I missing?
1 Answer
The solution is not merely similar; under the relevant assumptions, it is identical.
We can see this directly from the likelihood:
$\mathcal{L}(\beta) = \frac{1}{(\sqrt{2\pi}\sigma)^n} e^{-\frac{1}{2\sigma^2} (y-X\beta)^\top(y-X\beta)}$
$\mathcal{L}$ is maximized when $-2\log \mathcal{L}$ is minimized.
So $-2\log \mathcal{L}=k+n\log(\sigma^2)+\frac{1}{\sigma^2} (y-X\beta)^\top(y-X\beta)$.
Now for any particular $\sigma^2$, this is minimized when $(y-X\beta)^\top(y-X\beta)$ is minimized ... which is exactly the thing least squares minimizes.
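For concreteness: for a fixed $\sigma^2$, both routes lead to the same normal equations $X^\top X\hat\beta = X^\top y$, hence the same $\hat\beta$. Here's a quick numerical sketch of that (my own illustration, not part of the derivation above, assuming `numpy` and `scipy` are available): it fits $\hat\beta$ once by least squares and once by numerically maximizing the Gaussian likelihood, and the two estimates coincide.

```python
# Sketch: least-squares beta-hat vs. Gaussian-MLE beta-hat on simulated data.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])  # design with intercept
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=1.5, size=n)

# (a) Least squares: minimize (y - X b)'(y - X b) directly.
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# (b) Gaussian MLE: minimize -2 log L = const + n*log(s2) + (1/s2) * RSS
#     over beta and log(s2) jointly.
def neg2loglik(theta):
    beta, log_s2 = theta[:-1], theta[-1]
    resid = y - X @ beta
    return n * log_s2 + np.exp(-log_s2) * (resid @ resid)

fit = minimize(neg2loglik, x0=np.zeros(X.shape[1] + 1), method="BFGS")
beta_mle = fit.x[:-1]

print(np.round(beta_ls, 4))
print(np.round(beta_mle, 4))  # same estimates, up to optimizer tolerance
```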
Perhaps it would be clearer if considered in terms of the error term.
Let $D=\sum_i d(\epsilon_i)$ be some loss criterion, where $d$ is some function of the errors $\epsilon_i=y_i-\mu_i$ (and $\mu_i$ is in turn some function of the parameters in our model; e.g. for a regression, $\mu_i=\mathbf{x}_i\beta$, where $\mathbf{x}_i$ is the $i$-th row of $X$). Here we'll take $\sigma^2=1$ for simplicity of exposition.
For example, we might consider $d_1=|\epsilon|$ or $d_2=\frac12 \epsilon^2$.
Now consider we have some density for the errors, $f(\epsilon)\propto e^{-g(\epsilon)}$.
Then the likelihood for the parameters, $\mathcal{L} \propto e^{-\sum_i g(\epsilon_i)}$, is maximized when $\sum_i g(\epsilon_i)$ is minimized.
Meanwhile, the loss function is minimized when $\sum_i d(\epsilon_i)$ is minimized.
That is, if the density $f$ is such that $g=d$, then minimizing $d$ is identical to MLE, because the thing we want to minimize is literally built into the density so that it maximizes the likelihood. [The same argument applies to $g$ being a monotonic-increasing transformation of $d$.]
The Gaussian density has $g=d_2$ -- it has the form
$f(\epsilon)\propto e^{-\frac12 \epsilon^2}$
So the likelihood is, up to a scaling factor, $e^{-\frac12 \sum_i\epsilon_i^2}$.
The least squares criterion is literally built into the Gaussian density.
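As a companion illustration (mine, not part of the original answer): the same correspondence pairs $d_1=|\epsilon|$ with the Laplace density $f(\epsilon)\propto e^{-|\epsilon|}$, so least absolute deviations is the MLE under Laplace errors, just as least squares is the MLE under Gaussian errors. A minimal sketch for a location model $\mu_i=m$, assuming `numpy`/`scipy`:

```python
# Sketch: minimizing sum_i d(eps_i) vs. maximizing the matching likelihood,
# for a location model y_i = m + eps_i.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
y = rng.normal(loc=3.0, scale=2.0, size=101)

def argmin_loss(d):
    # minimize the loss sum_i d(y_i - m) over m
    return minimize_scalar(lambda m: np.sum(d(y - m))).x

def argmax_loglik(logf):
    # maximize the log-likelihood sum_i log f(y_i - m) over m
    return minimize_scalar(lambda m: -np.sum(logf(y - m))).x

d2 = lambda e: 0.5 * e**2                                       # least squares loss
d1 = lambda e: np.abs(e)                                        # least absolute deviations
gauss_logpdf = lambda e: -0.5 * e**2 - 0.5 * np.log(2 * np.pi)  # g = d2 (+ const)
laplace_logpdf = lambda e: -np.abs(e) - np.log(2.0)             # g = d1 (+ const)

print(round(argmin_loss(d2), 4), round(argmax_loglik(gauss_logpdf), 4), round(y.mean(), 4))
print(round(argmin_loss(d1), 4), round(argmax_loglik(laplace_logpdf), 4), round(np.median(y), 4))
```

In each row the loss minimizer and the MLE agree: the sample mean for the squared-error/Gaussian pair, the sample median for the absolute-error/Laplace pair.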
There's no general relationship between least squares and ML. There is a relationship in the particular case of ML for the mean of a Gaussian density, because there the likelihood is a monotonically decreasing function of the least squares loss function.

- But this doesn't explain my doubt. I don't understand why it happens. I understand the mathematical derivation of both solutions. Is it a coincidence, or is there a more general relation between MLE with Gaussian assumptions and RSS? – Abhinav Gupta Dec 12 '15 at 17:20
- It's the *opposite* of a coincidence. I've just now tried explaining the same thing a different way. I really can't conceive of any way to be plainer than that. – Glen_b Dec 12 '15 at 21:10