There are two approaches by which the solution to multiple linear regression is obtained. One is maximum likelihood, in which it is assumed that the error $\epsilon$ is normally distributed; the other is to directly set the derivative of the residual sum of squares (RSS) to zero. Why do the two approaches give the same solution? The MLE approach is more restrictive than the RSS approach, yet the solution is still the same. What am I missing?
1 Answer
The solution is not merely similar; under the relevant assumptions, it is identical.
We can see this directly from the likelihood:
$\mathcal{L}(\beta) = \frac{1}{(\sqrt{2\pi}\sigma)^n} e^{-\frac{1}{2\sigma^2} (y-X\beta)^\top(y-X\beta)}$
$\mathcal{L}$ is maximized when $-2\log \mathcal{L}$ is minimized.
So $-2\log \mathcal{L}=k+n\log(\sigma^2)+\frac{1}{\sigma^2} (y-X\beta)^\top(y-X\beta)$.
Now for any particular $\sigma^2$, this is minimized when $(y-X\beta)^\top(y-X\beta)$ is minimized ... which is exactly the thing least squares minimizes.
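For concreteness: for a fixed $\sigma^2$, both routes lead to the same normal equations $X^\top X\hat\beta = X^\top y$, hence the same $\hat\beta$. Here's a quick numerical sketch of that (my own illustration, not part of the derivation above, assuming `numpy` and `scipy` are available): it fits $\hat\beta$ once by least squares and once by numerically maximizing the Gaussian likelihood, and the two estimates coincide.

```python
# Sketch: least-squares beta-hat vs. Gaussian-MLE beta-hat on simulated data.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])  # design with intercept
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=1.5, size=n)

# (a) Least squares: minimize (y - X b)'(y - X b) directly.
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# (b) Gaussian MLE: minimize -2 log L = const + n*log(s2) + (1/s2) * RSS
#     over beta and log(s2) jointly.
def neg2loglik(theta):
    beta, log_s2 = theta[:-1], theta[-1]
    resid = y - X @ beta
    return n * log_s2 + np.exp(-log_s2) * (resid @ resid)

fit = minimize(neg2loglik, x0=np.zeros(X.shape[1] + 1), method="BFGS")
beta_mle = fit.x[:-1]

print(np.round(beta_ls, 4))
print(np.round(beta_mle, 4))  # same estimates, up to optimizer tolerance
```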
Perhaps it would be clearer if considered in terms of the error term.
Let $D=\sum_i d(\epsilon_i)$ be some loss criterion, where $d$ is some function of the errors $\epsilon_i=y_i-\mu_i$ (and $\mu_i$ is in turn some function of the parameters in our model; e.g. for a regression, $\mu_i=\mathbf{x}_i\beta$, where $\mathbf{x}_i$ is the $i$-th row of $X$). Here we'll take $\sigma^2=1$ for simplicity of exposition.
For example, we might consider $d_1=|\epsilon|$ or $d_2=\frac12 \epsilon^2$.
Now consider we have some density for the errors, $f(\epsilon)\propto e^{-g(\epsilon)}$.
Then the likelihood for the parameters, $\mathcal{L} \propto e^{-\sum_i g(\epsilon_i)}$, is maximized when $\sum_i g(\epsilon_i)$ is minimized.
Meanwhile, the loss function is minimized when $\sum_i d(\epsilon_i)$ is minimized.
That is, if the density $f$ is such that $g=d$, then minimizing $d$ is identical to MLE, because the thing we want to minimize is literally built into the density so that it maximizes the likelihood. [The same argument applies to $g$ being a monotonic-increasing transformation of $d$.]
The Gaussian density has $g=d_2$ -- it has the form
$f(\epsilon)\propto e^{-\frac12 \epsilon^2}$
So the likelihood is, up to a scaling factor, $e^{-\frac12 \sum_i\epsilon_i^2}$.
The least squares criterion is literally built into the Gaussian density.
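As a companion illustration (mine, not part of the original answer): the same correspondence pairs $d_1=|\epsilon|$ with the Laplace density $f(\epsilon)\propto e^{-|\epsilon|}$, so least absolute deviations is the MLE under Laplace errors, just as least squares is the MLE under Gaussian errors. A minimal sketch for a location model $\mu_i=m$, assuming `numpy`/`scipy`:

```python
# Sketch: minimizing sum_i d(eps_i) vs. maximizing the matching likelihood,
# for a location model y_i = m + eps_i.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
y = rng.normal(loc=3.0, scale=2.0, size=101)

def argmin_loss(d):
    # minimize the loss sum_i d(y_i - m) over m
    return minimize_scalar(lambda m: np.sum(d(y - m))).x

def argmax_loglik(logf):
    # maximize the log-likelihood sum_i log f(y_i - m) over m
    return minimize_scalar(lambda m: -np.sum(logf(y - m))).x

d2 = lambda e: 0.5 * e**2                                       # least squares loss
d1 = lambda e: np.abs(e)                                        # least absolute deviations
gauss_logpdf = lambda e: -0.5 * e**2 - 0.5 * np.log(2 * np.pi)  # g = d2 (+ const)
laplace_logpdf = lambda e: -np.abs(e) - np.log(2.0)             # g = d1 (+ const)

print(round(argmin_loss(d2), 4), round(argmax_loglik(gauss_logpdf), 4), round(y.mean(), 4))
print(round(argmin_loss(d1), 4), round(argmax_loglik(laplace_logpdf), 4), round(np.median(y), 4))
```

In each row the loss minimizer and the MLE agree: the sample mean for the squared-error/Gaussian pair, the sample median for the absolute-error/Laplace pair.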
There's no general relationship between least squares and ML. There is a relationship in the particular case of ML for the mean of a Gaussian density, because there the likelihood is a monotonically decreasing function of the least squares loss function.

- But this doesn't explain my doubt. I don't understand why it happens. I understand the mathematical derivation of both solutions. Is it a coincidence, or is there a more general relation between MLE with Gaussian assumptions and RSS? – Abhinav Gupta Dec 12 '15 at 17:20
- It's the *opposite* of a coincidence. I've just now tried explaining the same thing a different way. I really can't conceive of any way to be plainer than that. – Glen_b Dec 12 '15 at 21:10