Hastie and Tibshirani mention in section 4.3.2 of their book that in the linear regression setting, the least squares approach is in fact a special case of maximum likelihood. How can we prove this result?
PS: Spare no mathematical details.
The linear regression model
$Y = X\beta + \epsilon$, where $\epsilon \sim N(0,I\sigma^2)$
$Y \in \mathbb{R}^{n}$, $X \in \mathbb{R}^{n \times p}$ and $\beta \in \mathbb{R}^{p}$
Note that the model error is ${\bf \epsilon = Y - X\beta}$. Our goal is to find the vector $\beta$ that minimizes the squared $L_2$ norm of this error.
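If it helps to see the model concretely, here is a minimal simulation sketch; the values of $n$, $p$, $\beta$ and $\sigma$ are arbitrary choices for illustration, not anything from the book.

```python
# A minimal sketch of simulating data from Y = X beta + eps, eps ~ N(0, sigma^2 I).
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
beta_true = np.array([1.5, -2.0, 0.5])   # hypothetical true coefficients
sigma = 1.0                              # known, constant error standard deviation

X = rng.normal(size=(n, p))              # design matrix, rows are the x_i
eps = rng.normal(scale=sigma, size=n)    # i.i.d. Gaussian errors with mean 0
Y = X @ beta_true + eps                  # the linear model
```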
Least Squares
Given data $(x_1,y_1),...,(x_n,y_n)$ where each $x_{i}$ is $p$ dimensional, we seek to find:
$$\widehat{\beta}_{LS} = {\underset \beta {\text{argmin}}} ||{\bf \epsilon}||^2 = {\underset \beta {\text{argmin}}} ||{\bf Y - X\beta}||^2 = {\underset \beta {\text{argmin}}} \sum_{i=1}^{n} ( y_i - x_{i}\beta)^2 $$
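As a quick sanity check, here is a sketch showing that the norm form and the explicit sum form of this objective agree; the data and $\beta$ below are arbitrary toy values.

```python
import numpy as np

def rss(beta, X, Y):
    """Residual sum of squares ||Y - X beta||^2, written as a single norm."""
    resid = Y - X @ beta
    return resid @ resid

def rss_sum(beta, X, Y):
    """The same quantity written as the explicit sum over observations."""
    return sum((y_i - x_i @ beta) ** 2 for x_i, y_i in zip(X, Y))

# toy check that the two forms give the same number
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
Y = np.array([1.0, 2.0, 3.0])
beta = np.array([0.5, -0.5])
assert np.isclose(rss(beta, X, Y), rss_sum(beta, X, Y))
```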
Maximum Likelihood
Using the model above, we can set up the likelihood of the data given the parameters $\beta$ as:
$$L(Y|X,\beta) = \prod_{i=1}^{n} f(y_i|x_i,\beta) $$
where $f(y_i|x_i,\beta)$ is the pdf of a normal distribution with mean $x_i\beta$ and variance $\sigma^2$ (equivalently, the error $y_i - x_i\beta$ is normal with mean 0 and variance $\sigma^2$). Plugging it in:
$$L(Y|X,\beta) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(y_i - x_i\beta)^2}{2\sigma^2}}$$
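As a sanity check on the density we just plugged in, the hand-written product above should agree with `scipy.stats.norm` evaluated with mean $x_i\beta$ and standard deviation $\sigma$; a sketch with toy numbers:

```python
import numpy as np
from scipy.stats import norm

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
Y = np.array([1.0, 2.0, 3.0])
beta = np.array([0.5, -0.5])
sigma = 1.0

resid = Y - X @ beta
# likelihood written out by hand, term by term
hand_written = np.prod(np.exp(-resid**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2))
# the same likelihood via the library's Gaussian pdf
library = np.prod(norm.pdf(Y, loc=X @ beta, scale=sigma))
assert np.isclose(hand_written, library)
```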
Now, when dealing with likelihoods it's generally easier to take the log before continuing (products become sums, exponentials simplify), so let's do that.
$$\log L(Y|X,\beta) = \sum_{i=1}^{n} \left[ \log\!\left(\frac{1}{\sqrt{2\pi\sigma^2}}\right) -\frac{(y_i - x_i\beta)^2}{2\sigma^2} \right]$$
Since we want the maximum likelihood estimate, we want to find the maximum of the equation above, with respect to $\beta$. The first term doesn't impact our estimate of $\beta$, so we can ignore it:
$$ \widehat{\beta}_{MLE} = {\underset \beta {\text{argmax}}} \sum_{i=1}^{n} -\frac{(y_i - x_i\beta)^2}{2\sigma^2}$$
Note that the denominator $2\sigma^2$ is a constant with respect to $\beta$, so it does not affect the argmax. Finally, notice the negative sign in front of the sum: maximizing a negative quantity is the same as minimizing that quantity without the negative sign. In other words:
$$ \widehat{\beta}_{MLE} = {\underset \beta {\text{argmin}}} \sum_{i=1}^{n} (y_i - x_i\beta)^2 = \widehat{\beta}_{LS}$$
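To see the equivalence numerically, here is a sketch that minimizes the negative log-likelihood with `scipy.optimize` (treating $\sigma$ as known and fixed) and compares the result to the least-squares fit from `np.linalg.lstsq` on simulated data; the true coefficients are made up for the example.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, p = 200, 3
beta_true = np.array([1.0, -0.5, 2.0])   # hypothetical coefficients
sigma = 1.0
X = rng.normal(size=(n, p))
Y = X @ beta_true + rng.normal(scale=sigma, size=n)

def neg_log_lik(beta):
    """Negative of the log-likelihood derived above, with sigma held fixed."""
    resid = Y - X @ beta
    return n * np.log(np.sqrt(2 * np.pi * sigma**2)) + resid @ resid / (2 * sigma**2)

beta_mle = minimize(neg_log_lik, x0=np.zeros(p)).x   # numerical MLE
beta_ls, *_ = np.linalg.lstsq(X, Y, rcond=None)      # least-squares solution
assert np.allclose(beta_mle, beta_ls, atol=1e-4)
```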
Recall that for this to work we had to make specific model assumptions: the errors are independent, normally distributed with mean 0 and constant variance $\sigma^2$. Under those assumptions, least squares is equivalent to maximum likelihood.
For completeness, note that the minimizer has a closed form, obtained by setting the gradient of $||{\bf Y - X\beta}||^2$ with respect to $\beta$ to zero (the normal equations ${\bf X^TX\beta = X^TY}$):
$$\widehat{\beta} = {\bf (X^TX)^{-1}X^TY},$$
provided ${\bf X^TX}$ is invertible.
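A quick sketch verifying the closed form against `np.linalg.lstsq` on simulated data (in practice you would not form the explicit inverse; this is only to check the formula, and the data are again made up):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))
Y = X @ np.array([2.0, 0.0, -1.0, 3.0]) + rng.normal(size=50)

# closed-form solution (X^T X)^{-1} X^T Y
beta_closed_form = np.linalg.inv(X.T @ X) @ X.T @ Y
# numerically stable least-squares solve for comparison
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
assert np.allclose(beta_closed_form, beta_lstsq)
```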