Hastie and Tibshirani mention in section 4.3.2 of their book that in the linear regression setting, the least squares approach is in fact a special case of maximum likelihood. How can we prove this result?
PS: Spare no mathematical details.
The linear regression model
$Y = X\beta + \epsilon$, where $\epsilon \sim N(0,I\sigma^2)$
$Y \in \mathbb{R}^{n}$, $X \in \mathbb{R}^{n \times p}$ and $\beta \in \mathbb{R}^{p}$
Note that the model error is ${\bf \epsilon = Y - X\beta}$. Our goal is to find the vector $\beta$ that minimizes the squared $L_2$ norm of this error.
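If it helps to see the model concretely, here is a minimal simulation sketch; the values of $n$, $p$, $\beta$ and $\sigma$ are arbitrary choices for illustration, not anything from the book.

```python
# A minimal sketch of simulating data from Y = X beta + eps, eps ~ N(0, sigma^2 I).
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
beta_true = np.array([1.5, -2.0, 0.5])   # hypothetical true coefficients
sigma = 1.0                              # known, constant error standard deviation

X = rng.normal(size=(n, p))              # design matrix, rows are the x_i
eps = rng.normal(scale=sigma, size=n)    # i.i.d. Gaussian errors with mean 0
Y = X @ beta_true + eps                  # the linear model
```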
Least Squares
Given data $(x_1,y_1),...,(x_n,y_n)$ where each $x_{i}$ is $p$ dimensional, we seek to find:
$$\widehat{\beta}_{LS} = {\underset \beta {\text{argmin}}} ||{\bf \epsilon}||^2 = {\underset \beta {\text{argmin}}} ||{\bf Y - X\beta}||^2 = {\underset \beta {\text{argmin}}} \sum_{i=1}^{n} ( y_i - x_{i}\beta)^2 $$
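As a quick sanity check, here is a sketch showing that the norm form and the explicit sum form of this objective agree; the data and $\beta$ below are arbitrary toy values.

```python
import numpy as np

def rss(beta, X, Y):
    """Residual sum of squares ||Y - X beta||^2, written as a single norm."""
    resid = Y - X @ beta
    return resid @ resid

def rss_sum(beta, X, Y):
    """The same quantity written as the explicit sum over observations."""
    return sum((y_i - x_i @ beta) ** 2 for x_i, y_i in zip(X, Y))

# toy check that the two forms give the same number
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
Y = np.array([1.0, 2.0, 3.0])
beta = np.array([0.5, -0.5])
assert np.isclose(rss(beta, X, Y), rss_sum(beta, X, Y))
```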
Maximum Likelihood
Using the model above, we can set up the likelihood of the data given the parameters $\beta$ as:
$$L(Y|X,\beta) = \prod_{i=1}^{n} f(y_i|x_i,\beta) $$
where $f(y_i|x_i,\beta)$ is the pdf of a normal distribution with mean $x_i\beta$ and variance $\sigma^2$ (equivalently, the error $y_i - x_i\beta$ is normal with mean 0 and variance $\sigma^2$). Plugging it in:
$$L(Y|X,\beta) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(y_i - x_i\beta)^2}{2\sigma^2}}$$
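As a sanity check on the density we just plugged in, the hand-written product above should agree with `scipy.stats.norm` evaluated with mean $x_i\beta$ and standard deviation $\sigma$; a sketch with toy numbers:

```python
import numpy as np
from scipy.stats import norm

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
Y = np.array([1.0, 2.0, 3.0])
beta = np.array([0.5, -0.5])
sigma = 1.0

resid = Y - X @ beta
# likelihood written out by hand, term by term
hand_written = np.prod(np.exp(-resid**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2))
# the same likelihood via the library's Gaussian pdf
library = np.prod(norm.pdf(Y, loc=X @ beta, scale=sigma))
assert np.isclose(hand_written, library)
```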
Now, when dealing with likelihoods it's generally easier to take the log before continuing (products become sums, exponentials simplify), so let's do that.
$$\log L(Y|X,\beta) = \sum_{i=1}^{n} \left[ \log\!\left(\frac{1}{\sqrt{2\pi\sigma^2}}\right) -\frac{(y_i - x_i\beta)^2}{2\sigma^2} \right]$$
Since we want the maximum likelihood estimate, we want to find the maximum of the equation above, with respect to $\beta$. The first term doesn't impact our estimate of $\beta$, so we can ignore it:
$$ \widehat{\beta}_{MLE} = {\underset \beta {\text{argmax}}} \sum_{i=1}^{n} -\frac{(y_i - x_i\beta)^2}{2\sigma^2}$$
Note that the denominator $2\sigma^2$ is a constant with respect to $\beta$, so it does not affect the argmax. Finally, notice the negative sign in front of the sum: maximizing a negative quantity is the same as minimizing that quantity without the negative sign. In other words:
$$ \widehat{\beta}_{MLE} = {\underset \beta {\text{argmin}}} \sum_{i=1}^{n} (y_i - x_i\beta)^2 = \widehat{\beta}_{LS}$$
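To see the equivalence numerically, here is a sketch that minimizes the negative log-likelihood with `scipy.optimize` (treating $\sigma$ as known and fixed) and compares the result to the least-squares fit from `np.linalg.lstsq` on simulated data; the true coefficients are made up for the example.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, p = 200, 3
beta_true = np.array([1.0, -0.5, 2.0])   # hypothetical coefficients
sigma = 1.0
X = rng.normal(size=(n, p))
Y = X @ beta_true + rng.normal(scale=sigma, size=n)

def neg_log_lik(beta):
    """Negative of the log-likelihood derived above, with sigma held fixed."""
    resid = Y - X @ beta
    return n * np.log(np.sqrt(2 * np.pi * sigma**2)) + resid @ resid / (2 * sigma**2)

beta_mle = minimize(neg_log_lik, x0=np.zeros(p)).x   # numerical MLE
beta_ls, *_ = np.linalg.lstsq(X, Y, rcond=None)      # least-squares solution
assert np.allclose(beta_mle, beta_ls, atol=1e-4)
```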
Recall that for this to work we had to make specific model assumptions: the errors are independent, normally distributed with mean 0 and constant variance $\sigma^2$. Under those assumptions, least squares is equivalent to maximum likelihood.
For completeness, note that the minimizer has a closed form, obtained by setting the gradient of $||{\bf Y - X\beta}||^2$ with respect to $\beta$ to zero (the normal equations ${\bf X^TX\beta = X^TY}$):
$$\widehat{\beta} = {\bf (X^TX)^{-1}X^TY},$$
provided ${\bf X^TX}$ is invertible.
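A quick sketch verifying the closed form against `np.linalg.lstsq` on simulated data (in practice you would not form the explicit inverse; this is only to check the formula, and the data are again made up):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))
Y = X @ np.array([2.0, 0.0, -1.0, 3.0]) + rng.normal(size=50)

# closed-form solution (X^T X)^{-1} X^T Y
beta_closed_form = np.linalg.inv(X.T @ X) @ X.T @ Y
# numerically stable least-squares solve for comparison
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
assert np.allclose(beta_closed_form, beta_lstsq)
```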