
I'm reading this lecture on Linear, Ridge Regression, and PCA. Slide 10 says:

[image: slide 10]

The thing I don't understand is the 5th statement, which says that $\mathbf{y} - \mathbf{\hat{y}}$ is perpendicular to the subspace. Why is this the case?

YellowPillow
  • The least squares criterion basically amounts to finding the "closest" point in the column space of $X$ to use as an approximation of $y$, where closest is defined in terms of Euclidean distance. This closest point is the one found by moving onto the column space of $X$ in a "perpendicular" direction. – dsaxton Oct 19 '16 at 02:27
  • Should the title say "least squares" rather than "regression"? (For example, is the residual orthogonal to the column space in the [$L_1$](https://en.wikipedia.org/wiki/Least_absolute_deviations) variant?) – GeoMatt22 Oct 19 '16 at 03:22

2 Answers


You have to think about it geometrically in terms of vectors and distances between them!

To understand the idea, refer to the next slide:

[image: the next slide]

In this example, you have two feature vectors $\mathbf{x}_1$ and $\mathbf{x}_2$ (so $p=2$). These vectors are in 3D space (so $N=3$).

The vector $\mathbf{y}$ is a vector in this 3D space and is given!

The goal is to find the linear combination $\hat{\mathbf{y}}$ (i.e. finding the coefficients $\beta_j$, refer to previous slides) of $\mathbf{x}_1$ and $\mathbf{x}_2$ that allows you to get as close as possible to $\mathbf{y}$.

Back to the example, since you have only 2 feature vectors $\mathbf{x}_1$ and $\mathbf{x}_2$, all their possible linear combinations (from which we will choose one that becomes $\hat{\mathbf{y}}$) will form a plane. We call it the span of the two vectors. This means that $\hat{\mathbf{y}}$ can only live on this plane.

The trick now is to think of $\hat{\mathbf{y}}$ and $\mathbf{y}$ as geometric vectors (arrows in space), not only as algebraic vectors.

Let $\mathbf{e}=\mathbf{y} - \hat{\mathbf{y}}$, which is equivalent to writing $\mathbf{y} = \hat{\mathbf{y}}+\mathbf{e}$. Geometrically, this means that to get $\mathbf{y}$ you have to add $\mathbf{e}$ to $\hat{\mathbf{y}}$, so $\mathbf{e}$ represents what separates $\hat{\mathbf{y}}$ from $\mathbf{y}$. Its length $\|\mathbf{e}\|$ is the distance between the two vectors $\hat{\mathbf{y}}$ and $\mathbf{y}$. Patience, we are almost there... :-)

The goal is to minimize this distance. Refer to the figure above and imagine moving your $\hat{\mathbf{y}}$ vector around inside the subspace spanned by $\mathbf{x}_1$ and $\mathbf{x}_2$ (i.e. the plane), with $\mathbf{e}$ moving along with it, always going from the head of $\hat{\mathbf{y}}$ to the head of $\mathbf{y}$. Where do you think the distance will be minimal?

This happens when $\hat{\mathbf{y}}$ is directly under $\mathbf{y}$, i.e. when $\hat{\mathbf{y}}$ is the orthogonal projection of $\mathbf{y}$ onto the plane, so that $\mathbf{e}$ is perpendicular to the subspace.
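
If the picture alone is not convincing, here is a short sketch of why the perpendicular position is the minimizer. The symbol $\mathbf{v}$, standing for an arbitrary point of the plane, is introduced here for illustration and does not appear in the slides. Suppose $\mathbf{e}=\mathbf{y}-\hat{\mathbf{y}}$ is perpendicular to the plane. For any $\mathbf{v}$ in the plane, the vector $\hat{\mathbf{y}}-\mathbf{v}$ also lies in the plane, so it is perpendicular to $\mathbf{e}$, and Pythagoras' theorem gives

$$\|\mathbf{y}-\mathbf{v}\|^2=\|\mathbf{e}+(\hat{\mathbf{y}}-\mathbf{v})\|^2=\|\mathbf{e}\|^2+\|\hat{\mathbf{y}}-\mathbf{v}\|^2\ \geq\ \|\mathbf{e}\|^2 .$$

Equality holds only when $\mathbf{v}=\hat{\mathbf{y}}$, so no other point of the plane is closer to $\mathbf{y}$.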

Conclusion:

Minimizing the distance (technically the squared distance) between $\hat{\mathbf{y}}$ and $\mathbf{y}$ is equivalent to making the difference vector $\mathbf{y}-\hat{\mathbf{y}}$ perpendicular to the subspace spanned by the feature vectors!
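
If you want to check this numerically, here is a minimal NumPy sketch. It uses the same sizes as the example ($N=3$, $p=2$), but the data are random numbers made up for illustration, and the variable names are my own, not from the lecture. It fits least squares with `np.linalg.lstsq` and then looks at the inner products of the residual with each feature vector.

```python
import numpy as np

# Made-up data for illustration: N = 3 observations, p = 2 feature vectors.
rng = np.random.default_rng(0)
N, p = 3, 2
X = rng.normal(size=(N, p))   # columns play the role of x_1 and x_2
y = rng.normal(size=N)

# Least-squares coefficients and the fitted vector y_hat in the plane.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat
e = y - y_hat                 # the residual vector

# Inner products of e with x_1 and x_2; these should be zero up to rounding.
print(X.T @ e)

# Any other point in the plane should be at least as far from y as y_hat is.
other = X @ (beta_hat + rng.normal(size=p))
print(np.linalg.norm(e), np.linalg.norm(y - other))
```

The first print should show two numbers that are zero up to floating-point error, and the second should show that the perturbed point is no closer to $\mathbf{y}$ than $\hat{\mathbf{y}}$.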

Learn_and_Share

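If you prefer an algebraic answer to the +1 geometric answer by MedNait, note that, since $\hat y=X(X'X)^{-1}X'y$, $$ y-\hat y=My $$ with $M=I-X(X'X)^{-1}X'$ the "residual maker matrix". Then, $$MX=X-X(X'X)^{-1}X'X=X-X=0.$$ Since $M$ is symmetric, this also gives $X'M=(MX)'=0$, so $X'(y-\hat y)=X'My=0$: the residual is orthogonal to every column of $X$, and hence to the whole column space.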
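
The same identity can be checked numerically. This is a minimal sketch assuming $X$ has full column rank so that $X'X$ is invertible. The data are random and made up for illustration, and forming the inverse explicitly is done only to mirror the formula, not as a recommended way to fit a regression.

```python
import numpy as np

# Made-up data for illustration; a random X has full column rank
# with probability one.
rng = np.random.default_rng(1)
N, p = 5, 2
X = rng.normal(size=(N, p))
y = rng.normal(size=N)

# Residual maker matrix M = I - X (X'X)^{-1} X'.
M = np.eye(N) - X @ np.linalg.inv(X.T @ X) @ X.T
residual = M @ y              # equals y - y_hat

print(np.allclose(M @ X, 0.0))          # checks MX = 0
print(np.allclose(M, M.T))              # checks that M is symmetric
print(np.allclose(X.T @ residual, 0.0)) # checks X'(y - y_hat) = 0
```

Each check compares against zero up to floating-point tolerance, since the products are not exactly zero in finite precision.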

Christoph Hanck