Why is linear regression different from PCA?

Question

I am taking Andrew Ng's Machine Learning class on Coursera, and in the slide below he distinguishes principal component analysis (PCA) from linear regression. He says that in linear regression we draw vertical lines from the data points to the line of best fit, whereas in PCA we draw lines perpendicular to the fitted line, which gives the shortest distance from each point to it.

I thought with linear regression we always use some Euclidean distance metric to calculate the error from what our hypothesis function predicts vs. what the actual data point was. Why doesn't it use the shortest distance a la PCA?

In the 1st picture you predict variable Y from variable X. Hence the errors you draw as lines are parallel to the Y axis and perpendicular to the X axis. In the 2nd picture there is no single observed variable that you predict. Instead, you predict _both_ X and Y at once. Hence the error segments are at an angle to both axes. PC1 is the prediction line, which claims to stand in for both X and Y. X and Y are used to "model" an unobservable "latent" variable, which in turn is considered to predict the two of them back. — ttnphns, Jul 28 '15 at 21:06

score 4 · Answer 1 · answered Jul 28 '15 at 21:04

With linear regression, we are modeling the conditional mean of the outcome, $E[Y|X] = a + bX$. Therefore, the $X$s are thought of as being "conditioned upon": they are treated as part of the experimental design, or as representative of the population of interest.

That means any distance between the observed $Y$ and its predicted (conditional mean) value $\hat{Y}$ is thought of as an error, and the value $r = Y - \hat{Y}$ is called the "residual error". The conditional error variance of $Y$ is estimated from these residuals (again, no variability is attributed to the $X$ values). Geometrically, that is a "straight up and down" kind of measurement.
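To make the "straight up and down" versus perpendicular contrast concrete, here is a minimal NumPy sketch. The simulated data, the seed, and the helper names `vertical_sse` and `perpendicular_sse` are illustrative assumptions, not something taken from the course slide. It fits both lines to the same centered data and checks that each one wins on its own criterion.

```python
import numpy as np

# Simulated data (an assumption for illustration only).
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(scale=1.0, size=n)

# Center the data so both fitted lines pass through the mean point.
xc = x - x.mean()
yc = y - y.mean()

# OLS slope: minimizes the sum of squared VERTICAL distances.
b_ols = (xc @ yc) / (xc @ xc)

# PCA slope: direction of the leading eigenvector of the covariance matrix,
# which minimizes the sum of squared PERPENDICULAR distances.
cov = np.cov(np.vstack([x, y]))
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
v = eigvecs[:, -1]                      # eigenvector of the largest eigenvalue
b_pca = v[1] / v[0]

def vertical_sse(slope):
    """Sum of squared vertical offsets to the line y = slope * x."""
    return np.sum((yc - slope * xc) ** 2)

def perpendicular_sse(slope):
    """Sum of squared perpendicular distances to the line y = slope * x."""
    return np.sum((yc - slope * xc) ** 2) / (1.0 + slope ** 2)

print("OLS slope:", b_ols, " PCA slope:", b_pca)
print("vertical SSE      (OLS, PCA):", vertical_sse(b_ols), vertical_sse(b_pca))
print("perpendicular SSE (OLS, PCA):", perpendicular_sse(b_ols), perpendicular_sse(b_pca))
```

The two slopes generally differ: the OLS line has the smaller vertical SSE and the PCA line has the smaller perpendicular SSE, because each is the minimizer of a different objective.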

When there is measurement variability in $X$ as well, additional assumptions are needed to justify using linear regression in this fashion. In particular, nondifferential measurement error in $X$ (nondifferential misclassification, when $X$ is categorical) tends to attenuate the estimated slope $b$ of the regression model toward zero.
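The attenuation can be seen in a small simulation. This is a sketch under stated assumptions: classical additive noise in $X$ that is independent of everything else, plus a made-up true slope, noise scales, and seed. Under those assumptions the expected OLS slope shrinks by the factor $\mathrm{Var}(X) / (\mathrm{Var}(X) + \mathrm{Var}(\text{noise}))$.

```python
import numpy as np

def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    xc = x - x.mean()
    return (xc @ (y - y.mean())) / (xc @ xc)

# All numbers below are illustrative assumptions.
rng = np.random.default_rng(1)
n = 20_000
true_slope = 2.0

x_true = rng.normal(scale=1.0, size=n)                  # Var(X) = 1
y = 1.0 + true_slope * x_true + rng.normal(scale=0.5, size=n)
x_noisy = x_true + rng.normal(scale=1.0, size=n)        # Var(noise) = 1

slope_clean = ols_slope(x_true, y)    # regress on the error-free X
slope_noisy = ols_slope(x_noisy, y)   # regress on the mismeasured X

# Expected attenuation factor: Var(X) / (Var(X) + Var(noise)) = 0.5
expected_noisy = true_slope * 1.0 / (1.0 + 1.0)

print("slope with clean X:", slope_clean)
print("slope with noisy X:", slope_noisy, "(expected about", expected_noisy, ")")
```

With these settings the slope estimated from the noisy $X$ lands near half the true slope, even though the sample is large.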

score 1 · Answer 2 · answered Jul 30 '15 at 15:18

> I thought with linear regression we always use some Euclidean distance metric to calculate the error from what our hypothesis function predicts vs. what the actual data point was

You were absolutely right. It's Euclidean in this sense: the observations are the dimensions. Think of each observation of the dependent variable, $y_i$, as a random variable. So you have an $N$-dimensional vector $Y=(y_1,y_2,\dots,y_N)$. You estimate the model and obtain an $N$-dimensional vector of predictions $\hat Y=(\hat y_1,\hat y_2,\dots,\hat y_N)$.

Now you minimize the sum of squared errors (SSE), which is the squared Euclidean distance between the actual and predicted vectors: $\|\hat Y-Y\|^2=\sum_{i=1}^N (\hat y_i-y_i)^2$. Each term compares $\hat y_i$ and $y_i$ at the same $x_i$, which is why the individual errors appear as vertical segments in the scatterplot.
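As a quick numerical check of that identity, here is a minimal NumPy sketch. The simulated data, coefficients, and seed are assumptions for illustration. It fits a line by least squares and compares the SSE with the squared Euclidean norm of $\hat Y - Y$ in $N$-dimensional observation space.

```python
import numpy as np

# Simulated data (illustrative assumption).
rng = np.random.default_rng(2)
N = 50
x = rng.uniform(0.0, 10.0, size=N)
Y = 3.0 + 1.5 * x + rng.normal(size=N)

# Design matrix with an intercept column, then the least-squares fit.
X = np.column_stack([np.ones(N), x])
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
Y_hat = X @ beta

# SSE written as a sum over observations ...
sse = np.sum((Y_hat - Y) ** 2)
# ... and as a squared Euclidean distance between two N-dimensional vectors.
dist_sq = np.linalg.norm(Y_hat - Y) ** 2

print("SSE:", sse)
print("squared Euclidean distance:", dist_sq)
```

The two numbers agree up to floating-point rounding. The distance lives in the space of the $N$ observations, not in the two-dimensional $(x, y)$ plane of the scatterplot.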
