
Linear regression minimizes the "vertical" (in two dimensions) distance $(y - \hat y)$. But this is not the true (perpendicular) distance between a point and the best-fit line.

That is, in the image here:

[Figure: a fitted line through a scatter of points, with the vertical distances drawn in green and the perpendicular distances in purple]

you use the green lines instead of the purple ones.

Is this done because the math is simpler? Because the effect of using the real distance is negligible, or equivalent? Because it's actually better to use a "vertical" distance?
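For reference, for a point $(x_i, y_i)$ and a candidate line $\hat y = a + bx$, the two distances differ only by a slope-dependent factor:

$$d_{\text{vertical}} = \lvert y_i - (a + b x_i)\rvert, \qquad d_{\perp} = \frac{\lvert y_i - (a + b x_i)\rvert}{\sqrt{1 + b^2}}.$$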

Peter Mortensen
Maverick Meerkat
  • There is such a thing as minimizing perpendicular distance. It is called Deming regression. Ordinary linear regression assumes the x values are known and the only error is in y. That is often a reasonable assumption. – Michael R. Chernick Jul 14 '19 at 17:14
  • 3
    Sometimes the ultimate purpose of finding the regression line is to make predictions of $\hat Y_i$'s based on future $x_i$'s. (There is a 'prediction interval' formula for that.) Then it is vertical distance that matters. – BruceET Jul 14 '19 at 17:29
  • 1
    @MichaelChernick I think your one-liner explained it best, maybe you can elaborate it a bit, and post it as an answer? – Maverick Meerkat Jul 14 '19 at 17:36
  • I think Gung's answer is what I would say elaborating on my comment. – Michael R. Chernick Jul 14 '19 at 18:54
  • Related: https://stats.stackexchange.com/questions/63966/other-ways-to-find-line-of-best-fit?rq=1 – Sycorax Jul 14 '19 at 19:48
  • 1
    If you were regressing weight against height, do you care that your results will be essentially different depending on whether you used kilograms and metres, grams and metres, or kilograms and centimetres? – Henry Jul 15 '19 at 09:18

2 Answers


Vertical distance is a "real distance". The distance from a given point to any point on the line is a "real distance". The question, when fitting the best regression line, is which of the infinitely many possible distances makes the most sense for how we are thinking about our model. That is, any number of possible loss functions could be right; it depends on our situation, our data, and our goals (it may help you to read my answer to: What is the difference between linear regression on y with x and x with y?).

It is often the case that vertical distances make the most sense, though. This would be the case when we are thinking of $Y$ as a function of $X$, which would make sense in a true experiment where $X$ is randomly assigned and the values are independently manipulated, and $Y$ is measured as a response to that intervention. It can also make sense in a predictive setting, where we want to be able to predict values of $Y$ based on knowledge of $X$ and the predictive relationship that we establish. Then, when we want to make predictions about unknown $Y$ values in the future, we will know and be using $X$. In each of these cases, we are treating $X$ as fixed and known, and $Y$ as, in some sense, a function of $X$. However, that mental model may not fit your situation, in which case you would need to use a different loss function. There is no absolute 'correct' distance irrespective of the situation.
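To make the vertical-distance criterion concrete, here is a minimal sketch in Python with simulated data (the helper name `vertical_loss` is just for this example): the ordinary least squares line is precisely the line that minimizes the sum of squared vertical distances, so perturbing its coefficients can only increase that loss.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, 100)  # Y as a function of X, plus noise

def vertical_loss(a, b):
    """Sum of squared vertical distances from the points to y = a + b*x."""
    return np.sum((y - (a + b * x)) ** 2)

# np.polyfit with degree 1 performs ordinary least squares;
# it returns coefficients highest-degree first.
b_hat, a_hat = np.polyfit(x, y, 1)

# No other line does better on the vertical-distance criterion.
assert vertical_loss(a_hat, b_hat) <= vertical_loss(a_hat + 0.1, b_hat)
assert vertical_loss(a_hat, b_hat) <= vertical_loss(a_hat, b_hat + 0.1)
```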

gung - Reinstate Monica
  • 132,789
  • 81
  • 357
  • 650
0

Summing up Michael Chernick's comment and gung's answer:

Both vertical and perpendicular distances are "real" - it all depends on the situation.

Ordinary linear regression assumes the $X$ values are known and the only error is in the $Y$'s. That is often a reasonable assumption.

If you assume error in the $X$'s as well, you get what is called Deming regression, which minimizes a (weighted) perpendicular point-to-line distance.
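A minimal sketch of that contrast in Python, using simulated data with measurement error in both variables. The closed-form slope below covers the special case of equal error variances in $X$ and $Y$ (Deming regression with $\delta = 1$, i.e. orthogonal regression); the general Deming estimator weights the two error variances.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x_true = rng.uniform(0, 10, n)
x = x_true + rng.normal(0, 0.5, n)              # measurement error in x
y = 2.0 * x_true + 1.0 + rng.normal(0, 0.5, n)  # measurement error in y

sxx = np.var(x, ddof=1)
syy = np.var(y, ddof=1)
sxy = np.cov(x, y, ddof=1)[0, 1]

# OLS minimizes vertical distances; error in x attenuates the slope toward 0.
b_ols = sxy / sxx

# Deming regression with delta = 1 (equal error variances) minimizes
# perpendicular distances and estimates the slope consistently.
b_deming = (syy - sxx + np.sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)

print(f"OLS slope: {b_ols:.3f}, Deming slope: {b_deming:.3f}")  # true slope is 2
```

With error in $X$, the OLS slope lands below the true value of 2, while the Deming slope stays close to it; with no error in $X$, the two largely agree.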

Maverick Meerkat
  • 2,147
  • 14
  • 27