What you're discovering by examining the formulas is the concept of a norm. (Obligatory Wikipedia).
The way we compute distances in Euclidean space is by taking the square roots of sums of squares. Why? Well, take the points A = (0,0), B = (0,1), C = (0,-1). It stands to reason that the distance from A to B should be the same as the distance from A to C -- the fact that one is above and one is below is irrelevant to the concept of distance. A norm is something just slightly more abstract -- it's a numeric value $\|v\|$ associated to a vector $v$. A vector is an element of a vector space, which is a structure whose elements can be added together or multiplied by a scalar. An array of numbers $(v_1, v_2, \cdots, v_n)$ is a vector if you define addition of two vectors component by component and multiplication by a number component-wise as well. All kinds of weird things are vectors, but arrays of real numbers are the most common.
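If you want to poke at that definition, here's a minimal sketch of those component-wise operations (assuming NumPy, with made-up numbers that have nothing to do with your data):

```python
# Arrays of numbers form a vector space: addition and scalar multiplication
# are both defined component by component.
import numpy as np

v = np.array([1.0, 2.0, 3.0])
w = np.array([4.0, 5.0, 6.0])

print(v + w)    # [5. 7. 9.]    -- component-wise addition
print(2.5 * v)  # [2.5 5. 7.5]  -- component-wise scalar multiplication
```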
But getting back on track -- every norm induces a metric/distance by setting $d(v,w) = \| v - w\|$, and it will probably be a few years before you encounter metrics that are not induced by norms, so we can think of a norm as just the distance from the origin. So, finally, on to your question: the sum of squared residuals is the (squared) Euclidean norm of the vector of residuals.
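To make that concrete, here's a quick sketch (toy numbers, assuming NumPy) showing that the sum of squared residuals and the squared Euclidean norm of the residual vector are literally the same quantity:

```python
# The sum of squared residuals is exactly the squared Euclidean norm of the
# residual vector.
import numpy as np

y = np.array([3.1, 4.9, 7.2])        # observed values (made up)
y_hat = np.array([3.0, 5.0, 7.0])    # fitted values (made up)
residuals = y - y_hat

print(np.sum(residuals ** 2))          # sum of squared residuals
print(np.linalg.norm(residuals) ** 2)  # squared Euclidean norm -- same number
```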
To summarize how all of this applies to your question:
- Usually, by construction, residuals have zero mean and therefore sum to zero -- i.e. they cancel each other out. If they didn't, there would be some structure in the data that your model missed! That doesn't happen with well-behaved statistical tools such as linear regression, ANOVA, etc. (see the sketch after this list).
- The reason you square your residuals is that it's useful to see them as a single vector and calculate its length, so a large residual counts as large whether it's positive or negative. Soon enough you'll find that quantities that don't correlate are represented by vectors that are perpendicular to each other. Geometry is fun and useful.
- Yes, squaring numbers in $[-1,1]$ shrinks them, and strictly speaking, to calculate lengths you have to take the square root of the sum again to bring things back to the same scale. But models often do things like minimizing some length -- e.g. in linear regression we look for the $\beta$ that minimizes
$$ \|y - X\beta\| = \sqrt{\sum_i (y_i - x_i \beta)^2} $$
and if we call the sum under the root $f(\beta)$, minimizing $g(\beta) = \sqrt{f(\beta)}$ yields the same $\beta$ as minimizing $f(\beta)$, because the square root is monotonically increasing -- so we may as well skip the square roots. (Think about this for a minute.)
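Here's a rough sketch of two of those bullets, with made-up data and assuming NumPy/SciPy are available: least-squares residuals (with an intercept) sum to zero, and minimizing the norm or the squared norm picks out the same slope.

```python
# (1) OLS residuals sum to ~0; (2) minimizing f and sqrt(f) gives the same
# minimizer, because sqrt is monotone increasing.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=50)   # made-up "true" model

# (1) ordinary least squares with an intercept
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)
print(residuals.sum())                   # ~0, up to floating-point noise

# (2) same minimizer whether we minimize f or sqrt(f)
f = lambda b: np.sum((y - b * x - intercept) ** 2)   # squared norm
g = lambda b: np.sqrt(f(b))                          # the norm itself
print(minimize_scalar(f).x, minimize_scalar(g).x)    # essentially identical
```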
Fun fact about Euclidean distances/norms: the single number $\mu$ that minimizes the norm of the residuals $x_1 - \mu, x_2 - \mu, \cdots$ is the arithmetic mean. So least-squares regression is a conditional mean.
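A quick numerical check of that fact (toy numbers, assuming NumPy/SciPy):

```python
# The single number mu that minimizes the Euclidean norm of the residuals
# x_i - mu is the arithmetic mean.
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
mu_star = minimize_scalar(lambda mu: np.sum((x - mu) ** 2)).x

print(mu_star)     # ~6.2
print(np.mean(x))  # 6.2
```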
Funner fact: Euclidean distance/norm -- remember, given by
$$ \|v\| = \sqrt[2]{\sum_i v_i^2} $$
is the most common and obvious because it's the length of lines you draw on paper, but you could use any $p \geq 1$ in the formula
$$ \|v\|_p = \sqrt[p]{\sum_i |v_i|^p} $$
(but note the trick: I put $v_i$ in absolute value so this works for odd $p$). These are called Minkowski norms, and the most popular are $p=1$ and $p=\infty$. Let's look at the former:
$$ \|v\|_1 = \sum_i |v_i| $$
The distance obtained from this norm is sometimes called the "Manhattan distance", because Manhattan in New York is laid out on a square grid. So to go from (0,0) to (1,1), in the Euclidean (2-norm) world I can go in a straight line like a bird, but in the Manhattan (1-norm) world I need to walk the entire side of a block and then take a turn to walk the other side. The Euclidean distance is $\sqrt{2}$, but the Manhattan distance is 2.
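The same trip, computed with a few different $p$'s (assuming NumPy; `np.linalg.norm` takes the exponent as `ord`):

```python
# Distance from (0,0) to (1,1) under a few Minkowski norms.
import numpy as np

a = np.array([0.0, 0.0])
b = np.array([1.0, 1.0])

print(np.linalg.norm(b - a, ord=2))       # 1.414... (Euclidean, p = 2)
print(np.linalg.norm(b - a, ord=1))       # 2.0      (Manhattan, p = 1)
print(np.linalg.norm(b - a, ord=3))       # 1.26...  (any p >= 1 works)
print(np.linalg.norm(b - a, ord=np.inf))  # 1.0      (the max-norm, p = infinity)
```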
This has many, many applications in statistics. Again, the single number $\mu$ that minimizes the Manhattan norm of the residuals $x_1 - \mu, x_2 - \mu, \cdots$ is the median, and you can extend that to linear regression and so on. You just have to do the math (the distribution of your test statistics won't be $\chi^2$ anymore, for one).
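And the same sanity check as before, now for the 1-norm (toy numbers, assuming NumPy/SciPy):

```python
# The single number mu that minimizes the Manhattan norm (sum of absolute
# residuals) is the median, not the mean.
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
mu_star = minimize_scalar(lambda mu: np.sum(np.abs(x - mu))).x

print(mu_star)       # ~4 (the median)
print(np.median(x))  # 4.0
print(np.mean(x))    # 6.2 -- the mean gets pulled toward the large value
```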
Here's hoping that this sort of addresses your questions and makes you hungry for more!