2

I am learning regression, and I don't understand why we need to square in the residual sum of squares. What is wrong with just using the sum of the residuals as the error value? What is the benefit of squaring the residuals?

Bryan Fok
  • 133
  • 1
  • 7
  • See also http://stats.stackexchange.com/questions/118/why-square-the-difference-instead-of-taking-the-absolute-value-in-standard-devia – Tim Sep 27 '16 at 14:30

3 Answers

4

@ocram's answer is good, but one point I'd add is the connection between least squares and maximum likelihood estimation. If we have a regression model of the form $y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \epsilon_i$ where the $\epsilon_i$ are independent normal$(0, \sigma^2)$ random variables then the likelihood function becomes

$$ \mathcal{L}(\beta) = \frac{1}{\sigma^n \sqrt{2 \pi}^n} \exp \left ( - \frac{\sum_{i=1}^{n} (y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij})^2}{2 \sigma^2} \right ) . $$

If we want to maximize this as a function of $\beta$ that's equivalent to minimizing $\sum_{i=1}^{n} (y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij})^2$, and this is nothing but the least squares criterion.
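As a quick numeric sanity check, here is a minimal sketch (simulated data and an arbitrary $\sigma = 0.5$, not part of the derivation above) showing that minimizing the negative log-likelihood and minimizing the residual sum of squares recover the same coefficients:

```python
# Minimal sketch: under normal errors, minimizing the negative log-likelihood
# and minimizing the residual sum of squares give the same beta.
# The data and sigma below are made up for illustration.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 1.5 + 2.0 * x + rng.normal(scale=0.5, size=n)
X = np.column_stack([np.ones(n), x])          # design matrix with intercept

def rss(beta):
    resid = y - X @ beta
    return np.sum(resid ** 2)

def neg_log_lik(beta, sigma=0.5):
    resid = y - X @ beta
    return n * np.log(sigma * np.sqrt(2 * np.pi)) + np.sum(resid ** 2) / (2 * sigma ** 2)

beta_ls = minimize(rss, x0=np.zeros(2)).x
beta_ml = minimize(neg_log_lik, x0=np.zeros(2)).x
print(beta_ls, beta_ml)   # the two estimates agree up to optimizer tolerance
```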

It's also interesting to note that means themselves are least squares estimates in the univariate case, so if we agree that means are good things to look at then least squares makes sense.
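To see the univariate case concretely, here is a tiny sketch (arbitrary numbers) confirming that the constant minimizing the sum of squared deviations is just the sample mean:

```python
# Tiny sketch with made-up numbers: the c minimizing sum((y_i - c)^2)
# is the sample mean.
import numpy as np
from scipy.optimize import minimize_scalar

y = np.array([2.0, 3.0, 5.0, 11.0])
c_star = minimize_scalar(lambda c: np.sum((y - c) ** 2)).x
print(c_star, y.mean())   # both are 5.25
```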

dsaxton
  • 11,397
  • 1
  • 23
  • 45
  • Doesn't the pdf of the normal distribution follow from the fact that we use squared residuals? We construct the likelihood based on the pdf of the normal distribution. If the answer to my question is yes (which I think it is), then of course maximizing the likelihood is the same as minimizing the SSR, since that follows by construction (right)? If the answer is no, could you please explain why (or should I ask a separate question)? – Marcel10 Sep 27 '16 at 14:48
  • You could of course derive it that way, but you don't have to. Nowhere in the central limit theorem, for instance, do we insist that the square function appear in the density of the limit, so it appears even when we don't require it. – dsaxton Sep 27 '16 at 14:55
  • Thanks, makes a lot of sense. Also after a quick google search I found this [post](http://math.stackexchange.com/questions/384893/how-was-the-normal-distribution-derived), which indeed shows that it is definitely not needed. – Marcel10 Sep 27 '16 at 14:58
  • It seems that the square function arises out of the Pythagorean theorem, which might give us more comfort that it's a natural thing to look at. – dsaxton Sep 27 '16 at 15:12
3

If you do not square, a negative residual (below the line) can offset the impact of a positive residual (above the line). Squaring is a remedy. Taking the absolute values of the residuals provides an alternative. But squaring is much easier to handle from a mathematical point of view (cf. derivatives).
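A tiny numeric sketch (arbitrary, made-up residuals) makes the cancellation concrete:

```python
# Made-up residuals: the plain sum can be zero for a badly fitting line,
# while the sum of squares (or of absolute values) still registers the error.
import numpy as np

residuals = np.array([-4.0, -1.0, 2.0, 3.0])   # these cancel out exactly
print(residuals.sum())              # 0.0  -> looks like a perfect fit
print(np.abs(residuals).sum())      # 10.0 -> absolute error criterion
print((residuals ** 2).sum())       # 30.0 -> least squares criterion
```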

ocram
  • 19,898
  • 5
  • 76
  • 77
  • So true, squaring will not give a negative value. For doing the calculation on a computer, I believe taking the absolute value would work the same as squaring! – Bryan Fok Sep 27 '16 at 14:25
  • 3
    @BryanFok it is *not* the same. Square provides a larger penalty for large residuals than taking the absolute value. – Matthew Gunn Sep 27 '16 at 14:28
  • @MatthewGunn That is a very good point too, since we want to penalize predictions that are far off from the observations. In that case, can we use the power of 3 instead of the square? :) – Bryan Fok Sep 27 '16 at 14:31
  • 1
    @BryanFok No, a negative value to the power of 3 is also a negative value... So see the answer of ocram – Marcel10 Sep 27 '16 at 14:39
  • 1
    @BryanFok Raising to an even power larger than 2 can be used, but it is (as far as I know) very unconventional – Marcel10 Sep 27 '16 at 14:40
3

Squaring the residuals changes the shape of the loss function. In particular, large errors are penalized more heavily by the square of the error. Imagine two cases: one where you have one point with an error of 0 and another with an error of 10, versus another case where you have two points with an error of 5 each. The linear error function treats both as having an equal sum of residuals (0 + 10 = 5 + 5 = 10), while the squared error penalizes the case with the single large error more (0² + 10² = 100 versus 5² + 5² = 50).

With a squared residual, the solution will prefer many small errors to a single large one. The linear residual is indifferent: it does not care whether the total error comes entirely from one sample or is spread out as a sum of many tiny errors.

You could also raise the error to a higher power to penalize large errors even more. Summing the tenth power of the residuals, for example, would push the solution toward avoiding a large error on any single point, since even one large residual would dominate the sum.
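Here is a short sketch of the example above, comparing the two error profiles under powers 1, 2, and 10 of the residual:

```python
# Sketch of the example above: two error profiles with the same total
# absolute error, compared under different powers of the residual.
import numpy as np

profile_a = np.array([0.0, 10.0])   # one perfect point, one large error
profile_b = np.array([5.0, 5.0])    # two moderate errors

for p in (1, 2, 10):
    loss_a = np.sum(np.abs(profile_a) ** p)
    loss_b = np.sum(np.abs(profile_b) ** p)
    print(p, loss_a, loss_b)
# p=1:  10 vs 10            -> the linear criterion is indifferent
# p=2:  100 vs 50           -> the single large error is penalized more
# p=10: 1e10 vs ~1.95e7     -> higher powers penalize it far more strongly
```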

Nuclear Hoagie
  • 5,553
  • 16
  • 24
  • it is always more clear with a lovely example. thank you so much. – Bryan Fok Sep 27 '16 at 14:34
  • In even simpler terms, does this allow a linear regression to not be overly influenced by an outlier? – Prithvi Boinpally Feb 14 '22 at 05:46
  • 1
    @PrithviBoinpally Higher powers of the error term (square, cube, etc.) make the fit *more* sensitive to outliers. Rather than fitting the main cloud of points and missing the outlier by a wide margin, higher power error terms will result in a fit that misses all the points by a smaller amount. Without the square, eleven errors of 1 unit is worse than one error of 10 units. With the square, one error of 10 units is worse than ninety-nine errors of 1 unit. – Nuclear Hoagie Feb 14 '22 at 14:26