
There is the MSE cost function: $C = \frac{1}{2n} \sum \lVert y - a \rVert^2$

Why not just use $C = \sum \lVert y - a \rVert$ instead?

(Here $\lVert \cdot \rVert$ denotes the vector's length, $y$ is the ideal network output, and $a$ is the current network output.)
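To make the two formulas concrete, here is a minimal NumPy sketch; the example arrays are made up purely for illustration and are not part of the original question:

```python
# Minimal sketch: compare the two costs from the question on toy data.
# "y" holds the ideal outputs, "a" the network's current outputs, one row per example.
import numpy as np

y = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])   # ideal outputs (n = 3 examples)
a = np.array([[0.8, 0.1], [0.3, 0.6], [0.9, 0.2]])   # current network outputs

n = y.shape[0]
lengths = np.linalg.norm(y - a, axis=1)    # Euclidean length of each error vector

mse_cost = np.sum(lengths ** 2) / (2 * n)  # C = (1/2n) * sum(||y - a||^2)
abs_cost = np.sum(lengths)                 # C = sum(||y - a||), the proposed alternative

print(mse_cost, abs_cost)
```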

Sycorax
Dmytro Nalyvaiko

2 Answers


You're talking about the $L_1$ norm and the $L_2$ norm. Both work for neural networks, but they behave differently.

Without more information, I can't say whether the $L_2$ norm is better (or worse) for your problem.

SmallChess

Short answer: both can be used.

Longer answer: both measures are in active use. The first measure is based on the Euclidean distance, the second on the taxi-cab distance; more formally, the $L_2$ distance and the $L_1$ distance.
Which is better depends on the context. Intuitively, the Euclidean distance prefers many small or medium errors over a few big errors, while the taxi-cab distance is more forgiving of a few large errors. Which one is preferable depends on what you are trying to achieve.
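For concreteness, the two distances for an error vector $v = y - a$ are as follows (standard definitions added here, not part of the original answer):

$$\lVert v \rVert_2 = \sqrt{\sum_i v_i^2} \quad \text{(Euclidean)}, \qquad \lVert v \rVert_1 = \sum_i \lvert v_i \rvert \quad \text{(taxi-cab)}.$$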

dimpol
  • You are talking about the length function, but I'm asking why the result of the length function should be raised to the power of 2. – Dmytro Nalyvaiko Mar 09 '17 at 11:59
  • That could be because you want to 'punish' big errors in certain training cases over the same error spread over multiple training cases. Suppose that over 2 training cases one algorithm has errors of 0 and 3 respectively, and another algorithm has an error of 2 for both training cases. Squaring the errors would make the second algorithm preferable; not squaring would have the first algorithm as better. The right choice depends on the context. – dimpol Mar 09 '17 at 12:12
  • "Squaring the errors would make the second algorithm preferable": 0^2 + 3^2 = 9; 2^2 + 2^2 = 8; why is the second algorithm preferable? – Dmytro Nalyvaiko Mar 09 '17 at 13:43
  • An error score of 8 is lower than an error score of 9, and a lower error score is preferable. – dimpol Mar 10 '17 at 22:20
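To make the arithmetic in the comment thread above concrete, here is a minimal sketch; the error values (0, 3 versus 2, 2) come from dimpol's example:

```python
# Reproduce the worked example from the comments above.
# Algorithm 1 concentrates its error in one training case; algorithm 2 spreads it out.
errors_algo1 = [0.0, 3.0]
errors_algo2 = [2.0, 2.0]

sum_abs_1 = sum(abs(e) for e in errors_algo1)   # 3.0
sum_abs_2 = sum(abs(e) for e in errors_algo2)   # 4.0 -> without squaring, algorithm 1 scores lower

sum_sq_1 = sum(e ** 2 for e in errors_algo1)    # 9.0
sum_sq_2 = sum(e ** 2 for e in errors_algo2)    # 8.0 -> with squaring, algorithm 2 scores lower

print(sum_abs_1, sum_abs_2, sum_sq_1, sum_sq_2)
```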