
I learned this cool piece of information in stats today:

If you want to minimize the sum of the squared differences $(y-\overline{y})^2$ over some data, where $\overline{y}$ is a constant (just a single number), then the optimal constant is the average, or mean.

Basically, if you do a one-dimensional regression where the best-fit line is $y = b$, then the optimal $b$ is the mean.

But if you instead minimize just the plain differences $(y-\overline{y})$, you get some number that isn't the mean.

It blew my mind that the best-fit line $y = b$ of the data is the mean. Why does the mean come from the squared differences and not from the plain differences?
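To make the claim concrete, here is one way to check it numerically (a minimal sketch with made-up data; the brute-force grid search is only for illustration):

```python
# Minimal sketch: brute-force check that the constant minimizing the sum of
# squared differences is the sample mean (made-up data, grid search for illustration).
import numpy as np

y = np.array([2.0, 3.0, 5.0, 11.0])            # arbitrary example data
candidates = np.linspace(0.0, 15.0, 15001)     # candidate constants b, step 0.001

sq_loss = [np.sum((y - b) ** 2) for b in candidates]
best_b = candidates[np.argmin(sq_loss)]

print(best_b)    # 5.25
print(y.mean())  # 5.25 -- the same value, i.e. the mean
```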

MachineLearner

2 Answers


Without going too deep into the math: you are talking about optimization, so suppose we tried to minimize

$$ \operatorname{arg\,min}_\mu \; \sum_i (x_i - \mu) $$

Then we would be looking for the value of $\mu$ whose subtraction makes the sum as small as possible. But we could simply let $\mu \to \infty$ and the sum would decrease without bound, so "the smallest possible value" is never attained.

Instead, you are interested in minimizing the distance between the $x_i$ values and $\mu$. A distance needs to be non-negative (just as in real life, you cannot be $-13$ km away from the nearest McDonald's). Examples of such distances are the $L_1$ distance, $\sum_i |x_i - \mu|$, and the squared $L_2$ distance, $\sum_i (x_i - \mu)^2$ (since it is squared, it does not matter whether you also take the absolute value). Minimizing the $L_1$ distance leads to the median, while minimizing the $L_2$ distance leads to the mean.
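For a quick numerical illustration of that last point (a minimal sketch, assuming SciPy is available; the sample below is made up):

```python
# Sketch: numerically minimize the L1 and L2 losses for a small skewed sample
# and compare the minimizers with the median and the mean.
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([1.0, 2.0, 2.0, 3.0, 10.0])      # skewed sample, so mean != median

l1 = minimize_scalar(lambda mu: np.sum(np.abs(x - mu)))   # L1 loss
l2 = minimize_scalar(lambda mu: np.sum((x - mu) ** 2))    # L2 loss

print(l1.x, np.median(x))   # approximately 2.0 and 2.0 -> L1 gives the median
print(l2.x, np.mean(x))     # approximately 3.6 and 3.6 -> L2 gives the mean
```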

Tim

You need to know a little bit of calculus (plus the chain rule). For your first example, the sum of squared errors, you have

$$F(b) = \sum_{n=1}^N[y_n - b]^2$$

In order to minimize the error $F(b)$ we have to differentiate with respect to $b$ and set the derivative equal to $0$.

$$\implies \dfrac{dF}{db} = \sum_{n=1}^N2[y_n-b](-1)=0$$ $$\implies \sum_{n=1}^Ny_n - \sum_{n=1}^Nb =0$$ $$\implies b = \dfrac{1}{N}\sum_{n=1}^Ny_n.$$
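The same result can be checked symbolically (a minimal sketch assuming SymPy is available; the data are made up):

```python
# Sketch: differentiate F(b) = sum (y_n - b)^2 symbolically, solve dF/db = 0,
# and compare the solution with the sample mean.
import sympy as sp

y = [4, 7, 9, 12]                               # arbitrary sample
b = sp.symbols('b')
F = sum((yn - b) ** 2 for yn in y)              # F(b) = sum of squared errors

b_star = sp.solve(sp.diff(F, b), b)[0]          # root of the derivative
print(b_star, sp.Rational(sum(y), len(y)))      # 8 and 8 -> the minimizer is the mean
```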

Now, if we try the same with the second error function,

$$G(b) = \sum_{n=1}^N[y_n-b]$$ $$\implies \dfrac{dG}{db} = \sum_{n=1}^N(-1) = -N \neq 0.$$

Hence, for the second error function the necessary condition for an extremum can never be satisfied. This was clear from the beginning, because $b \to -\infty$ makes $G(b) \to \infty$ and $b \to \infty$ makes $G(b) \to -\infty$, so $G$ has no minimum at all.
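To see this concretely (a minimal sketch with made-up data):

```python
# Sketch: G(b) = sum(y_n - b) decreases linearly in b, so it has no minimum.
import numpy as np

y = np.array([4.0, 7.0, 9.0, 12.0])

def G(b):
    return np.sum(y - b)           # G(b) = sum(y) - N*b

for b in [0, 10, 100, 1000]:
    print(b, G(b))                 # 32.0, -8.0, -368.0, -3968.0: keeps decreasing
```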

MachineLearner