mean and least error on the line

Question

Suppose I have a line on which i have points in non-decreasing order. My intuition tells me that if I want to minimize the squared mean error on some subset of 3 (or other number) points, I would like to take 3 subsequent points, or rather check all 3-subsets of subsequent numbers on the line to find one with lowest square mean error. Suppose I have 4 points, and subset of points a, b, d, with c point in between points b and d. All points are in increasing order: a__b__c___d.

Now, I have square mean error for a, b, and d: f(a, b, d) = (a-m)^2 + (b-m)^2 + (d-m)^2, where m is arithmetic mean of the subset. My intuition tells me, that if I now choose subset a, b, and c, then f(a, b, c) < f(a, b, d).

HOW can I prove THIS? I tried to do it myself using geometry and using a lot of arithmetic but it is still elusive. Cannot also formulate the question on google to look for answer.

Have you tried Konig-Huygens? Is m the mean of a, b, c *and* d, or only of three? — Has QUIT--Anony-Mousse, Nov 03 '16 at 06:45
Hi, only three. Also the proof should be extended to arbitrary number of points. — SWIM S., Nov 03 '16 at 09:42
Definitely try König-Huygens. It should make the proof easy. — Has QUIT--Anony-Mousse, Nov 03 '16 at 20:37
But this theorem states that E[X^2]-EX^2 = Variance. Im quite unsure if it fits in here because there is no probability involved in my example, but if i consider mean as expectation of points with equal probability....mmm..that might work...THanks for suggestion — SWIM S., Nov 03 '16 at 22:41
Try expressing f(a,b,c) in terms of variance. then f(a,b,c)-f(a,b,d) and remove common terms. Maybe factor to get (c-d). — Has QUIT--Anony-Mousse, Nov 03 '16 at 22:54

whuber · Answer 1 · 2021-07-14T21:41:58.470

A "mean square error" is a multiple of a variance; the multiplier depends only on the count of data. Thus it appears you want to prove this lemma:

When $x_n \lt x_{n+1}$ are the two largest values in a dataset of $n+1$ values $(x_1,x_2,\ldots, x_n, x_{n+1}),$ the variance of $(x_1,x_2,\ldots, x_{n-1}, x_{n+1})$ exceeds the variance of $(x_1,x_2,\ldots,x_{n-1},x_n).$

The variance is proportional to the sum of squares of all possible differences of the data. (Again, the constant of proportionality depends only on the count $n.$) When you replace $x_n$ by $x_{n+1},$ you increase all the differences between the highest value and the other $n-1$ values, eaving the count the same and without changing any of the other differences. Therefore the variance increases, QED.

Comments

Because there are always many ways to demonstrate algebraic relations like these, there are other solutions to this one. https://stats.stackexchange.com/a/361801/919 shows another way: apply it to remove $x_n$ and then to adjoin $x_{n+1}$ to the dataset.

A purely algebraic demonstration of the key assertion -- the variance can be expressed as a sum of squared differences -- appears at https://stats.stackexchange.com/a/225758/919.

If you prefer a Calculus approach, you can also define a function $f(x)$ to be the variance of the dataset $(x_1,x_2,\ldots, x_{n-1},x)$ for all $x$ exceeding the largest of the $x_i.$ This is a quadratic function of $x,$ so compute the derivative $f^\prime(x)$ and show it is always positive.

mean and least error on the line

1 Answers1

Comments