
I have a naive question.

I have a list a:

import math
a = [1, 2, 3, 6]

mymean = sum(a)/len(a)

To calculate the standard deviation I use:

math.sqrt(sum([(i-mymean)**2 for i in a])/len(a))

which gives 1.87.

But I found this formula, which gives the same result, and I do not understand it:

math.sqrt(sum([i**2 for i in a])/len(a) - (sum(a)/len(a))**2)

Can you please help me understand the second formula?

Medhat Helmy
  • What is len(a)? Is it the range of the values? – Michael R. Chernick Feb 01 '19 at 20:41
  • In Python, `len(a)` means the number of elements in a list. – Medhat Helmy Feb 01 '19 at 22:10
  • Okay I get it now. – Michael R. Chernick Feb 01 '19 at 23:16
  • @Medhat Do NOT use that second formula for computer calculation. In fairly common situations it leads to catastrophic cancellation (i.e. it can be disastrously inaccurate). – Glen_b Feb 02 '19 at 02:27
  • @Glen_b Thanks, I didn't know that. It was actually suggested as an awk pipeline to calculate the std; I will now reconsider using it. – Medhat Helmy Feb 02 '19 at 03:33
  • *Algebraic* equivalence does not mean "equally suitable for calculation"! ... If you *need* a fast (but pretty stable) single pass variance (and hence, standard deviation) calculation, better algorithms are already on site. E.g. there's one [here](https://stats.stackexchange.com/questions/72212/updating-variance-of-a-dataset/72215#72215). As an alternative, you can reduce the problem substantially by subtracting a good guess at the mean from each data value (even just subtracting the first observation from each data point before you use the naive formula would improve things). – Glen_b Feb 02 '19 at 03:39
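
The two alternatives mentioned in the last comment could be sketched roughly as follows. This is only an illustration, not necessarily the algorithm in the linked answer: `std_welford` uses the well-known Welford running update as one example of a stable single-pass method, and `std_shifted` applies the naive formula after subtracting the first observation. Both function names are made up for this sketch, and both compute the population standard deviation (dividing by n), as in the question.

```python
import math

def std_welford(data):
    """Single-pass population std using Welford's running update (illustrative)."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in data:
        n += 1
        delta = x - mean          # deviation from the old mean
        mean += delta / n         # update the running mean
        m2 += delta * (x - mean)  # uses both the old and the new mean
    return math.sqrt(m2 / n)

def std_shifted(data):
    """Naive sum-of-squares formula after subtracting the first observation."""
    n = len(data)
    shift = data[0]  # a rough guess at the mean
    total = sum(x - shift for x in data)
    total_sq = sum((x - shift) ** 2 for x in data)
    # Clamp at zero in case rounding makes the difference slightly negative.
    return math.sqrt(max(total_sq / n - (total / n) ** 2, 0.0))

a = [1, 2, 3, 6]
big = [x + 1e9 for x in a]  # same spread, large offset (hypothetical test data)
print(std_welford(a), std_shifted(a))
print(std_welford(big), std_shifted(big))
```

Shifting does not change the variance, which is why subtracting any constant close to the mean is safe; it only keeps the squared terms small enough that their difference is not lost to rounding.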

1 Answer


The two formulas are mathematically equivalent. Expand the square, then use $\mu = \frac{\sum{x_i}}{n}$:

$\frac{\sum{(x_i - \mu)^2}}{n} =$

$\frac{\sum{(x_i^2 - 2 x_i \mu + \mu^2)}}{n} =$

$\frac{\sum{x_i^2}}{n} - \frac{2 \mu \sum{x_i}}{n} + \frac{n \mu^2}{n} =$

$\frac{\sum{x_i^2}}{n} - 2 \mu^2 + \mu^2 =$

$\frac{\sum{x_i^2}}{n} - \mu^2$
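
As a quick numerical check with the list from the question, $a = [1, 2, 3, 6]$, so $n = 4$ and $\mu = 3$:

$\frac{\sum{(x_i - \mu)^2}}{n} = \frac{4 + 1 + 0 + 9}{4} = 3.5$

$\frac{\sum{x_i^2}}{n} - \mu^2 = \frac{1 + 4 + 9 + 36}{4} - 3^2 = 12.5 - 9 = 3.5$

Both give a variance of 3.5, so the standard deviation is $\sqrt{3.5} \approx 1.87$ either way.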

The first form requires two passes over the data: one to compute the mean and a second to compute the variance (the square of the standard deviation). The second form needs only one pass, accumulating the sum of the values and the sum of their squares and combining them at the end. The second is preferred when you only want to go through the data once (which can give speed advantages for some big-data cases), but the first is often less affected by rounding error, so both are still used.
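
A minimal Python sketch of the two forms, to make the one-pass versus two-pass difference and the rounding issue concrete. The function names and the offset of 1e9 are made up for illustration; both functions compute the population standard deviation (dividing by n), as in the question.

```python
import math

def std_two_pass(data):
    """First form: compute the mean, then average the squared deviations."""
    n = len(data)
    mean = sum(data) / n                                       # pass 1
    return math.sqrt(sum((x - mean) ** 2 for x in data) / n)   # pass 2

def std_one_pass(data):
    """Second form: accumulate the sum and the sum of squares in a single loop."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in data:
        n += 1
        total += x
        total_sq += x * x
    variance = total_sq / n - (total / n) ** 2
    # Rounding can push the difference below zero; clamp so sqrt does not fail.
    return math.sqrt(max(variance, 0.0))

a = [1, 2, 3, 6]
print(std_two_pass(a), std_one_pass(a))

# Same spread, but every value shifted by a large offset (hypothetical data).
shifted = [x + 1e9 for x in a]
print(std_two_pass(shifted), std_one_pass(shifted))
```

On the small list the two functions agree. On the shifted list the squares are around 1e18, where double precision can no longer resolve differences of a few units, so the one-pass result ends up far from 1.87 while the two-pass result is unaffected.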

Greg Snow