I came across two formulas for the empirical variance. My professor used this one: $$ s^2=\sum_{i=1}^k n_i(x_i-m)^2 $$ with $m$ being the mean of the series. It didn't make much sense to me, and when I looked on the internet I found this more common formula: $$ s^2=\sum_{i=1}^n(x_i-m)^2 $$ Now I'm confused about which formula to use and what the difference between the two is.

Richard Hardy
wageeh

2 Answers

In the first formula, $ k $ is the number of distinct values (groups of repeated measurements) and $ n_i $ is the number of times the value $ x_i $ was observed.

For example, consider the vector $ X = (1,2,2,3,3,3) $, whose mean is $ \mu = 14/6 \approx 2.333 $. The sum of squares by the second formula is $$ s^2 = \sum_{i=1}^{6} (x_i-\mu)^2 = 1.778 + 0.111 + 0.111 + 0.444 + 0.444 + 0.444 = 3.333 $$ Do you notice the repeated terms in the sum? The first formula follows from grouping them: $$ s^2 = \sum_{i=1}^{3} n_i(x_i-\mu)^2 = 1 \times 1.778 + 2 \times 0.111 + 3 \times 0.444 = 3.333 $$
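
The arithmetic in this example can be checked with a short Python sketch. The variable names (`ss_raw`, `ss_grouped`, `counts`) are illustrative choices, not anything from the question:

```python
from collections import Counter

X = [1, 2, 2, 3, 3, 3]
mu = sum(X) / len(X)  # mean of the raw data, 14/6

# Second formula: one squared deviation per raw observation
ss_raw = sum((x - mu) ** 2 for x in X)

# First formula: one term per distinct value x_i, multiplied by its count n_i
counts = Counter(X)  # maps each distinct value to how often it occurs
ss_grouped = sum(n_i * (x_i - mu) ** 2 for x_i, n_i in counts.items())

print(round(ss_raw, 3), round(ss_grouped, 3))  # 3.333 3.333
```

Both sums agree because grouping only collects identical terms; no information is lost.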

jassis

The second formula is the general one, where the $x_i$'s are the raw data, e.g. for every $i$-th person you record their age, and there are $n$ people in your sample.

You can use the first formula for aggregated data, where the $x_i$'s are the distinct age categories and the $n_i$'s are their counts. For example, you observed $n_i=137$ people aged $x_i=26$ in your sample. In this case, there are $k$ distinct age categories and the total number of observations is $\sum_{i=1}^k n_i$.

Another use of the first formula is when $k$ is the number of samples and the $n_i$ are sample weights, where you want some samples to have more impact on the result than others.
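
As a minimal sketch of the weighted case: the values and weights below are made up for illustration, and the sketch assumes that $m$ is the weighted mean and that the sum is normalised by the total weight, which is one common convention rather than the only one.

```python
# Hypothetical values and weights, chosen only for illustration
x = [1.0, 2.0, 3.0]
w = [0.5, 1.0, 2.5]  # weights need not be integer counts

total_w = sum(w)

# Assumption: the mean used in the formula is the weighted mean
m_w = sum(w_i * x_i for w_i, x_i in zip(w, x)) / total_w

# First formula with the weights playing the role of n_i
ss_w = sum(w_i * (x_i - m_w) ** 2 for w_i, x_i in zip(w, x))

# Assumption: normalise by the total weight
var_w = ss_w / total_w

print(m_w, ss_w, var_w)
```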

Notice that variance is the average squared deviation from the mean, so with both formulas you still need to divide by the total sample size, i.e. by $n$ in the second case, or by $\sum_{i=1}^k n_i$ in the first case. For the sample variance, you would divide by $n-1$ instead.
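
To illustrate the division step, here is a small Python sketch on a made-up data set (the list `X` is only an example). It compares the hand-computed values with `statistics.pvariance` (divides by $n$) and `statistics.variance` (divides by $n-1$) from the standard library:

```python
import statistics

X = [1, 2, 2, 3, 3, 3]  # example data, not from the question
n = len(X)
mu = sum(X) / n

# Sum of squared deviations, as in the second formula
ss = sum((x - mu) ** 2 for x in X)

pop_var = ss / n           # average squared deviation (divide by n)
sample_var = ss / (n - 1)  # sample variance (divide by n - 1)

# The standard library should give the same two numbers
print(pop_var, statistics.pvariance(X))
print(sample_var, statistics.variance(X))
```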

Tim