When I calculate the variance by hand, I get a different result than RStudio gives me.

The guy in the video, however, calculated it the same (wrong) way I did. Why is that?

My calculations:

Observations $=1,2,3,4,5,6,7,8,9$

Calculation by hand: $$E[X] = \frac{1}{n}*\sum^n_1 x_i = \frac{45}{9} = 5 \\\text{or}\\ E[X] = \sum x_i *\frac{1}{9} = 5 \\ Var(X) = E[(X-\mu)^2]=\sum (x_i - \mu)^2* \frac{1}{9} = \frac{20}{3} \\\text{or}\\ Var(X) = \left(\frac{1}{n} * \sum x_i^2\right)-\mu^2 = \frac{20}{3}$$

However, when I use the following code in R, I get different results.

a <- c(1:9)
mean(a) # = 5
var(a) # = 7.5

Questions:

What is happening here? Why are the results different?
Are the formulas I used for the calculation by hand correct?

  • The `var` function in R estimates the sample variance, which is calculated as $Var(X)=\dfrac{1}{n-1}\sum_i (x_i-\bar{x})^2$. – user2974951 Feb 18 '19 at 14:16
  • Ah I see. Which method is correct? – Jürgen Erhardt Feb 18 '19 at 14:18
  • The formula you presented looks like it is estimating the *population variance*, which is similar except that it divides by $n$ instead of $n-1$ and uses the population mean $\mu$ rather than the sample mean $\bar{x}$. Which one you choose depends on whether you have a sample or population data. As $n$ gets larger this becomes less important (in general). – user2974951 Feb 18 '19 at 14:20
  • I see. So $\frac{1}{n}$ for sample data and $\frac{1}{n-1}$ for the whole population data? – Jürgen Erhardt Feb 18 '19 at 14:24
  • Exactly the opposite. – user2974951 Feb 18 '19 at 14:24
  • @user2974951, "sample variance" is jargon. It [isn't](https://stats.stackexchange.com/a/16987/3277) "variance in the sample", it is [estimating population variance](https://stats.stackexchange.com/a/17893/3277). The "n-1" denominator makes this estimator unbiased (given that the mean we use is a sample mean). The "n" denominator makes the estimator biased; it is called the maximum likelihood estimate of the population variance. – ttnphns Feb 18 '19 at 14:43
  • @ttnphns Yes thank you for the corrections, my answers were written hastily and I did not express myself exactly. Also I did not want to bother OP too much about matters of bias. – user2974951 Feb 18 '19 at 14:46

1 Answer

The formula suggests the author is computing the variance of a whole population, which is defined as $$Var(X)=\dfrac{1}{n}\sum_i x_i^2-\mu^2$$

However, if all you have is a sample from a population, then the unbiased estimator of the population variance is $$Var(X)=\dfrac{1}{n-1}\sum_i (x_i-\bar{x})^2$$

Notice the difference: $n-1$ instead of $n$, and the sample mean $\bar{x}$ rather than the population mean $\mu$.
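
As a worked check on the nine observations from the question (where the mean is $5$ under either reading), the two denominators give:

$$\sum_i (x_i - 5)^2 = 16+9+4+1+0+1+4+9+16 = 60, \qquad \frac{60}{9} = \frac{20}{3} \approx 6.67, \qquad \frac{60}{8} = 7.5$$

So the two results differ only by the factor $\frac{n}{n-1} = \frac{9}{8}$.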

The R function `var` uses the second formula, since in statistics you are almost always dealing with a sample rather than a population.
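
If you do want the $\frac{1}{n}$ version in R, a minimal sketch is to compute it directly or to rescale the output of `var`. The variable names `sample_var`, `pop_var` and `pop_var_rescaled` are just illustrative choices, not built-in R objects:

```r
# Observations from the question
a <- 1:9
n <- length(a)

# var() divides the sum of squared deviations by n - 1
sample_var <- var(a)

# Dividing by n instead reproduces the hand calculation (20/3)
pop_var <- sum((a - mean(a))^2) / n

# Equivalent: rescale the result of var() by (n - 1) / n
pop_var_rescaled <- var(a) * (n - 1) / n

print(sample_var)        # 7.5
print(pop_var)           # 6.666667
print(pop_var_rescaled)  # 6.666667
```

Both approaches give the same number; rescaling `var` is shorter, while the explicit sum makes the denominator visible.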

user2974951