The first method is preferred because it uses all the information in the sample, giving more 'accurate' measures of central tendency (i.e., the sample mean) and spread (i.e., the sample variance) in your data.
More formally, if the $x_i$ come from an unknown distribution but are iid with finite mean $\mu$ and variance $\sigma^2$, then the central limit theorem implies that, approximately for large $n$:
$$\bar{x} \sim N(\mu,\frac{\sigma^2}{n})$$
Informally, the above suggests the following two 'facts':

1. The sample mean is an unbiased estimator of the true mean, as the distribution of the sample mean is centered at the true mean.
2. The spread of the distribution is narrower for larger $n$.
Fact 1 says that the sample mean remains unbiased even with a smaller sample (i.e., using 500 instead of 1000 points is fine if bias is your only concern), but fact 2 says that the distribution of $\bar{x}$ is 'narrower' for larger $n$, so it makes sense to use the full sample of 1000 data points rather than 500. Intuitively, facts 1 and 2 together indicate that the larger the sample used to compute the sample mean, the smaller the chance that the sample mean is 'far away' from the true mean.
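If it helps to see this numerically, here is a minimal simulation sketch. The Exponential(1) population, the sample sizes 500 and 1000, and the 5,000 replications are arbitrary choices for illustration, not anything implied by your data:

```python
import numpy as np

rng = np.random.default_rng(42)
true_mean = 1.0    # mean of an Exponential(1) population (arbitrary illustrative choice)
n_reps = 5_000     # number of simulated samples per sample size

for n in (500, 1_000):
    # Draw n_reps samples of size n and compute each sample's mean
    sample_means = rng.exponential(scale=true_mean, size=(n_reps, n)).mean(axis=1)
    print(f"n={n}: mean of sample means = {sample_means.mean():.4f}, "
          f"sd of sample means = {sample_means.std(ddof=1):.4f}")

# Typical output: both averages of the sample means sit close to 1.0 (fact 1, unbiasedness),
# but the sd of the sample means is noticeably smaller for n=1000 than for n=500 (fact 2).
```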
A similar argument can be made for the sample variance. See the Wikipedia article on the distribution of the sample variance, in particular the formulas for $E(s^2)$ (i.e., the mean of the sample variance) and $\text{V}(s^2)$ (i.e., the variance of the sample variance).
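If I recall the standard results correctly (worth double-checking against that article), for an iid sample with finite fourth central moment $\mu_4 = E[(x_i-\mu)^4]$ and $s^2$ the usual unbiased sample variance:

$$E(s^2) = \sigma^2, \qquad \text{V}(s^2) = \frac{1}{n}\left(\mu_4 - \frac{n-3}{n-1}\sigma^4\right),$$

which for normally distributed data reduces to $\text{V}(s^2) = \frac{2\sigma^4}{n-1}$. So, just as with the sample mean, $s^2$ is unbiased regardless of $n$, but its variance shrinks as $n$ grows, which again favours using all 1000 points.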