1

I'm trying to estimate the sampling distribution of a variance from a single sample to do a certain statistical test.

To test the proof of concept, I take a normal distribution $N(200,5)$ with mean 200 and variance 25.

I then pick random samples of size sample_size from the normal distribution and calculate the variance of those samples to construct a sampling distribution of the variance.

My understanding is that the sampling distribution of the variance should follow a $\chi^2(\mathrm{sample~size} -1)$ distribution.

However, when I plot a PDF of the $\chi^2(\mathrm{sample~size} -1)$ distribution over my histogram of sample variances, the results do not agree.

The fit improves with increasing sample size but never truly "fits".

Where am I going wrong? Python MWE below.

#!/usr/bin/env python                                                                                                                                                             

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

#Setup population with known variance to test                                                                                                                                     
mu = 200
variance = 25
sigma = np.sqrt(variance)
N = 1000000
d = np.random.normal(mu, sigma, N)

n_samples = 100000

def sample_variance(sample_size):

    #Take many random samples of sample_size from the distribution d above.                                                                                                       
    sample_vars = []
    for i in range(n_samples):
        sample = np.random.choice(d, sample_size)
        sample_var = np.var(sample,ddof=1)
        sample_vars.append(sample_var)

    sample_vars = np.array(sample_vars)
    std_mean = np.mean(sample_vars) #The mean of the sample variances correctly reproduces the population variance                                                                
    std_std = np.std(sample_vars) #Calculated just for plotting purposes.                                                                                                         

    plt.hist((sample_vars), bins=30, density=True)


    xs = np.linspace(std_mean-4*std_std,std_mean+4*std_std,200) #Just for x-scale plotting

    #Plot chi^2 distribution with degrees of freedom = sample_size - 1                                                                                                            
    plt.plot(xs,stats.chi2.pdf(xs,df=sample_size-1))

    plt.plot((std_mean,std_mean),(0,0.30),'--')
    plt.xlim(0,50)
    plt.xlabel('Variance of sample')
    plt.ylabel('Density')
    plt.show()

sample_variance(30)
sample_variance(7)

EDIT: For clarity, assume I have no knowledge of the population variance since this is what I'm trying to estimate after I get this test case working.

Sample Size 7 Sample Size 30

user1654183
  • 121
  • 1
  • 3
  • With $\nu$ = N-1 = sample size - 1, try overplotting with $\Gamma(\nu/2 , 2\sigma^2 /\nu)$. So for N = 7, use $\Gamma(3, 8.3333)$ and for N = 30, use $\Gamma(14.5, 1.72414)$. There is a nice answer at this site that shows how the gamma distribution is related to the chi square distribution. As promised: https://stats.stackexchange.com/q/118676/247352 – Ed V Aug 03 '19 at 00:17
  • 1
    A slightly different approach: For normal data $Q = \frac{(n-1)S^2}{\sigma^2}\sim\mathsf{Chisq}(n-1),$ where $S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X)^2.$ You might start by showing a histogram of $Q$ along with the density of $\mathsf{Chisq}(n-1),$ Then find the density of the dist'n of $S^2.$ – BruceET Aug 03 '19 at 01:00
  • "My understanding is that the sampling distribution of the variance should follow a χ2(sample size−1) distribution." -- to clarify: this understanding is incorrect since it omits the population variance; previous commenters supply good information but perhaps weren't completely explicit about what was wrong. – Glen_b Aug 03 '19 at 01:40
  • If I have understood correctly these methods require knowledge of the population variance $\sigma$, which is the quantity I am trying to estimate? – user1654183 Aug 03 '19 at 12:18
  • You knew $\sigma $ when you generated your test histograms, so, if you are simply testing the theoretical pdfs, you still know $\sigma $. Later, with real histogram data, you would have to estimate $\sigma $. – Ed V Aug 03 '19 at 12:41
  • By the way, I re-did all the sims, as per your parameters, and overplotted the histograms as per my comment and I also did exactly what @BruceET said to do, with histograms of “Q” values. Everything worked very well and the two variance histograms you posted are correct. What is wrong is what you used as the pdfs, i.e., the orange curves. – Ed V Aug 03 '19 at 13:13
  • Hi Ed - that's exactly what I'm trying to do: estimate $\sigma$. I just generated a normal dist with known variance to test the general procedure. The idea I have is to develop a general procedure for fitting estimate $\sigma$ with a single value and then create a theoretical sampling distribution from that value. – user1654183 Aug 03 '19 at 19:34

0 Answers0