Estimating distribution characteristics from characteristics of multiple samples

Question

Definition

Suppose that $X \sim D(\mu, \sigma)$, where $D$ is a 1D distribution (taking values in $\mathbb{R}$) with mean $\mu$ and stddev $\sigma$. If I sample $X$ (the random variable) $M\times N$ times, receiving a matrix $A$ (with $M$ rows and $N$ columns), and then generate

a vector $V_\mu$ from $A$ by reducing each row of $A$ to a single value by calculating the mean of that row, and
a vector $V_\sigma$ from $A$ by reducing each row of $A$ to a single value by calculating the biased standard deviation of that row about its mean, so that each element of $V_\sigma$ is $\sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \overline{x})^2}$,

how do I properly estimate the original $\mu$ and $\sigma$ from $V_\mu$ and $V_\sigma$? It's clear that the estimate of $\mu$ is just the mean of $V_\mu$, but how do I handle the $\sigma$?
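For concreteness, the setup above can be sketched in Python as follows. This is only an illustration: the seed and the sizes `M` and `N` are assumptions chosen for the example, and the normal distribution with $\mu=3$, $\sigma=5$ is borrowed from the experiment described further down.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
MU, SIGMA = 3.0, 5.0            # true parameters (same values as the experiment below)
M, N = 1000, 10                 # illustrative sizes, not taken from the question

# A has M rows and N columns of i.i.d. draws from N(MU, SIGMA^2)
A = rng.normal(loc=MU, scale=SIGMA, size=(M, N))

# Row-wise reductions: one mean and one biased (divisor N) std per row
V_mu = A.mean(axis=1)
V_sigma = A.std(axis=1, ddof=0)

print("mean of V_mu:   ", V_mu.mean())     # naive estimate of mu
print("mean of V_sigma:", V_sigma.mean())  # naive estimate of sigma
```

With these sizes the mean of `V_sigma` should land noticeably below `SIGMA`, which is the bias the rest of the question is about.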

Sigma

I've tried experimenting, assuming for starters that $D = \mathcal{N}$ and $X \sim \mathcal{N}(\mu=3, \sigma=5)$. I generated $A$, computed unbiased $\sigma(A)$ and then reduced $A$ to $V_\sigma$, from which I calculated the mean. This is the result:

Obviously, the result is biased: the estimate undershoots the true $\sigma$ more often than not. If I instead multiply each member of $V_\sigma$ by $\frac{N}{N-1}$ before calculating the mean of $V_\sigma$, I get:

which overshoots the true $\sigma$ more often than not, so I'm really at a loss here. I've also tried multiplying $V_\sigma$ by $\frac{MN}{MN-1}$ instead, but this yields

which is slightly better than the original, but still heavily biased.

EDIT: Thanks to Ryan for pointing out my mistake: I forgot to take the square root of the correction factor. Still, I had no idea that the $c_4$ factor should be taken into account as well. By multiplying the mean of $V_\sigma$ by $\sqrt{\frac{N}{N-1}}$ and also by $1/c_4(N)$, I've obtained:

which is unbiased, but more spread out (it has a larger standard deviation).

Mu

The mean estimate is good, as expected:

Question

After everything that I've shown, what I wonder is this:

Given $D = \mathcal{N}$, what are these distributions I'm observing? They look normal, but aren't they t? EDIT: I know now that the mean estimate, once the true $\mu$ is subtracted and the result is divided by $S/\sqrt{N}$, follows a $t$ distribution with $N-1$ degrees of freedom.
SOLVED COMPLETELY Given $D = \mathcal{N}$, how do I correct the $\sigma$ estimate from $V_\sigma$? EDIT: answered by Ryan, see Sigma section.
Given $D = \mathcal{N}$, after I correct the $\sigma$ estimate, is it okay to claim that $D$ is probably $\mathcal{N}(\mu, \sigma)$? Certainly, the larger $M$ and $N$ get, the more confident I can be while claiming such a fact, right? What is the proper statistical procedure I should execute after I get my estimate of $\mu$ and $\sigma$? For example, according to my experiment, I can see that the $\mu$ estimate falls into $[2, 4]$ about 95% of the time. EDIT: I know now that if I generate the interval as $\overline{x} \pm 1.96 \cdot \sigma / \sqrt{N}$, the interval will contain $\mu$ 95% of the time. But what about $\sigma$? And when I finally settle on some interval estimates of $\mu$ and $\sigma$, can something about $P(X > c), X \sim \mathcal{N}(...)$ be said?
Can this problem be solved for general $D$, that is, obtain $V_\mu$ and $V_\sigma$ from a bunch of samples ($M\times N$ to be specific) and conclude something about the true $\mu$ and $\sigma$?

Ryan Volpi · Accepted Answer · 2021-04-07T23:51:52.280

Welcome to CV!

(1) Sampling Distribution of Sample Statistics

Given $D = \mathcal{N}$, what are these distributions I'm observing? They look normal, but aren't they t?

a) Sample mean $\bar{X}$

$\frac{\bar{X}-\mu}{S/\sqrt{n}}\sim t_{n-1}$ - t distribution with $n-1$ degrees of freedom, where $S$ is the sample standard deviation computed with Bessel's correction.

b) Sample variance $S^2$

$\frac{(n-1)}{\sigma^2}S^2 \sim\chi^2_{n-1}$ chi-squared distribution with $n-1$ degrees of freedom (see Sampling Distribution of Sample Variance)

c) Sample standard deviation $S$

$\sqrt{\frac{(n-1)}{\sigma^2}}S \sim\chi_{n-1}$ - chi distribution with $n-1$ degrees of freedom. This follows from the fact that if $X\sim \chi(n)$ then $X^2\sim \chi^2(n)$ (see Wikipedia: Chi Distribution)
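If you want to convince yourself of (a) to (c), here is a rough simulation check. It is a sketch under assumptions: the seed, `n = 10` and the number of replications are arbitrary choices of mine, and I use SciPy's Kolmogorov-Smirnov test only as a convenient way to compare the simulated statistics with the claimed distributions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)             # arbitrary seed for reproducibility
mu, sigma, n, reps = 3.0, 5.0, 10, 20000   # illustrative values

x = rng.normal(mu, sigma, size=(reps, n))
xbar = x.mean(axis=1)
s = x.std(axis=1, ddof=1)                  # Bessel-corrected sample std

# The three pivotal quantities from (a), (b) and (c)
t_stat = (xbar - mu) / (s / np.sqrt(n))
chi2_stat = (n - 1) * s**2 / sigma**2
chi_stat = np.sqrt(n - 1) * s / sigma

# KS distance between each simulated statistic and its claimed distribution
ks_t = stats.kstest(t_stat, stats.t(df=n - 1).cdf)
ks_chi2 = stats.kstest(chi2_stat, stats.chi2(df=n - 1).cdf)
ks_chi = stats.kstest(chi_stat, stats.chi(df=n - 1).cdf)

print("KS distance, t:    ", ks_t.statistic)
print("KS distance, chi2: ", ks_chi2.statistic)
print("KS distance, chi:  ", ks_chi.statistic)
```

Small KS distances are what you would expect if the claimed distributions are right. The chi and chi-squared checks are the same check in disguise, since one statistic is a monotone transform of the other.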

(2) Unbiased Estimate of Population $\sigma$

Given $D = \mathcal{N}$, how do I correct the $\sigma$ estimate from $V_\sigma$?

The sample variance with Bessel's correction (a factor of $\tfrac{n}{n-1}$ applied to the uncorrected variance) provides an unbiased estimate of the population variance. There are two reasons why that statement alone doesn't help you.

You are applying Bessel's correction $\frac{n}{n-1}$ to the sample standard deviation. In fact, you would want to multiply the sample standard deviation by $\sqrt{\frac{n}{n-1}}$ to apply the correction.
Even then, you will not get an unbiased estimate of the population standard deviation. The corrected variance is unbiased, but the square root of that value is not an unbiased estimate of the population standard deviation. See Wikipedia and related question. In the case where $D = \mathcal{N}$, there is a correction factor ($c_4(n)$) you can apply. It is discussed in the Wikipedia article linked above. For the case where $n=10$, the correction is $c_4(10)= \frac{128}{105}\sqrt{\frac{2}{\pi}}\approx 0.9726592741$.

In general, an unbiased estimate for the population standard deviation where $D = \mathcal{N}$ is given by $$\hat{\sigma}=\frac{1}{c_4(n)}\sqrt{\frac{\sum_{i=1}^n(x_i-\bar{x})^2}{n-1}}$$
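As a quick sanity check on the $n=10$ value, here is a small sketch that evaluates $c_4(n) = \sqrt{\frac{2}{n-1}}\,\Gamma\!\left(\frac{n}{2}\right)/\Gamma\!\left(\frac{n-1}{2}\right)$ directly with the standard library and compares it to the closed form quoted above. The particular list of sample sizes in the loop is just an illustration.

```python
from math import gamma, pi, sqrt

def c4(n: int) -> float:
    """Correction factor with E[S] = c4(n) * sigma for normal data (S uses divisor n-1)."""
    return sqrt(2.0 / (n - 1)) * gamma(n / 2) / gamma((n - 1) / 2)

# Closed form for n = 10 quoted in the text
closed_form_10 = (128 / 105) * sqrt(2 / pi)
print("c4(10) via gamma functions:", c4(10))
print("c4(10) via closed form:    ", closed_form_10)

# c4(n) approaches 1 as n grows, so the correction matters most for small samples
for n in (2, 5, 10, 30, 100):
    print(n, c4(n))
```

For very large $n$ the gamma functions overflow in floating point, so a log-gamma formulation would be the safer choice there.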

Here is a quick plot to show the difference in the estimated standard deviation using the two corrections on samples from a normal distribution as well as Python code to reproduce the plot.

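from math import gamma

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

SIGMA = 5   # true population standard deviation
MU = 3      # true population mean
m = 10000   # number of simulated samples (rows) per sample size N

# c4(n): for normal data, E[S] = c4(n) * sigma, where S uses Bessel's correction
def c4(n):
    return np.sqrt(2/(n-1)) * gamma(n/2) / gamma((n-1)/2)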

# calculate statistics for various N
results_dict = {x:[] for x in ['N','correction','s']}
for N in range(3, 25):
    A = np.random.normal(loc=MU, scale=SIGMA, size=[m,N])
    s_biased = np.std(A, axis=1)            # row-wise std with divisor N
    s_bessel = s_biased * (N/(N-1))**0.5    # row-wise std with divisor N-1
    results_dict['N'] += [N]*m*3
    results_dict['correction'] += ['None']*m+['Bessel']*m+['Bessel + c4']*m
    results_dict['s'] += list(s_biased)
    results_dict['s'] += list(s_bessel)
    results_dict['s'] += list(s_bessel / c4(N))

# create dataframe
results_df = pd.DataFrame(results_dict)

# plot results
plt.figure(figsize=(8,6))
sns.pointplot(
    data=results_df,
    x='N',
    y='s',
    hue='correction',
    ci=None
)
plt.title(r"Comparison of statistics for estimating $\sigma$")
plt.axhline(SIGMA, c='k', linestyle='--', label=r"$\sigma$")
plt.show()

(3) Confidence Intervals - Normal

Given $D = \mathcal{N}$, after I correct the $\sigma$ estimate, is it okay to claim that $D$ is probably $\mathcal{N}(\mu, \sigma)$? Certainly, the larger $M$ and $N$ get, the more confident I can be while claiming such a fact, right? What is the proper statistical procedure I should execute after I get my estimate of $\mu$ and $\sigma$? For example, according to my experiment, I can see that the $\mu$ estimate falls into $[2, 4]$ about 95% of the time.

You of course can't say that $D$ is probably exactly $\mathcal{N}(\bar{X}, S)$ but you can construct confidence intervals for $\mu$ and $\sigma$.

As an aside, the maximum likelihood estimator for the variance is actually the uncorrected version $\hat{\sigma}^2_{\text{MLE}}=\frac{1}{N}\sum_{i=1}^N(X_i-\bar{X})^2$ (see MLE Biased), even though the uncorrected estimate tends to underestimate the true value. By the invariance property of maximum likelihood, if $\hat{\sigma}^2_{\text{MLE}}$ is the MLE for $\sigma^2$, then $\sqrt{\hat{\sigma}^2_{\text{MLE}}}$ is the MLE for $\sqrt{\sigma^2}=\sigma$ (see Maximum Likelihood Estimation). We can also see, using our simulation, that the average squared difference between the variance estimate and the population variance $\sigma^2$ is lowest for the uncorrected estimate.

# Variance
results_df['s2'] = results_df['s']**2
# Variance error
results_df['s2_mse'] = (results_df['s2']-SIGMA**2)**2

plt.figure(figsize=(8,6))
sns.pointplot(
    data=results_df,
    x='N',
    y='s2_mse',
    hue='correction',
    ci=None
)
plt.ylabel(r"$(S^2-\sigma^2)^2$")
plt.title(r"Squared Error of statistics for estimating $\sigma^2$")
plt.show()

You can construct the following confidence intervals for your sample statistics.

a) Population mean $\mu$

A $100(1-\alpha)\%$ confidence interval for the population mean is

$$\left( \bar{X}-\frac{S}{\sqrt{n}}t_{n-1,\alpha/2} \leq \mu \leq \bar{X}+\frac{S}{\sqrt{n}}t_{n-1,\alpha/2} \right)$$

see: Confidence Intervals with σ unknown

b) Population variance $\sigma^2$

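A $100(1-\alpha)\%$ confidence interval for the population variance is $$\left(\frac{(n-1)s^2}{\chi^2_{\alpha/2,n-1}} \leq \sigma^2 \leq \frac{(n-1)s^2}{\chi^2_{1-\alpha/2,n-1}} \right)$$ where $\chi^2_{p,n-1}$ denotes the critical value with upper-tail probability $p$ under a chi-squared distribution with $n-1$ degrees of freedom.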

(see Confidence Intervals for Variances)

c) Population standard deviation $\sigma$

A $100(1-\alpha)\%$ confidence interval for the population standard deviation is $$\left(\sqrt{\frac{(n-1)s^2}{\chi^2_{\alpha/2,n-1}}} \leq \sigma \leq \sqrt{\frac{(n-1)s^2}{\chi^2_{1-\alpha/2,n-1}}} \right)$$

See Confidence Intervals for Variances again or this related question
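The three intervals in (a) to (c) can be computed with SciPy roughly as follows. Treat this as a sketch under assumptions: the sample `x`, its size and the seed are made up for illustration, and the intervals are only valid if the data really are normal. Note that the upper-tail critical value $\chi^2_{\alpha/2,n-1}$ corresponds to `chi2.ppf(1 - alpha/2, ...)`, since `ppf` works with lower-tail probabilities.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)               # arbitrary seed
x = rng.normal(loc=3.0, scale=5.0, size=50)  # illustrative sample, assumed normal
n, alpha = x.size, 0.05

xbar = x.mean()
s = x.std(ddof=1)                            # Bessel-corrected sample std

# (a) t interval for the mean
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
half_width = t_crit * s / np.sqrt(n)
mean_ci = (xbar - half_width, xbar + half_width)

# (b) chi-squared interval for the variance
chi2_upper = stats.chi2.ppf(1 - alpha / 2, df=n - 1)  # upper-tail prob. alpha/2
chi2_lower = stats.chi2.ppf(alpha / 2, df=n - 1)      # upper-tail prob. 1 - alpha/2
var_ci = ((n - 1) * s**2 / chi2_upper, (n - 1) * s**2 / chi2_lower)

# (c) interval for the standard deviation: square roots of the variance bounds
sd_ci = tuple(np.sqrt(var_ci))

print("mean CI:    ", mean_ci)
print("variance CI:", var_ci)
print("std CI:     ", sd_ci)
```

The interval for $\sigma$ is not symmetric around $s$, because the chi-squared distribution is skewed.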

(4) General Case

Can this problem be solved for general $D$, that is, obtain $V_\mu$ and $V_\sigma$ from a bunch of samples ($M\times N$ to be specific) and conclude something about the true $\mu$ and $\sigma$?

@Ben's answer seems most relevant to this question. A similar procedure that you may be able to apply in the general case, without any analytical computation, is the bootstrap. You can certainly use the bootstrap to estimate the sampling distribution of the sample mean, but I am not sure how well it works for estimating the population variance. I'm not finding a lot of specific discussion on this question, but this thesis seems to discuss the issue in depth: Evaluation of Using the Bootstrap Procedure to Estimate the Population Variance.
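To make the bootstrap idea concrete, here is a minimal percentile-bootstrap sketch. Everything specific in it is an assumption for illustration: the exponential sample stands in for "some non-normal $D$", and the seed, sample size and number of resamples `B` are arbitrary. It uses the plain percentile method, which is the simplest variant and not necessarily the best one for a standard deviation.

```python
import numpy as np

rng = np.random.default_rng(3)            # arbitrary seed
x = rng.exponential(scale=2.0, size=200)  # illustrative non-normal sample
B = 2000                                  # number of bootstrap resamples

# Resample the data with replacement: B resamples, each of the original size
idx = rng.integers(0, x.size, size=(B, x.size))
boot = x[idx]

# Recompute the statistics of interest on every resample
boot_means = boot.mean(axis=1)
boot_sds = boot.std(axis=1, ddof=1)

# Percentile intervals: middle 95% of the bootstrap distribution
mean_ci = np.percentile(boot_means, [2.5, 97.5])
sd_ci = np.percentile(boot_sds, [2.5, 97.5])

print("bootstrap 95% CI for the mean:", mean_ci)
print("bootstrap 95% CI for the std: ", sd_ci)
```

In line with the caveat above, I would trust the interval for the mean more than the one for the standard deviation, especially for small samples or heavy-tailed data.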

Wow, that's so cool! Thank you very much Ryan, I'm checking the articles and updating my code. — Captain Trojan, Apr 01 '21 at 19:24

score 2 · Answer 2 · answered Apr 04 '21 at 10:04

Your notation for this problem is problematic, so I am going to use a different notation for the same thing. I will denote the vector of subsample means as $\bar{\mathbf{x}} = (\bar{x}_1, ..., \bar{x}_M)$ and the vector of subsample standard deviations as $\mathbf{s} = (s_1,...,s_M)$. I also note that each subsample is of size $N$, so each of the elements of these vectors uses $N$ data points.

Since the values in your initial $M \times N$ matrix are IID values with mean $\mu$ and standard deviation $\sigma$, the best thing to do here is just to pool the $M$ subgroups into a single sample of size $MN$ and use this to estimate the mean and standard deviation parameters. If you are willing to use the original data in your matrix, this problem is fairly simple --- you just compute the overall sample mean and sample standard deviation and use standard estimation methods for the parameters. However, if you particularly want to start from the vectors $\bar{\mathbf{x}}$ and $\mathbf{s}$ for the subgroups, you can use mathematical rules for pooling subgroup moments to get the overall sample mean and sample standard deviation without using the underlying data values (see below for how to do this in R).

In the case where your initial data is IID normal data, the sample mean follows a normal distribution and the sample standard deviation follows a scaled chi distribution. This is true both for the subgroup sample moments and the overall pooled sample moments. In the more general case where you don't assume normality of the underlying values (but you assume finite kurtosis), the central limit theorem applies and you get similar distributions, though the latter may be adjusted for kurtosis (see e.g., here).

There are well-known confidence interval formulae for the true mean and standard deviation parameters given the sample mean and sample standard deviation (and maybe also the sample kurtosis). You can find derivation of the confidence intervals for the mean and variance, plus some related information on moments, in O'Neill (2014). (Note that this particular paper gives a confidence interval for the variance/standard deviation that takes account of the kurtosis of the population, either through a known kurtosis parameter or the sample kurtosis of the data; other sources give a simpler formula that implicitly assumes mesokurtosis of the underlying data.)

Computing pooled sample moments from subgroup moments: Computing the sample moments for pooled datasets composed of subsamples has been automated in the sample.decomp function in the utilities package (see package documentation). This function can compute pooled sample moments from subgroup moments up to fourth order (i.e., up to sample kurtosis). Here we give an example where we use the function to compute the sample moments of the pooled sample for $M=6$ subgroups each composed of $N=200$ standard normal data points. As you can see from the code below, we input the sample sizes, sample means and sample standard deviations into the function, and then compute the moments of the pooled sample.

#Input sample statistics for the subgroups
library(utilities)
N      <- c(200, 200, 200, 200, 200, 200)
MEAN   <- c(0.0556434, 0.0153109, 0.0722623, 0.1211588, 0.0152080, 0.0801092)
SD     <- c(0.9977933, 0.9315480, 1.0310567, 1.0109557, 0.9731961, 0.9554002)

#Compute sample decomposition
sample.decomp(n = N, sample.mean = MEAN, sample.sd = SD, include.sd = TRUE)

              n sample.mean sample.sd sample.var
1           200  0.05564340 0.9977933  0.9955915
2           200  0.01531090 0.9315480  0.8677817
3           200  0.07226230 1.0310567  1.0630779
4           200  0.12115880 1.0109557  1.0220314
5           200  0.01520800 0.9731961  0.9471106
6           200  0.08010920 0.9554002  0.9127895
--pooled-- 1200  0.05994877 0.9825550  0.9654142
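For readers working in Python rather than R, the pooled mean and standard deviation can be recovered from the subgroup statistics with the usual within-plus-between sum-of-squares decomposition. The sketch below is my own illustration and not the `sample.decomp` implementation: the function name `pool_moments` is hypothetical, it covers only the first two moments, and it assumes the subgroup standard deviations use the $n-1$ divisor. Applied to the inputs above, it agrees with the pooled row of the R output.

```python
import numpy as np

def pool_moments(n, mean, sd):
    """Pool subgroup sizes, means and (n-1)-divisor SDs into overall moments.

    Hypothetical helper for illustration; covers mean and variance only.
    """
    n = np.asarray(n, dtype=float)
    mean = np.asarray(mean, dtype=float)
    sd = np.asarray(sd, dtype=float)

    total = n.sum()
    pooled_mean = np.sum(n * mean) / total

    # Total sum of squares = within-group part + between-group part
    within = np.sum((n - 1) * sd**2)
    between = np.sum(n * (mean - pooled_mean) ** 2)
    pooled_var = (within + between) / (total - 1)

    return total, pooled_mean, np.sqrt(pooled_var), pooled_var

# Same subgroup statistics as in the R example
N = [200, 200, 200, 200, 200, 200]
MEAN = [0.0556434, 0.0153109, 0.0722623, 0.1211588, 0.0152080, 0.0801092]
SD = [0.9977933, 0.9315480, 1.0310567, 1.0109557, 0.9731961, 0.9554002]

total, pooled_mean, pooled_sd, pooled_var = pool_moments(N, MEAN, SD)
print(total, pooled_mean, pooled_sd, pooled_var)
```

The between-group term is what a naive average of the subgroup variances would miss. It is small here because the subgroup means are close together.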