
My dataset contains samples from a set of normal random variables (RVs). The RVs share a common variance but have different means. However, I have only two samples from each RV.

How can I estimate the variance in this case?

2 Answers


These data can be described by two variables: one, a categorical variable $x$, identifies each random variable. Another, $Y$, gives an observation of that random variable. Thus, in a tabular rendering of your dataset you would see two columns (one identifying the random variable and another for the observed value) and two rows for each random variable.

Your model allows the mean $\mu$ to vary with $x$:

$$Y(x) \sim \operatorname{Normal}(\mu(x), \sigma^2).$$

Equivalently,

$$Y(x) = \mu(x) + \varepsilon(x)$$

where the $\varepsilon(x)$ are independent and identically distributed Normal$(0,\sigma^2)$ variables. This is the standard regression setting.

Arbitrarily writing one observation of the random variable identified by $x$ as $y_1(x)$ and the other as $y_2(x),$ and letting $n$ be the number of random variables, the (unbiased) least squares estimate of $\sigma^2$ is

$$\hat\sigma^2 = \frac{1}{2n}\sum_{x} (y_1(x) - y_2(x))^2.$$

In retrospect this is obvious because the differences $y_1(x)-y_2(x)$ are independent and have Normal$(0,2\sigma^2)$ distributions, so each squared difference has expectation $2\sigma^2.$
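As a rough check of the formula, here is a minimal Python sketch that computes the estimate from paired observations and tries it on simulated data. The function names, the range of means, and the simulation settings (`n`, `sigma`, `seed`) are illustrative assumptions, not part of the question.

```python
import random


def paired_variance_estimate(y1, y2):
    """Estimate the common variance from two observations per random variable.

    y1[i] and y2[i] are the two observations of the i-th random variable.
    """
    if len(y1) != len(y2) or not y1:
        raise ValueError("need two equal-length, non-empty sequences")
    n = len(y1)
    # Sum of squared within-pair differences, divided by 2n.
    return sum((a - b) ** 2 for a, b in zip(y1, y2)) / (2 * n)


def simulate(n=20000, sigma=2.0, seed=0):
    """Draw n pairs with different (made-up) means and a shared sigma."""
    rng = random.Random(seed)
    means = [rng.uniform(-50, 50) for _ in range(n)]
    y1 = [rng.gauss(m, sigma) for m in means]
    y2 = [rng.gauss(m, sigma) for m in means]
    return paired_variance_estimate(y1, y2)


if __name__ == "__main__":
    # The printed value should land near sigma**2 = 4.0.
    print(simulate())
```

The means cancel in each difference, which is why they can vary freely without affecting the estimate.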

whuber
  • This is what I needed. I'm just still not sure how you derived the (unbiased) least squares estimate of the variance. Similarly, how would that generalize if you had 3 samples from each RV as opposed to only 2? – Marcelo Mattar Mar 21 '18 at 22:54
  • The result is a straightforward application of the usual least squares formulae: see https://stats.stackexchange.com/questions/20227 or https://stats.stackexchange.com/questions/76738, for instance. Alternatively, you could look up standard ANOVA formulae. The unbiasedness of this estimator of $\sigma^2$ is evident because the expectation of $(Y_1(x)-Y_2(x))^2$ is the variance of a Normal$(0,2\sigma^2)$ variable, whence the sum of $n$ such expectations is $n$ times this variance, or $2n\sigma^2.$ – whuber Mar 21 '18 at 23:23
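
On the follow-up about three (or, generally, $k \ge 2$) observations per random variable, the standard one-way ANOVA within-group mean square offers one way to write the generalization. This is a sketch: the symbols $k$, $y_j(x)$ and $\bar y(x)$ are introduced here for illustration and do not appear in the answer.

$$\hat\sigma^2 = \frac{1}{n(k-1)}\sum_{x}\sum_{j=1}^{k}\left(y_j(x)-\bar y(x)\right)^2, \qquad \bar y(x) = \frac{1}{k}\sum_{j=1}^{k} y_j(x).$$

With $k=2$ each deviation from the pair mean is $\pm\tfrac{1}{2}(y_1(x)-y_2(x)),$ so the inner sum equals $\tfrac{1}{2}(y_1(x)-y_2(x))^2$ and the expression reduces to

$$\hat\sigma^2 = \frac{1}{2n}\sum_{x} (y_1(x) - y_2(x))^2.$$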
0

Recall (population) variance is a measure of variability around a (population) mean.

Your dataset contains two samples from each of several RVs, all of which you know to have the same variance.

We first must compute the sample mean, and then use that mean to estimate the sample variance.

The problem is, we have to 'spend' some observations to estimate the mean and then 'spend' further observations to estimate the sample variance.

At minimum, you'd need two points to calculate an average, and at least one more point to estimate the squared deviation of each point from the sample average.

  • Your conclusion "at minimum..." may be misleading. A single sample of two will suffice to make *some* estimate of the variance. When multiple samples are available, those estimates can be combined because they all independently estimate the same quantity. – whuber Mar 21 '18 at 21:41
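
The point in this comment can be illustrated with a short Python sketch: the usual sample variance (denominator $n-1$) is already defined for a single pair, and averaging those per-pair values is one way to combine them. The function names and the toy numbers are hypothetical, chosen only for illustration.

```python
import statistics


def single_pair_estimate(a, b):
    """Sample variance of one pair; algebraically equal to (a - b)**2 / 2."""
    return statistics.variance([a, b])


def combined_estimate(pairs):
    """Average the per-pair estimates, which all target the same variance."""
    return statistics.mean(single_pair_estimate(a, b) for a, b in pairs)


if __name__ == "__main__":
    pairs = [(1.0, 3.0), (2.0, 2.0)]  # toy data: two pairs with different means
    print(single_pair_estimate(1.0, 3.0))
    print(combined_estimate(pairs))
```

Averaging $n$ values of $(a-b)^2/2$ gives $\frac{1}{2n}\sum (a-b)^2$, so this combined value agrees with the pooled estimate in the first answer.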