I have a number of measurement samples, of which some have 2 measurements and some have 3. I wish to make the most accurate estimate of the population variance that I can, and understand that ignoring data is taboo.

[Edit: More specifically, I have measured several different things through the same process. I expect them to have different means (due to different circumstances), but each source to be normally distributed, and I suspect that the nature of the introduced noise (the source of variance) is the same for all of them.

My understanding is that if they have the same (population) variance then that population variance should be used for confidence intervals rather than the individual sample variances, since by Basu's and Cochran's theorems {the sample mean's distance from the population mean} and the sample variance are independent, as a special characteristic of normal distributions.

Each sample has two or three measurements--for which I have two or three numbers believed to be from the same normal distribution--and I wish to check the likelihood of the sample variances originating from the same population variance. To do this (scaled chi-squared distribution comparison), I first want to estimate the population variance.
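To make that comparison step concrete, the sort of check I have in mind is something like this sketch (Python, assuming SciPy is available; the pair of measurements and the variance value are made up for illustration):

```python
import numpy as np
from scipy.stats import chi2

def variance_tail_prob(sample, sigma2):
    """Two-sided tail probability of the observed sample variance under
    the hypothesis that the sample is i.i.d. normal with variance sigma2,
    using the fact that (n - 1) * s^2 / sigma2 is chi-squared with n - 1
    degrees of freedom."""
    n = len(sample)
    stat = (n - 1) * np.var(sample, ddof=1) / sigma2
    p_lower = chi2.cdf(stat, df=n - 1)
    return 2 * min(p_lower, 1 - p_lower)

# Hypothetical pair of measurements against a hypothetical population variance
print(variance_tail_prob([1.2, 1.5], sigma2=0.05))
```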

My current suspicion is now that I should calculate one estimate for n=2 and one estimate for n=3: within each group of sample variances (of the same sample size), correct each divide-by-$n$ sample variance by multiplying by $\frac{n}{n - 1 + \frac{2}{SampleNumber_n}}$ (so that each sum of squares ends up divided by $n - 1 + \frac{2}{SampleNumber_n}$) before averaging, where $SampleNumber_n$ is the number of samples of size $n$. After getting two numbers, I should then take an inverse-MSE weighted average (using MSE as the variance relative to the population mean), where the 'MSE' term is according to the equation in the answer to 'Estimate of variance with the lowest mean square error' linked below, with the population-variance term cancelling between numerator and denominator.
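In code, the procedure I am imagining is roughly the following (only a sketch: the arrays are made-up stand-ins for my data, and the weights use the relative MSE from the linked answer with the $\sigma^4$ factor cancelled):

```python
import numpy as np

def corrected_group_estimate(samples, n):
    """Average the divide-by-n sample variances of all k samples of size n,
    rescaled so each sum of squares is divided by n - 1 + 2/k."""
    k = len(samples)
    biased_vars = [np.var(s) for s in samples]  # np.var defaults to ddof=0
    correction = n / (n - 1 + 2.0 / k)
    return correction * np.mean(biased_vars)

def relative_mse(n, k):
    """MSE / sigma^4 of the averaged estimator at its optimal divisor."""
    d = n - 1 + 2.0 / k
    return 2 * (n - 1) / (k * d**2) + ((n - 1) / d - 1) ** 2

# Made-up data: some things measured twice, some three times
pairs   = [np.array([1.2, 1.5]), np.array([3.1, 2.8])]            # n = 2
triples = [np.array([0.9, 1.1, 1.3]), np.array([2.0, 2.4, 2.1])]  # n = 3

est2 = corrected_group_estimate(pairs, 2)
est3 = corrected_group_estimate(triples, 3)

# Inverse-(relative-)MSE weighted average of the two group estimates
w2, w3 = 1 / relative_mse(2, len(pairs)), 1 / relative_mse(3, len(triples))
print((w2 * est2 + w3 * est3) / (w2 + w3))
```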

However, though this intuitively sounds hopeful, I remember often suspecting things in the past that made intuitive sense and then turned out to be incorrect. Beyond that, I am aware that my streams of consciousness (trains of thought) can be hard to follow, and I feel that even if I were correct about this, it would be better to cite an authoritative source that this is a valid course of action than to justify it on a shaky foundation--either that or, again if it were correct, to improve my understanding to the point where I can confidently explain why it must be true.]

Relevant links are Bessel's correction (Caveats) and Estimate of variance with the lowest mean square error.

I understand that $MSE = Var + Bias^2$ .

If I understand correctly, $Var(Mean) = \frac{Var}{SampleNumber}$ .

If I understand correctly, $MSE(Mean) = \frac{Var}{SampleNumber} + Bias^2$, such that as the sample number approaches infinity, the distance of the mean from the true value approaches the Bias.
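A quick simulation seems consistent with this decomposition (all the numbers here are invented):

```python
import numpy as np

rng = np.random.default_rng(0)
true_value, bias, sd, k = 5.0, 0.3, 2.0, 10    # made-up parameters

# Each row holds k estimates with the given bias and standard deviation;
# the row mean is the averaged estimator
trials = rng.normal(true_value + bias, sd, size=(200_000, k)).mean(axis=1)

mse_empirical = np.mean((trials - true_value) ** 2)
mse_predicted = sd**2 / k + bias**2            # Var/SampleNumber + Bias^2
print(mse_empirical, mse_predicted)            # these agree closely
```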

In the linked question's answer, for a single sample variance with divisor $d$, $\text{Var}(s_d^2) = 2\sigma^4(n - 1) / d^2$, and minimizing the MSE gives $d = n + 1$. However, if I understand correctly, if $SampleNumber$ sample variances with the same degrees of freedom were averaged together, the $Bias^2$ term in the mean's MSE equation would be the same, whereas the mean's $Var$ would shrink to $2\sigma^4(n - 1) / (SampleNumber \cdot d^2)$ -- for example $\sigma^4(n - 1) / d^2$ when $SampleNumber = 2$, for which the minimizing divisor becomes $d = n - 1 + 1$ -- or more generally $d = n - 1 + \frac{2}{SampleNumber}$.
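That divisor also seems to check out numerically, if I minimize the MSE (divided by $\sigma^4$, which does not affect the minimizer) over a grid of divisors (the $(n, SampleNumber)$ combinations below are just examples):

```python
import numpy as np

def mse_over_sigma4(d, n, k):
    # MSE / sigma^4 of the average of k sample variances with divisor d
    return 2 * (n - 1) / (k * d**2) + ((n - 1) / d - 1) ** 2

d_grid = np.linspace(0.5, 6.0, 110_001)
for n in (2, 3):
    for k in (1, 2, 5):
        d_best = d_grid[np.argmin(mse_over_sigma4(d_grid, n, k))]
        print(f"n={n}, k={k}: numeric minimum {d_best:.3f} "
              f"vs n-1+2/k = {n - 1 + 2 / k:.3f}")
```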

Returning to my situation of choosing corrections for and combining (averaging?) sample variances: if I used the $n-1$ correction to make each estimator unbiased, then the course of action would be straightforward, via the identity (writing $\mu$ for the population mean and $\bar{x}$ for the sample mean):

$$\sum_i (x_i - \mu)^2 = \sum_i (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2$$

When the estimators are unbiased, I can add the two sums of squares (about the sample means) directly, then divide by the total degrees of freedom to get a pooled estimate with proportionally shrunk variance, without worrying about the $Bias^2$ terms, which are both equal to 0.
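As a sanity check on both the identity and the pooling (with invented numbers):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 10.0, 2.0                       # made-up population parameters
x = rng.normal(mu, sigma, size=5)
xbar, n = x.mean(), len(x)

# The identity: SS about the population mean equals SS about the sample
# mean plus n times the squared distance between the two means
lhs = np.sum((x - mu) ** 2)
rhs = np.sum((x - xbar) ** 2) + n * (xbar - mu) ** 2
print(lhs, rhs)                             # equal up to floating point

# Pooling two unbiased (n-1)-corrected estimates: add the sums of squares
# about each sample mean, divide by the total degrees of freedom
y = rng.normal(mu + 1.0, sigma, size=3)     # different mean, same variance
ss = np.sum((x - x.mean()) ** 2) + np.sum((y - y.mean()) ** 2)
print(ss / ((len(x) - 1) + (len(y) - 1)))   # unbiased estimate of sigma**2
```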

However, if I try to see what happens when I attempt the same for lowest-MSE corrections, writing $\bar{x}_{\text{new}}$ for the new (combined) sample mean and $\bar{x}_{\text{old}}$ for the old one, I get into this sort of a tangle:

$$\sum_i (x_i - \mu)^2 = \sum_i (x_i - \bar{x}_{\text{new}})^2 + n(\bar{x}_{\text{new}} - \bar{x}_{\text{old}})^2 + 2n(\bar{x}_{\text{new}} - \bar{x}_{\text{old}})(\bar{x}_{\text{old}} - \mu) + n(\bar{x}_{\text{old}} - \mu)^2$$

--where the cross term makes it more of a tangle the more I try to expand it (and add the terms for the two distributions).

What to do?

  • The useful part of this question seems to be the first paragraph--it's difficult to see how the rest clarifies it. Could you instead explain what your population is and describe your "measurement samples" in a little more detail? – whuber Nov 16 '19 at 16:55
  • @whuber Thank you for your response! I have tried to elaborate in the [Edit:] added; please tell me if this has not made it clearer! – MCC Nov 18 '19 at 16:29
  • You seem to describe a standard ANOVA. How do you conceive of your problem as differing from that? – whuber Nov 18 '19 at 17:30
  • @whuber My understanding of the term 'ANOVA' is of it being an opaque black-box computer program that you put numbers into and get p-values out of without any clues as to how they were generated, a p-value being {the probability of the data given the hypothesis} rather than {the probability of the hypothesis given the data} or a likelihood ratio. By contrast, I wish to understand what I am doing and be able to explain why it makes sense or not. Have I misunderstood something critical? – MCC Nov 18 '19 at 17:36
  • Yes, you have misunderstood. ANOVA is short for "analysis of variance," which literally means separating the variances of the measurement errors from the variances among the underlying means. ANOVA output conventionally includes estimates of those *variance components.* Although I cannot discern what you're actually trying to do, most of what you write sounds like you need exactly that analysis. – whuber Nov 18 '19 at 17:39
  • @whuber I am somewhat familiar with variance and bias and sums of squares and corrections (e.g. Bessel's) and averages and weighted averages and normal distributions and chi-squared distributions and degrees of freedom, but when I try to read about ANOVA it begins talking about F-tests and F-tables and things which appear like bewildering gobbledygook to me. – MCC Nov 18 '19 at 18:03
  • See https://www.itl.nist.gov/div898/handbook/prc/section4/prc44.htm for instance. – whuber Nov 18 '19 at 19:12
  • @whuber Thank you for the link. Reading more, ANOVA seems to deal with accepting or rejecting the hypothesis of equal population means, rather than accepting or rejecting the hypothesis of equal population variance, which is what I want to do at the moment. // Also, StackExchange is telling me 'Please avoid extended discussions in comments. Would you like to automatically move this discussion to chat?', but is chat synchronous (real-time) or asynchronous conversation? Asynchronous would be fine, but synchronous would not be feasible for me. – MCC Nov 19 '19 at 04:24
