Pooled (or combined?) variance - is it suitable to use in the below scenario?

Question

I have a dataset with measurements and their deviations from the nominal value. These measurements have been made on 3 different categories of a component. The components differ in physical properties (size, weight, volume, etc.). Below are some more details:

3 categories: A,B,C
Number of observations for each category: $n_A$ = 2025, $n_B$ = 13507, $n_C$ = 21511
Each component has its own nominal value: 0.8, 0.6, 1.00 respectively
The measurements and deviations can be between $-\infty$ and $+\infty$
The simple means of the measurements: $\overline{x}_A$ = 0.976, $\overline{x}_B$ = 0.908, $\overline{x}_C$ = 0.806
The standard deviations of the measurements: $\sigma_A$ = 0.062, $\sigma_B$ = 0.069, $\sigma_C$ = 0.062
The simple means of the deviations from nominal value : $\overline{d}_A$ = 0.062, $\overline{d}_B$ = 0.035, $\overline{d}_C$ = -0.024
The data for each of the categories is close to normal (visual inspection using normal qq plot)
I have performed Bartlett's test for homogeneity of variances. There are significant differences between the sample variances.

The aim is to find a single statistic that can serve as a representative value for the variances of the 3 categories, in the sense that if this statistic increases (or decreases) tomorrow, that can be used as an indication to investigate the causes of variation further.

I have gone through several questions on Cross Validated related to the topic of pooled variance and this one, I believe, is very close to the question I am asking. However, I am not sure if the formula given in the answer can be applied to my case. If not, then which statistic can I use in this case? If yes, then are there any assumptions that I should check for violation before applying the formula?

I cannot post the entire data, but I have attached a screenshot of a portion of the data to give you an idea of what it looks like.

Christian Hennig · Answer 1 · 2021-09-25T08:55:23.923

Assuming that the three true underlying variances are different from each other, there's no "correct" statistical method to aggregate them into one number. From a statistical modelling perspective you should have all three.

If for whatever reason you want a single number (index) anyway, this needs to involve decisions regarding the subject matter. There is more than one approach that could make sense.

You could compute a mean of all the category standard deviations. This would give all the standard deviations the same weight, indicating that they all have the same importance for whatever decision you want to make based on them.
You could compute the mean of all category variances. Same characteristics as before, but slightly different, because this gives higher weight to the larger standard deviations/variances.
You may want to weight by number of observations (you may think that the more observations there are, the more relevant a value is; also, the more observations there are, the more reliable the individual variances/sds are). One way of doing this is to add up, over all observations, the squared differences from the respective category mean and divide by $n-3$ or $n$, where $n$ is the total number of observations (chances are that the choice between the two divisors doesn't make a relevant difference).
If you want to pick up situations where at least one category variance changes strongly, you could look at the maximum change out of all three from one time point (or some initial reference value) to the next.
You may want to bring in information from nominal values and deviations from them somehow, even though without understanding of the meanings of those and the exact aim of computing the index, I have no idea how.
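As a rough illustration of the first four options listed above, here is a minimal Python sketch that works from the summary statistics quoted in the question. The function names are made up for this example, and the pooled version assumes the reported standard deviations were computed with the usual $n_i - 1$ denominator, so that $(n_i - 1)s_i^2$ recovers each category's sum of squared differences from its own mean. Which index (if any) is appropriate remains a subject-matter decision.

```python
# Summary statistics quoted in the question: counts and standard deviations.
n = {"A": 2025, "B": 13507, "C": 21511}
sd = {"A": 0.062, "B": 0.069, "C": 0.062}


def mean_sd(sd):
    """Unweighted mean of the category standard deviations."""
    return sum(sd.values()) / len(sd)


def mean_variance(sd):
    """Unweighted mean of the category variances (larger sds weigh more)."""
    return sum(s ** 2 for s in sd.values()) / len(sd)


def pooled_variance(n, sd):
    """Observation-weighted variance: within-category sums of squares / (n - k).

    Assumes each sd was computed with an (n_i - 1) denominator, so
    (n_i - 1) * sd_i**2 is the sum of squared differences from the
    category mean. k is the number of categories (3 here).
    """
    within_ss = sum((n[c] - 1) * sd[c] ** 2 for c in n)
    return within_ss / (sum(n.values()) - len(n))


def max_relative_change(sd_ref, sd_new):
    """Largest relative change in any category sd versus a reference value."""
    return max(abs(sd_new[c] - sd_ref[c]) / sd_ref[c] for c in sd_ref)


if __name__ == "__main__":
    print("mean of sds:      ", mean_sd(sd))
    print("mean of variances:", mean_variance(sd))
    print("pooled variance:  ", pooled_variance(n, sd))
    # Hypothetical later time point in which only category B's sd grows.
    later = dict(sd, B=sd["B"] * 1.5)
    print("max rel. change:  ", max_relative_change(sd, later))
```

With counts this large, dividing by $n$ instead of $n-3$ changes the pooled value only negligibly. Note also that the pooled index is dominated by categories B and C, so a change in category A alone would move it very little, whereas the maximum-change index would pick it up.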

The thing that you need to understand here is that the task is not to estimate some statistically assumed "truth", but rather that you need to decide how to aggregate the information that you have in order to be fit for the purpose you have in mind. (One could trace this back to a statistical modelling problem if there was a more detailed model about how the values are produced, and what kind of problem the index is meant to detect, but that seems out of reach based on the given information.)

You may also want to ask yourself whether you could make things work by monitoring all three standard deviations separately rather than collapsing them into a single number.
