I have a different number of measurements from each of several classes. I used one-way ANOVA to see whether the mean of the observations in each class differs from the others. This uses the ratio of the between-class variance to the within-class variance.
Now I want to test whether some classes (basically those with more observations) have a larger variance than expected by chance. What statistical test should I use? I can calculate the sample variance for each class, and then find the $R^2$ and p-value for the correlation of the sample variance vs. class size. Or, in R, I could do
summary(lm(sampleVar ~ classSize))
But the variance of the estimator of the variance (the sample variance) itself depends on the sample size, even for random data.
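To make that dependence concrete, here is a minimal sketch in base R. For normally distributed data with unit variance, the variance of the sample variance is $2/(n-1)$, so it shrinks as the sample size $n$ grows. The sample sizes, the seed, and the number of replicates below are arbitrary choices for illustration, not part of my actual data.

```r
# Simulate how the spread of the sample variance changes with sample size n.
set.seed(42)                      # arbitrary seed, only for reproducibility
sizes <- c(5, 20, 80)             # illustrative sample sizes (assumed)

# For each n: draw 5000 samples of size n, compute var() of each,
# then take the variance of those 5000 sample variances.
var_of_s2 <- sapply(sizes, function(n) var(replicate(5000, var(rnorm(n)))))

# Compare against the normal-theory value 2 * sigma^4 / (n - 1) with sigma = 1.
data.frame(n = sizes, simulated = var_of_s2, theory = 2 / (sizes - 1))
```

Small classes therefore give much noisier variance estimates than large ones, which is the heteroscedasticity I am worried about.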
For example, I generate some random data:
library(data.table)  # provides data.table() and the := syntax used below
dt <- data.table(obs = rnorm(4000), clabel = as.factor(sample(x = 1:200, size = 4000, replace = TRUE, prob = 5 + 1:200)))
I compute the sample variance and class sizes
dt[,classSize := length(obs),by=clabel]; dt[,sampleVar := var(obs),by=clabel]
and then test to see if variance depends on the class size
summary(lm(data=unique(dt[,.(sampleVar, classSize),by=clabel]),formula = sampleVar ~ classSize))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.858047   0.056605  15.159   <2e-16 ***
classSize   0.006035   0.002393   2.521   0.0125 *
There seems to be a dependence of the variance on the class size, but I believe this is simply an artifact: the variance of the estimator depends on the sample size, so the points in this regression are not equally noisy and the usual p-value cannot be trusted. How do I construct a statistical test to see whether the variances in the different classes actually depend on the class sizes?
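One way to check whether the p-value above can be trusted is to repeat the same simulation many times on pure noise and count how often the slope comes out "significant". The following is only a sketch in base R (no data.table needed); the helper name `null_pvalue`, the seed, and the 200 replicates are my own illustrative choices.

```r
# Repeat the noise-only experiment and collect the p-value of the slope.
set.seed(1)  # arbitrary seed, only for reproducibility

null_pvalue <- function(n_obs = 4000, n_class = 200) {
  obs <- rnorm(n_obs)                       # pure noise, same variance everywhere
  clabel <- sample(seq_len(n_class), size = n_obs, replace = TRUE,
                   prob = 5 + seq_len(n_class))
  classSize <- as.vector(table(clabel))             # sizes of the classes that occur
  sampleVar <- as.vector(tapply(obs, clabel, var))  # NA for classes of size 1
  fit <- lm(sampleVar ~ classSize)                  # rows with NA are dropped by lm()
  summary(fit)$coefficients["classSize", "Pr(>|t|)"]
}

pvals <- replicate(200, null_pvalue())

# Fraction of noise-only runs with p < 0.05; a well-calibrated test gives about 0.05.
mean(pvals < 0.05)
```

If that fraction is well away from 0.05, the naive regression is miscalibrated for this design and a different test is needed.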
If the variable I was regressing against were a continuous variable instead of the ordinal variable classSize, then I could have used the Breusch-Pagan test.
For example, I could do fit <- lm(data=dt, formula= obs ~ clabel)
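For reference, this is what I understand the Breusch-Pagan idea to be in the ordinary continuous-regressor setting: regress the squared residuals on the regressor and use $n R^2$ of that auxiliary regression as a chi-squared statistic (the studentized, Koenker form). The sketch below uses base R only; the helper name `bp_studentized` and the simulated `x` and `y` are hypothetical and unrelated to my data.

```r
# Hand-rolled studentized Breusch-Pagan test for a single variance regressor z.
bp_studentized <- function(fit, z) {
  e2 <- residuals(fit)^2                    # squared residuals of the mean model
  aux <- lm(e2 ~ z)                         # auxiliary regression on z
  stat <- length(e2) * summary(aux)$r.squared   # LM statistic n * R^2
  c(statistic = stat,
    p.value = pchisq(stat, df = 1, lower.tail = FALSE))  # one regressor, so df = 1
}

# Illustrative data (assumed): noise standard deviation grows with x.
set.seed(7)                                 # arbitrary seed
x <- runif(500, 1, 10)
y <- 1 + 2 * x + rnorm(500, sd = 0.5 * x)

fit_demo <- lm(y ~ x)
bp_result <- bp_studentized(fit_demo, x)
bp_result
```

What I do not know is whether this carries over when the variance regressor is classSize, since the residuals of obs ~ clabel within a class already have a spread that depends on the class size.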