3

I have different number of measurements from various classes. I used one-way anova to see if the means of the observations in each class is different from others. This used the ratio of the between-class variance to the total variance.

Now, I want to test whether some classes (basically those with more observations) have a larger variance than expected by chance. What statistical test should I do? I can calculate the sample variance for each class, and then find the $R^2$ and p-value for the correlation of the sample variance vs. class size. Or in R, I could do

summary(lm(sampleVar ~ classSize))

But the variance of the esitmator of variance (sample variance) depends on the sample size, even for random data.

For example, I generate some random data:

dt <- as.data.table(data.frame(obs=rnorm(4000), clabel=as.factor(sample(x = c(1:200),size = 4000, replace = T, prob = 5+c(1:200)))))

I compute the sample variance and class sizes

dt[,classSize := length(obs),by=clabel]; dt[,sampleVar := var(obs),by=clabel]

and then test to see if variance depends on the class size

summary(lm(data=unique(dt[,.(sampleVar, classSize),by=clabel]),formula = sampleVar ~ classSize))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.858047   0.056605  15.159   <2e-16 ***
classSize   0.006035   0.002393   2.521   0.0125 *  

There seems to be a dependence of the variance with the class size, but this is simply because the variance of the estimator depends on the sample size. How do I construct a statistical test to see if the variances in the different classes are actually dependent on the class sizes?

If my the variable I was regressing against was a continuous variable instead of the ordinal variable classSize, then I could have used the Breusch-Pagan test.

For example, I could do fit <- lm(data=dt, formula= obs ~ clabel)

highBandWidth
  • 2,092
  • 2
  • 21
  • 34

1 Answers1

2

It's important to keep in mind that you are going to have to write down some model for the variances, and as the variances cannot be estimated independently of the means, you will simultaneously need a model for the means.

Are you willing to assume that the distribution of measurements within any particular class is normal? I assume so, since you analyzed the data with an anova, but this is an important assumption that affects how you can compare variance between classes.

I assume that you want to estimate the class means independently, though there's not enough information in the question to be sure that you don't treat the effect of class as random (rather than fixed). Importantly, decisions like this could affect your inference about the variances!

Now that you've narrowed down a model for the means, you need to focus down to a specific model for the data by figuring out how to model the variances. Do you want a model where class variances differ strictly according to some linear predictor (e.g. the variances are given exactly by a linear function of the class size?), or a model where each class has its own independent variance, or a model like the one you describe where the class variances are modeled as a linear regression (with homogeneous normal error) on the class size, or... ?

In general, it is hard to get good estimates for variance parameters. MCMC fitting routines often (presumably correctly) yield higher uncertainties for variance parameters than homologous frequentist estimates. Therefore, I suggest that an MCMC fitting routine with vague priors might be the best way to get the inference you seek.

The BUGS/JAGS likelihood associated with the linear regression you describe (where class means are estimated separately and class variances come from a linear regression on class size) would look like this:

for(i in 1:n_samples){
sample[i] ~ dnorm(mu[class[i]], tau[class[i]])
}

for(j in 1:n_classes){
  tau[class[i]] <- 1/var[class[i]]
  var[class[i]] ~ dnorm(m*class_size[j]+b, tau2)
}

Then you want inference on the parameter m. But this is just one highly specific model for a particular form of the means and variances!

Jacob Socolar
  • 1,768
  • 10
  • 18
  • Thank you! Is there a frequentist closed form solution assuming a normal distribution of data and linear relation between variance and class size or other ordinal variable. I also have a continuous variable for which I want to test the dependence of the variance. In each case, the data is divided into classes. So `fit – highBandWidth Mar 27 '17 at 18:31
  • I'm really not too sure about a frequentist closed-form solution. I also don't know much about possible categorical extensions of the Breusch Pagan test. It wouldn't surprise me if there is a version of that test available in standard statistical software that can handle categorical predictors, but I don't know how to find it. Notice that the Breusch Pagan test checks for association between the variance and the class mean, not between the variance and the number of observations in the class. Your OP suggests you might be interested in the latter (?) in which case B-P is not appropriate. – Jacob Socolar Mar 27 '17 at 19:49
  • As I understand it, Breusch Pagan tests association between the variance and possible predictors. Which is fine for continuous predictors, but in my case the the predictor (class-label) is categorical. I can treat class size as another predictor for BP, but I also have other possible predictors. I'm sorry the question is sort of asking two different things. – highBandWidth Mar 27 '17 at 20:14
  • Yup, you're completely right. My mistake. In a quick refresher on the Breusch Pagan test I misread "association with INdependent variables" as "association with DEpendent variables". – Jacob Socolar Mar 27 '17 at 20:17
  • I mean BP test is built on the assumption of an iid normal distribution of the residuals. All I need to know is the distribution of the residuals when the predictor is categorical. I think they are still Gaussian, but we expect different variance depending on the sample size for each category. If not, BP test should still be valid. If the variances vary for the null model, then BP test should be modified. – highBandWidth Mar 27 '17 at 21:11
  • You're right that the influence of sample size on sample variance does not interfere with our ability to assume (if we think our assumption is justified) that the population variance are equal. This equality of population variance is what is assumed by anova, linear regression, etc. It is also the null model against which BP tests. There shouldn't be any additional issues applying BP to data with a categorical predictor, as far as I know. – Jacob Socolar Mar 30 '17 at 05:08