
I have a data set of active layer depths from an Arctic field site. There are two factors in the data set: Month measured (July or August) and Location (shrub patch or open tundra). I had intended to run a two-way ANOVA to test for differences between treatments, so I tested the model assumptions graphically. The Q-Q plots indicate normality, but the plot of the residuals against the fitted values indicates heteroskedasticity. I should also note that interaction plots suggest essentially no interaction between the factors.

I have tried a log transformation, which seems to equalize the variance, but the distribution then becomes non-normal. Before I start randomly applying transformations, I wanted to explore alternatives. My understanding is that there is a non-parametric alternative to a two-way ANOVA, but I am wondering if there is a more effective tool. For example, is there an appropriate way of using a generalized linear model (GLM) to solve this problem? I am new to the world of GLMs, but I have been told they are more versatile.

I have added a box plot of the data to clarify: *Depth to frost table by location and month*.
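To make the diagnostic checks described above concrete, here is a minimal sketch in Python. The real data are not reproduced in the thread, so the depths below are simulated (means and spreads are made up), and the two plots are replaced by numeric stand-ins: the probability-plot correlation for the Q-Q check, and residual spread split by fitted value for the heteroskedasticity check.

```python
# Minimal sketch: fit an additive two-factor model and check the two
# diagnostics described above, using simulated depths (not the real data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
month = np.repeat(["July", "August"], 40)
loc = np.tile(np.repeat(["shrub", "open"], 20), 2)

# Simulated depths: standard deviation grows with the mean (heteroskedastic).
means = {("July", "shrub"): 30, ("July", "open"): 45,
         ("August", "shrub"): 60, ("August", "open"): 90}
mu = np.array([means[(m, l)] for m, l in zip(month, loc)])
y = rng.normal(mu, 0.15 * mu)

# Additive model: intercept + Month dummy + Location dummy (no interaction).
X = np.column_stack([np.ones_like(y),
                     (month == "August").astype(float),
                     (loc == "open").astype(float)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
resid = y - fitted

# Q-Q check, numerically: probability-plot correlation near 1 suggests
# the residuals are close to normal.
(_, _), (slope, intercept, r) = stats.probplot(resid)
print("Q-Q correlation:", round(r, 3))

# Residual spread by fitted value: growing spread flags heteroskedasticity.
lo = resid[fitted < np.median(fitted)].std(ddof=1)
hi = resid[fitted >= np.median(fitted)].std(ddof=1)
print("residual SD, low vs high fitted:", round(lo, 1), round(hi, 1))
```

With variance tied to the mean as simulated here, the residual SD in the high-fitted half comes out clearly larger than in the low-fitted half, mirroring the fan shape the question describes.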

C W
  • Would you describe your target or dependent variable, "active layer depths"? This would help to illuminate some possible next steps. Of all of the assumptions of regression and ANOVA, constant error variance is one of the weaker ones -- Gaussian-based tests are fairly robust to heteroscedasticity -- so this may or may not be problematic. Regardless, there are any number of other approaches to consider, including GLMs, splines, quantile regression, etc., depending on how your DV is scaled and specified. – Mike Hunter Nov 02 '15 at 15:54
  • Sure, it is essentially just the depth from the top of the soil to the frozen part of the soil. You measure it by seeing how far you can stick a metal rod into the ground. You can think of it as depth to permafrost if that makes more sense; it's technically not the same thing, but the distinction probably isn't important for this question. It ranges from about 15 cm to 110 cm at our site. Does that clarify it a bit? – C W Nov 02 '15 at 16:31
  • Given that the values for your DV are all greater than zero and are integers, it might make sense to consider a poisson (variance equals the mean) or negative binomial model (in the more likely case of overdispersion). – Mike Hunter Nov 02 '15 at 16:38
  • Depth is a **measured** variable here, in principle continuous. (I see nothing about integers....) Why July vs August? Does this reflect, e.g., just two separate visits to each site? Normally I would expect from the literature monitoring of sites at least daily during a field season. I agree with the idea of a GLM with log link; I would guess at the physical fact that the top layer is frozen in winter, but then there is nothing to study, so zeros that exist in principle when you don't measure will not bite. Can you show (give and/or plot) the data? – Nick Cox Nov 02 '15 at 17:09
  • @nickcox Please *clearly* motivate why an integer model wouldn't work in this case – Mike Hunter Nov 02 '15 at 17:20
  • The data are in principle not integers, as depths could be 23.4 cm or whatever: please tell me if that needs further explanation. I did **not** say, however, that an integer model wouldn't work; I suggested a GLM with log link, and it is well known that the choice of log link is more important than the distribution family postulated. See e.g. http://blog.stata.com/2011/08/22/use-poisson-rather-than-regress-tell-a-friend/ for reasoning why a Poisson model could be a reasonable first approximation even for a response like income, because the crucial idea is just $y = \exp(Xb)$. – Nick Cox Nov 02 '15 at 17:27
  • The measure of depth is "continuous" but not normally distributed since all of the values are greater than zero. 23.4 cm is convertible to 234 mm, retaining the integer value. – Mike Hunter Nov 02 '15 at 18:42
  • It's one thing to say that a log-linear model [the terminology "Poisson" doesn't help] might work here (as said, I agree) and quite another to say that all measurements are really counts. In a sense yes, but the fiction here looks like wishful thinking. By the same argument 23.45 cm is "really" 2345 with units of 0.1 mm. Personally I wouldn't want any trick to depend on a convention about resolution of measurement. Note that the variance of a length cannot have the same units as the mean of a length; this turns out not to matter much, but analysts might feel queasy at the sleight of hand. – Nick Cox Nov 02 '15 at 18:58
  • It would be completely inaccurate to claim that all measurements are counts, and I didn't suggest that. My point is that fine, theoretical distinctions between ordinal and interval scales are fungible and frequently overridden by practical considerations in predictive modeling. For instance, a Likert-type scale is inherently ordinal but is frequently treated as "continuous" when taking averages and applying t-tests, etc. Similarly, unit sales are counts but, for many practical applications, are considered "continuous." – Mike Hunter Nov 02 '15 at 19:47
  • I am not particularly dogmatic about measurement scales, and I do agree that a counted scale with many distinct values is often reasonably regarded as continuous; I am queasier about going in the other direction. But counts and depths (in this case) are both ratio scales, so what is, or is not, ordinal or interval is something else altogether. In this case, I'd like to see the data and/or various explicit analyses and then we could make informed comments about which assumptions seem crucial and which incidental in practice. – Nick Cox Nov 02 '15 at 20:03
  • @NickCox July and August were the months we measured and likely the most important for the relationship between frost table depth and the plants we are studying. If the primary goal of the study was to understand frost table changes then we would try to measure daily but in our case we don't need that kind of detail. Does that answer your question? – C W Nov 03 '15 at 19:07
  • So, one measurement each per month per site? Can you list the data themselves? I see four groups and I would certainly try a GLM with log link first. Note that a clear implication of the box plots is that 25% of the values of August depth are in rather small intervals, between the minimum and the lower quartile. Are the replicates in each group on (e.g.) a topographic profile? – Nick Cox Nov 03 '15 at 19:18
  • I agree with @AlaskaRon's answer below that a weighted LS ANOVA is the way to go for your data. For an overview of possible approaches in this situation, it may help you to read my answer here: [Alternatives to one-way ANOVA for heteroskedastic data](http://stats.stackexchange.com/a/91881/7290). – gung - Reinstate Monica Nov 16 '15 at 18:09

1 Answer


What is your sample size? If it is fairly large, the tests will be robust to non-normality unless there are extreme outliers in the data (mild non-normality is one of the most ignorable assumptions). An alternative to transforming the data is to weight the observations in each cell (combination of factors) by $w = 1/s^2$, where $s^2$ is the sample variance of the values in that cell. There is also an old rule of thumb for balanced ANOVA that says you don't have to worry much about non-constant variance unless some cells have standard deviations more than twice those of other cells.
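The weighting scheme above can be sketched in a few lines. This uses simulated depths (the real data aren't reproduced in the thread, and the effect sizes below are invented): estimate $s^2$ within each of the four Month × Location cells, set $w = 1/s^2$, and fit the additive model by weighted least squares, which is just ordinary least squares after scaling each row by $\sqrt{w}$.

```python
# Sketch of the 1/s^2 cell-weighting idea, on simulated heteroskedastic data.
import numpy as np

rng = np.random.default_rng(1)
month = np.repeat([0, 1], 30)            # 0 = July, 1 = August
loc = np.tile(np.repeat([0, 1], 15), 2)  # 0 = shrub, 1 = open
mu = 30 + 30 * month + 15 * loc          # invented cell means
y = rng.normal(mu, 0.15 * mu)            # SD grows with the mean

# Per-cell weights: w = 1/s^2, with s^2 the sample variance in that cell.
w = np.empty_like(y)
for m in (0, 1):
    for l in (0, 1):
        cell = (month == m) & (loc == l)
        w[cell] = 1.0 / y[cell].var(ddof=1)

# Weighted least squares: scale rows of X and y by sqrt(w), then solve OLS.
X = np.column_stack([np.ones_like(y), month.astype(float), loc.astype(float)])
sw = np.sqrt(w)
beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
print("WLS coefficients (intercept, August, open):", np.round(beta, 1))
```

The F tests for the weighted ANOVA then come from comparing weighted residual sums of squares of nested models, exactly as in the unweighted case.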

The integer-response GLMs mentioned above -- Poisson regression, logistic regression, negative binomial -- aren't appropriate, since you have, it appears, a positive-valued continuous response. However (thanks for jogging my memory, Nick), you MIGHT consider a GLM with the Gamma family, since that is compatible with non-constant variance; but (in R at least) it has a fairly rigid model for the variance (it assumes a constant shape parameter), so it might not fit any better. It also might be harder to interpret and explain to others than a weighted ANOVA (or even an unweighted ANOVA), so for practical reasons I'd lean toward the ANOVA and not a GLM.
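For the curious, the Gamma-with-log-link fit can be sketched from scratch. This is a minimal Python stand-in for what R's `glm(y ~ month + loc, family = Gamma(link = "log"))` does via iteratively reweighted least squares, again on simulated data with invented coefficients. A convenient fact: for the Gamma family with log link the IRLS working weights are constant, so each iteration reduces to an ordinary least-squares solve.

```python
# Sketch of a Gamma GLM with log link fitted by IRLS, from scratch.
import numpy as np

def gamma_glm_log(X, y, n_iter=50, tol=1e-10):
    """IRLS for a Gamma GLM with log link.

    With mu = exp(eta), the working weights (dmu/deta)^2 / V(mu) = 1 are
    constant, so each step is a plain least-squares fit of the working
    response z on X."""
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())          # start from the intercept-only fit
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)
        z = eta + (y - mu) / mu         # working response
        new, *_ = np.linalg.lstsq(X, z, rcond=None)
        if np.max(np.abs(new - beta)) < tol:
            beta = new
            break
        beta = new
    return beta

rng = np.random.default_rng(2)
n = 400
X = np.column_stack([np.ones(n), rng.integers(0, 2, n), rng.integers(0, 2, n)])
true = np.array([3.4, 0.5, -0.3])       # invented coefficients on the log scale
k = 10.0                                # Gamma shape: SD proportional to mean
y = rng.gamma(k, np.exp(X @ true) / k)

beta = gamma_glm_log(X, y)
print("estimated coefficients:", np.round(beta, 2))
```

The constant-shape assumption mentioned above is visible here: $k$ is the same for every observation, which pins the coefficient of variation to a single value across all cells.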

EDIT: Now that I see your data, I'd suggest that you try a transformation, such as $y_i^* = \sqrt{y_i}$, then apply a regular ANOVA. This is because categories with larger means are associated with larger variances, and the square-root transformation will often correct that (more generally, the Box-Cox transformation can be used).
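A quick illustration of why the square root helps: when the variance grows in proportion to the mean, $\sqrt{y}$ has roughly constant variance across groups. The groups below are simulated (not the field data), and `scipy.stats.boxcox` is used to estimate the transformation exponent by maximum likelihood ($\lambda = 0.5$ corresponds to the square root, $\lambda = 0$ to the log).

```python
# Sketch: variance proportional to the mean, stabilized by sqrt(y).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Gamma groups with fixed scale: mean = 4k, variance = 16k, so var ~ mean.
groups = [rng.gamma(k, 4.0, size=60) for k in (2, 4, 8, 16)]

sd_raw = [g.std(ddof=1) for g in groups]
sd_sqrt = [np.sqrt(g).std(ddof=1) for g in groups]
print("SD ratio (max/min), raw :", round(max(sd_raw) / min(sd_raw), 2))
print("SD ratio (max/min), sqrt:", round(max(sd_sqrt) / min(sd_sqrt), 2))

# Box-Cox estimates the power lambda by maximum likelihood on the pooled data.
y = np.concatenate(groups)
_, lam = stats.boxcox(y)
print("Box-Cox lambda:", round(lam, 2))
```

In practice you would estimate $\lambda$ from the residuals of the fitted model rather than the pooled response, but the stabilizing effect of the square root on the group SDs is already visible in the ratios printed above.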

AlaskaRon
  • I meant that logistic, Poisson, and negative binomial don't apply. Of course you can build a GLM from just about any exponential family; it's just that other commenters were bringing up integer-valued response models, which don't apply to the OP's data. – AlaskaRon Nov 02 '15 at 21:05