Testing multivariate data for normality (in R)

Question

I have data that I want to run an ANOVA on, but I need to test it for normality first. Do I test the whole dataset for normality, or each subset of the data associated with a unique group level? For example, I have fish counts for a species at different sites and during different seasons and years. Would I test the entire dataset for normality, or would I test the fish counts associated with each season individually? The whole dataset has 47 sites, 2 seasons (dry/wet), and 15 years.

> head(df)
  site season year  species_name num
1    1    dry 2019 Sailfin molly  11
2    2    dry 2019 Sailfin molly   7
3    3    dry 2019 Sailfin molly   9
4    4    dry 2019 Sailfin molly   7
5    5    dry 2019 Sailfin molly  12
6    1    wet 2019 Sailfin molly   0

does this answer your question? https://stats.stackexchange.com/questions/6350/anova-assumption-normality-normal-distribution-of-residuals — rep_ho, Nov 05 '21 at 20:00
You don't formally test any of it. However, it is good practice to *check for important violations* of your assumptions. For that purpose it can be more useful to check each group separately, *provided the groups have enough data to permit such checking.* Otherwise, you have little recourse other than combining the ANOVA *residuals* to examine them as a group. That brings out an important point: ANOVA *never* assumes the data are Normal. The strongest meaningful assumption is that the *within-group variation* is Normal (and of constant scale across the groups). — whuber, Nov 05 '21 at 20:09
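A minimal sketch of the kind of check described in the comment above, assuming simulated data with the same columns as `df` (the simulated counts, the seed, and the model formula are illustrative assumptions, not the asker's data or model). It fits the ANOVA and then looks at the residuals, both pooled and by season, rather than at the raw counts.

```r
# Simulated stand-in for the asker's data (hypothetical values, not real counts):
# 47 sites x 2 seasons x 15 years
set.seed(1)
sim <- expand.grid(site = 1:47, season = c("dry", "wet"), year = 2005:2019)
sim$num <- rpois(nrow(sim), lambda = ifelse(sim$season == "dry", 9, 4))

# Fit the ANOVA; the normality assumption concerns the residuals, not num itself
fit <- aov(num ~ season + factor(year), data = sim)
res <- residuals(fit)

# Pooled check: normal Q-Q plot of all residuals together
qqnorm(res)
qqline(res)

# Per-group check: residual shape and spread within each season
boxplot(res ~ sim$season, xlab = "season", ylab = "residual")
group_sd <- tapply(res, sim$season, sd)
print(group_sd)  # roughly equal values are consistent with constant scale
```

The plots are meant for eyeballing important violations, in line with the comment's advice not to rely on a formal test.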
Ok, that's interesting. So count data for dry and wet season (on their own) should be normally distributed, not count data of both seasons combined? We only care about the whole dataset (season data combined) to test for violations of homoscedasticity? — Nate, Nov 05 '21 at 20:16
So, I would have to do a Kolmogorov-Smirnov (K-S) normality test or Shapiro-Wilk’s test on the dry and wet season count data separately? — Nate, Nov 05 '21 at 20:19
The count data are not going to be normally distributed anyway, since the normal distribution covers all real numbers, not only integers, and it is not bounded below by 0. Sometimes this is a problem, sometimes it is not. Maybe a better question for you would be how to analyze your data, and not just how to test for normality. — rep_ho, Nov 05 '21 at 21:14
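One common way to analyze counts directly, offered here as an assumption rather than something the thread recommends, is a Poisson generalized linear model via base R's `glm`. The data below are simulated and the formula is hypothetical; strongly overdispersed counts would call for a different family.

```r
# Simulated counts with the same layout as df (hypothetical values)
set.seed(2)
sim <- expand.grid(site = 1:47, season = c("dry", "wet"), year = 2005:2019)
sim$num <- rpois(nrow(sim), lambda = ifelse(sim$season == "dry", 9, 4))

# Poisson GLM with a log link: models non-negative integer counts directly
fit_pois <- glm(num ~ season + factor(year), family = poisson, data = sim)
print(summary(fit_pois))

# Rough overdispersion check: Pearson chi-square divided by residual df.
# Values far above 1 suggest the Poisson assumption is too restrictive.
dispersion <- sum(residuals(fit_pois, type = "pearson")^2) / df.residual(fit_pois)
print(dispersion)
```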
Ya, I figured it wouldn't be. It's a bad example. I guess the real root of my question is: is a dataset LIKE this, with multiple factors, one dataset to test for normality, or "many", where each level needs to be tested for normality? — Nate, Nov 06 '21 at 13:03
