
I am working with an ANOVA model. I want to run a fixed-effects ANOVA with a ratio dependent variable and three independent variables, each with two or three levels. Obviously, before analyzing the results, I want to check the assumptions of factorial ANOVA. Reviewing some handbooks, I found that their explanations of the assumptions of ANOVA diverge. Moreover, I have doubts about the underlying assumptions of factorial ANOVA myself. The major ones are:

  1. All of the handbooks that I checked point out that the dependent variable in ANOVA models should be at least an interval variable. I work with a count variable, in fact converted to a ratio variable (i.e. a percentage). So, is ANOVA appropriate in this case?

  2. All of the handbooks stress the importance of checking the assumptions of the ANOVA model for inference: mainly a) independence, b) normality, and c) homogeneity of variances. However, they examine these aspects in different ways. Some of them check the data, i.e. the independence of cases, the normality of each group, and the homogeneity of variances between groups. But others examine only the residuals (errors) derived from the analysis (i.e. the independence, normality, and homoscedasticity of the residuals).

So, I am confused not only about the appropriateness of my approach, but also about which assumptions I should review. What does the ANOVA model require: parametric assumptions for the variables, only for the residuals, or both? References are welcome.

Scortchi - Reinstate Monica
Holaquetal
  • [Simply no.](http://bodowinter.com/courses/papers/jaeger_2008.pdf) The other question is then obsolete. – Henrik Mar 07 '14 at 10:03
  • Thank you for this paper. It is very interesting! But I still have doubts... Lots of articles in psychology use proportional data (like %) as dependent variables in ANOVA after transforming the data. Typical transformations are the square root or arcsine transformation. The new variable adopts a normal distribution, so the problem of the bounded range of estimation (0-100) is solved too. So I still wonder: what is wrong with this, or why is it correct? Moreover, I am interested in understanding which assumptions are the important ones in ANOVA, in any case. – Holaquetal Mar 07 '14 at 11:03
  • Back in the day I used to transform proportions $y$ to $\operatorname{asin} \sqrt y$ - but I have an electronic computer now. – Scortchi - Reinstate Monica Mar 07 '14 at 11:13
  • Your second question is answered [here](http://stats.stackexchange.com/questions/6350/anova-assumption-normality-normal-distribution-of-residuals), & [here](http://stats.stackexchange.com/questions/45671/normality-of-residuals-vs-sample-data-what-about-t-tests), & probably a few other places. – Scortchi - Reinstate Monica Mar 07 '14 at 11:24
  • @Scortchi In contrast to a mechanic computer you had beforehand? – Henrik Mar 07 '14 at 12:22
  • @Holaquetal The paper shows why transforming is wrong. Although papers still do it, it is wrong. You should do better. And I linked you the paper that tells you how. – Henrik Mar 07 '14 at 12:23
  • @Henrik: I'm not old enough to remember [mechanical computers](http://en.wikipedia.org/wiki/Mechanical_computer) (unless slide rules count), but old enough to remember "electronic" or "digital" commonly being prefixed to "computer"; & enough to remember transformations being used because they allowed the analysis to be done by hand-calculation. – Scortchi - Reinstate Monica Mar 07 '14 at 12:55

1 Answer


It should be clear that a counted fraction $X$ out of $n$ can't follow a normal distribution however it's expressed: it has a discrete probability mass function whereas the normal has a continuous density. Nevertheless, the distribution of the proportion $P=\frac{X}{n}$ will approximate that of the normal more closely as the sample size $n$ increases. A bigger problem for the general linear model, of which ANOVA is an instance, is heteroskedasticity: the bounds on proportions imply that the variance of $P$ varies with its mean. The main motivation for the angular transformation you mention is to stabilize the variance. If the function $f(P)$ is approximated by a first-order Taylor series around $\pi$, the mean of $P$,

$$\newcommand{\d}{\mathrm{d}}\newcommand{\var}{\operatorname{Var}}f(P) \approx f(\pi) + (P-\pi)\frac{\d f(\pi)}{\d \pi}$$

then its variance, assuming a binomial distribution for $X$, is given by

$$\var f(P) \approx \var\left[ f(\pi) + (P-\pi)\frac{\d f(\pi)}{\d \pi}\right] \\ \approx \left(\frac{\d f(\pi)}{\d \pi}\right)^2 \var(P) \\ \approx \left(\frac{\d f(\pi)}{\d \pi}\right)^2 \frac{\pi(1-\pi)}{n}$$

For this function to achieve approximately constant variance you require $$ \left(\frac{\d f(\pi)}{\d \pi}\right)^2\propto \frac{1}{\pi(1-\pi)} $$ & thus $$ f(\pi) \propto \int{ \sqrt{\frac{n}{\pi (1-\pi)}}\, \d \pi} \\ \propto\operatorname{asin}\sqrt{\pi} $$
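That the angular transformation does stabilize the variance is easy to check by simulation. A quick sketch (the trial size, true proportions, and replication count are all made up for illustration): the variance of $P$ itself moves with $\pi$, while the variance of $\operatorname{asin}\sqrt{P}$ stays close to the constant $\frac{1}{4n}$ implied by the delta-method calculation above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50          # binomial trials per observation (assumed value)
reps = 100_000  # simulated observations per true proportion

for pi in [0.1, 0.3, 0.5, 0.7, 0.9]:
    p = rng.binomial(n, pi, size=reps) / n       # sample proportions P
    v_raw = p.var()                              # ~ pi(1 - pi)/n: depends on pi
    v_asin = np.arcsin(np.sqrt(p)).var()         # ~ 1/(4n): roughly constant
    print(f"pi={pi:.1f}  var(P)={v_raw:.5f}  var(asin sqrt P)={v_asin:.5f}")
```

With $n=50$ the stabilized variance hovers near $1/(4 \times 50) = 0.005$ across the whole range of $\pi$, while the raw variance roughly halves as $\pi$ moves from $0.5$ toward either boundary.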

When the machines turn against us this will again be invaluable knowledge; until that time comes follow @Henrik's advice & model discrete data as what they are.
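To make "model discrete data as what they are" concrete, here is a minimal sketch of working on the log-odds scale with the counts themselves, which is the binomial-GLM view of the problem. The counts are hypothetical, and it only compares two groups with a Wald test via the delta-method standard error of a logit; a real design with three crossed factors would call for a full logistic (or mixed logit) regression rather than this hand comparison.

```python
from math import log, sqrt

def logit_summary(successes, trials):
    """Log-odds of a group proportion and its delta-method standard error."""
    p = successes / trials
    se = sqrt(1 / successes + 1 / (trials - successes))
    return log(p / (1 - p)), se

# Hypothetical counts: 30/80 successes in group A, 52/80 in group B.
la, se_a = logit_summary(30, 80)
lb, se_b = logit_summary(52, 80)

diff = lb - la                       # log odds ratio between groups
se_diff = sqrt(se_a**2 + se_b**2)    # Wald standard error of the difference
z = diff / se_diff
print(f"log odds ratio = {diff:.3f}, z = {z:.2f}")
```

The point of staying on this scale is that the bounded, heteroskedastic proportion never has to masquerade as a normal response: the binomial variance is built into the standard errors.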

Scortchi - Reinstate Monica