
ANOVA vs multiple linear regression?

I understand that both of these methods seem to use the same statistical model. However, under what circumstances should I use which method?

What are the advantages and disadvantages of these methods when compared?

Why is ANOVA so commonly used in experimental studies, while I hardly ever come across a regression study?

florian
  • Since both use the same model, it doesn't matter which you use. – Peter Flom Jan 16 '16 at 16:15
  • I call it regression when I am comparing slopes, i.e. continuous predictor variables, and ANOVA when I am comparing means, i.e. categorical predictor variables. The reason you find ANOVA more often in experimental studies is that they are mostly comparing means, or levels of treatments, e.g. comparing different fertilizers on plant growth. But as @PeterFlom already said, both use the same model and it doesn't matter which one you use; the only thing that looks different is the output they give you, and depending on your question you want either the "regression" output or the "ANOVA" output. – Stefan Jan 16 '16 at 16:31
  • Hmm but you could also include categorical predictors in a regression via dummy coding? – florian Jan 16 '16 at 16:34
  • Yes, of course! – Stefan Jan 16 '16 at 16:35
  • Hmm, I still don't understand. Following your interpretation with the fertilizers, what would be the interpretation of an ANOVA, and what knowledge could you derive from a regression? – florian Jan 16 '16 at 18:52
  • Your question is very valid, and has been addressed a number of times from different perspectives on CV. The duplicate nature of these tests is puzzling. It's easy to say ANOVA = linear regression, and I do think that all the comments made so far are helpful and on point, but the reality is a bit more nuanced and difficult to understand, especially if you include ANCOVA under the umbrella of analysis of variance. Check other entries, such as [this one](http://stats.stackexchange.com/a/76292/67822). I'm +1'ing your question, although it is, strictly speaking, a duplicate. Can you give an example? – Antoni Parellada Jan 16 '16 at 19:21
  • Thank you for your response. I already found the other post before posting mine; however, it deals with only one independent variable. I am helping a friend with her thesis: she has two experimental groups, the dependent variable is a metric value, and there are around 10 independent variables which are either dichotomized (as dummies) or metric. – florian Jan 17 '16 at 19:42
  • ANOVA and multiple linear regression are trying to answer different questions. – Mike Jan 18 '16 at 02:08
  • Here's a good read on when to use ANOVA v. regression: Cottingham et al 2005 (free download: http://byrneslab.net/classes/biol607/readings/cottingham_et_al_2005_frontiers_all.pdf) – 757nigel Apr 25 '17 at 16:56
  • Please elaborate, as surely you don't mean the entire post is not true. I have found the facts that are listed to be true, although I might not necessarily agree with all of the opinions. – user171973 Aug 01 '17 at 16:20
  • I would just like to add that if your dummy variables are consecutive integers, then the difference in means is equivalent to the slope. – abalter Apr 02 '20 at 04:49
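
A minimal sketch of the point in the last comment, using simulated data (the names y and g are purely illustrative): with a two-level factor coded 0/1, the regression slope is exactly the difference in group means.

set.seed(1)
y <- c(rnorm(20, mean = 5), rnorm(20, mean = 7))    # two simulated groups
g <- rep(c(0, 1), each = 20)                        # 0/1 dummy coding

coef(lm(y ~ g))["g"]                  # slope of the dummy...
mean(y[g == 1]) - mean(y[g == 0])     # ...equals the difference in group means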

4 Answers


It helps to appreciate that the divergence between the two techniques lies in the types of variables involved, and most notably in the types of explanatory variables. In the typical ANOVA we have a categorical variable with different groups, and we attempt to determine whether the measurement of a continuous variable differs between groups. On the other hand, OLS tends to be perceived primarily as an attempt to assess the relationship between a continuous regressand or response variable and one or multiple regressors or explanatory variables. In this sense regression can be viewed as a different technique, lending itself to predicting values based on a regression line.

However, this difference does not survive the extension of ANOVA to the rest of the analysis-of-variance alphabet soup (ANCOVA, MANOVA, MANCOVA), or the inclusion of dummy-coded variables in OLS regression. I'm unclear about the specific historical landmarks, but it is as if both techniques had grown parallel adaptations to tackle increasingly complex models.

For example, the differences between ANCOVA and OLS with dummy-coded (categorical) variables (in both cases with interactions) are cosmetic at most. Please excuse my departure from the confines of your question's title, which concerns multiple linear regression.

In both cases the model is essentially identical, to the point that in R the lm function is used to carry out ANCOVA. However, the model can be presented differently with regard to the inclusion of an intercept corresponding to the first level (or group) of the factor (categorical) variable in the regression model.

In a balanced model (with $i$ equally sized groups, $n_1 = n_2 = \cdots = n_i$) and just one covariate (to simplify the matrix presentation), the model matrix in ANCOVA can be encountered as some variation of:

$$X=\begin{bmatrix} 1_{n_1} & 0 & 0 & x_{n_1} & 0 & 0\\ 0 & 1_{n_2} & 0 & 0 & x_{n_2} & 0\\ 0 & 0 & 1_{n_3} & 0 & 0 & x_{n_3} \end{bmatrix}$$

for $3$ groups of the factor variable, expressed as block matrices.

This corresponds to the linear model:

$$y = \alpha_i + \beta_1\, x_{n_1}+ \beta_2\,x_{n_2} \,+ \beta_3\,x_{n_3}\,+ \epsilon_i$$ with $\alpha_i$ equivalent to the different group means in an ANOVA model, while the different $\beta$'s are the slopes of the covariate for each one of the groups.

The presentation of the same model in the regression field, and specifically in R, considers an overall intercept, corresponding to one of the groups, and the model matrix could be presented as:

$$X=\begin{bmatrix} \color{red}\vdots & 0 & 0 &\color{red}\vdots & 0 &0 & 0\\ \color{red}{J_{3n,1}} & 1_{n_2} & 0 & \color{red}{x} & 0 & x_{n_2} & 0\\ \color{red}\vdots& 0 & 1_{n_3} & \color{red}\vdots & 0 & 0 & x_{n_3} \end{bmatrix}$$

of the OLS equation:

$$y =\color{red}{\beta_0} + \mu_i +\beta_1\, x_{n_1}+ \beta_2\,x_{n_2} + \beta_3\,x_{n_3}+ \epsilon_i.$$

In this model, the overall intercept $\beta_0$ is modified at each group level by $\mu_i$, and the groups also have different slopes.

As you can see from the model matrices, the presentation belies the actual identity between regression and analysis of variance.

I like to verify this with a few lines of code and my favorite data set, mtcars, in R. I am using lm for ANCOVA following Ben Bolker's paper available here.

mtcars$cyl <- as.factor(mtcars$cyl)         # Turn the cylinders variable into a factor with 3 levels
D <- mtcars                                 # The data set will be called D
D <- D[order(D$cyl, decreasing = FALSE),]   # Order observations so the block matrices are visible

model.matrix(lm(mpg ~ wt * cyl, D))         # The model matrix for the ANCOVA
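
To close the loop on the identity claim, a quick check (a sketch, not part of the original code above) is to push the same formula through both interfaces; aov is just a wrapper around lm, so the sequential ANOVA tables should match line for line:

fit.lm  <- lm(mpg ~ wt * cyl, D)     # the "regression" interface
fit.aov <- aov(mpg ~ wt * cyl, D)    # the "ANOVA" interface

anova(fit.lm)      # sequential (Type I) ANOVA table from the lm fit...
summary(fit.aov)   # ...identical to the aov summary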

As to the part of the question about which method to use (regression with R!), you may find amusing this online commentary I came across while writing this post.

Antoni Parellada
  • Thank you for this extremely helpful comment... Quoting from the commentary you linked: "Use regression when you aren't sure whether the independent categorical variables have any effect at all. Use ANOVA when you want to see whether particular categories have different effects." So how come many experimental studies use ANOVA then? From my understanding regression would be the right choice. Are researchers too convinced that the effects are there and only searching for ways to statistically "prove" them? – florian Jan 19 '16 at 10:56
  • Could you provide a practical example where one should use aov over regression and explain why? Thanks for your time. I am also a psychologist by training and fail to see the advantages of ANOVA, except that it probably gets published more easily. – florian Jan 19 '16 at 14:10
  • Any luck? I would be very interested in any more concrete heuristic to favor either type of procedure, so please share if you find an answer. – Antoni Parellada Jan 20 '16 at 21:38
  • Unfortunately no new discoveries on my journey into Statistics so far...will keep you posted, more input is appreciated. – florian Jan 25 '16 at 17:57
  • I am having difficulty understanding the OLS model matrix and the corresponding equation here. I don't understand where the zero column comes from (5th column of the matrix). Also, I think that the equation should correspond to the columns (i.e. mu_i should be only for two groups and the x-variable should be included without interaction with a group dummy). Additional clarification is much appreciated! – Nick Jan 17 '19 at 17:49
  • @Nick The model matrix is correct. Compare it with the output in R in the code in my response, and you will see the zero column reflected there. – Antoni Parellada Jan 17 '19 at 23:58
  • Getting back to this. First, thanks for your swift reply. However, I don't see a zero column in the R output. After all a zero column does not make sense. The rest I understand and seems correct. I would appreciate it if you could look at this again! BTW, what is the D option and why is it necessary? – Nick May 28 '19 at 18:23

ANOVA and OLS regression are mathematically identical in cases where your predictors are categorical (in terms of the inferences you are drawing from the test statistic). To put it another way, ANOVA is a special case of regression. There is nothing an ANOVA can tell you that regression cannot derive itself. The opposite, however, is not true: ANOVA cannot be used for analysis with continuous variables. As such, ANOVA could be classified as the more limited technique.

Regression, however, is not always as handy for the less sophisticated analyst. For example, most ANOVA scripts automatically generate interaction terms, whereas with regression you often must compute those terms manually in the software. The widespread use of ANOVA is partly a relic of statistical analysis before the advent of more powerful statistical software, and, in my opinion, ANOVA is an easier technique to teach to inexperienced students whose goal is a relatively surface-level understanding that will enable them to analyze data with a basic statistical package.

Try it out sometime: examine the t statistic that a basic regression spits out, square it, and then compare it to the F ratio from the ANOVA on the same data. Identical!
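
A minimal sketch of that check in R, with simulated two-group data (the names y and g are illustrative):

set.seed(42)
y <- c(rnorm(30, mean = 10), rnorm(30, mean = 12))   # two simulated groups
g <- factor(rep(c("A", "B"), each = 30))             # categorical predictor

t.val <- summary(lm(y ~ g))$coefficients["gB", "t value"]   # regression t statistic
F.val <- anova(lm(y ~ g))["g", "F value"]                   # ANOVA F ratio

c(t.squared = t.val^2, F = F.val)    # the two agree exactly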

  • This is not true. – Michael R. Chernick Aug 01 '17 at 01:20
  • @MichaelChernick Could you elaborate on which of the many assertions made in this answer you think are untrue? Although it takes some extreme positions, it's hard to find any that are false. – whuber Aug 07 '17 at 16:32
  • I objected to the statement that ANOVA and OLS regression are mathematically identical. I recognize that ANOVA can be looked at as a form of the general linear model that can be formulated like regression. – Michael R. Chernick Aug 07 '17 at 16:50
  • In the OLS case, how are they not identical other than the output? The underlying model is the same, the residuals are the same, the p-values they produce are the same. It is the output that differs. – dbwilson Mar 25 '19 at 14:26

The main benefit of ANOVA over regression, in my opinion, is in the output. If you are interested in the statistical significance of the categorical variable (factor) as a block, then ANOVA provides this test for you. With regression, the categorical variable is represented by 2 or more dummy variables, depending on the number of categories, and hence you have 2 or more statistical tests, each comparing the mean for a particular category against the mean of the null category (or the overall mean, depending on the dummy coding method). Neither of these may be of interest. Thus, you must perform a post-estimation analysis (essentially, ANOVA) to get the overall test of the factor that you are interested in.
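
In R, for instance, the two outputs look like this (a sketch with simulated data; the names y and f are illustrative): summary() gives the per-dummy tests against the reference category, while anova() gives the block test of the factor described above.

set.seed(7)
y <- rnorm(90)
f <- factor(rep(c("low", "mid", "high"), each = 30))   # a 3-level factor

summary(lm(y ~ f))   # two dummy coefficients, each tested against the reference level
anova(lm(y ~ f))     # a single block test of the factor as a whole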

dbwilson
  • Actually, this is not true. If you perform a likelihood ratio test, you are testing the whole categorical factor as a block in a regression model. – Dan Chaltiel Mar 20 '19 at 18:01
  • Your comment doesn't contradict what I said. The likelihood ratio test that you mention would be a post-estimation analysis on the factor, comparing the model with the factor to the model without. – dbwilson Mar 25 '19 at 11:00
  • If you perform an ANOVA, you will get a p-value for "the categorical variable (factor) as a block"; so does regression with a LRT. Regression may provide you with several betas but would not perform more tests than ANOVA, so your statement "hence you have 2 or more statistical tests" seems wrong to me. Why would the LRT be more "post-estimation" than ANOVA? – Dan Chaltiel Mar 25 '19 at 11:22

The major advantage of linear regression is that it is robust to violation of homogeneity of variance when sample sizes across groups are unequal. Another is that it facilitates the inclusion of several covariates (though this can also be easily accomplished through ANCOVA when you are interested in including just one covariate). Regression became widespread during the seventies with the advent of greater computing power. You may also find regression more convenient if you are particularly interested in differences between particular levels of a categorical variable when more than two levels are present (so long as you set up the dummy coding so that one of those two levels represents the reference group). This could save you the time of conducting post-hoc tests to compare the means between groups after running an ANOVA.
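
A sketch of that dummy-coding convenience in R (simulated data; the names y and dose are illustrative): releveling the factor chooses the reference group, so the regression output directly tests each remaining level against it.

set.seed(3)
y <- rnorm(90)
dose <- factor(rep(c("low", "mid", "high"), each = 30))

dose <- relevel(dose, ref = "mid")    # make "mid" the reference group
coef(summary(lm(y ~ dose)))           # each row compares one level to "mid"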

David B