When doing a multiple regression and testing for homoscedasticity, some people look at the raw observations and others at the residuals. Which is correct?
Do you use raw data or residuals to test linearity?
Do you test the homoscedasticity for each IV against the DV or do you put all IVs in at the same time and then test for homoscedasticity?
When do you test the assumptions: before running the analysis, after, or both?
What order do you do these things? Do you do any twice?
- test for linearity
- test for normal distribution
- test for equal variances
- run the multiple linear regression

(4) and (5) are answered at http://stats.stackexchange.com/questions/32600/in-what-order-should-you-do-linear-regression-diagnostics. (1) is answered in a very large number of comments (and a few answers); consider this a FAQ. – whuber Dec 11 '12 at 22:27
2 Answers
The answers mostly derive from considering the question 'what is actually being assumed?'.
Do you know the actual assumptions?
(Note that the distributional assumptions are conditional, not marginal.)
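For reference (not part of the original answer, just the standard way of writing the model), the assumptions are about the conditional distribution of the response given the predictors:

$$y_i \mid x_{i1},\dots,x_{ip} \;\sim\; N\!\left(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip},\ \sigma^2\right), \qquad i = 1,\dots,n \ \text{(independently)},$$

so linearity, constant variance and normality all refer to the errors $\varepsilon_i = y_i - (\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip})$, not to the marginal distribution of $y$ or of the predictors.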
1 When doing a multiple regression and testing for homoscedasticity some people look at raw observations and others the residuals. Which is correct?
What's the actual assumption here?
2 Do you use raw data or residuals to test linearity?
Which shows deviations from the model assumptions best?
3 Do you test the homoscedasticity for each IV against the DV or do you put all IVs in at the same time and then test for homoscedasticity?
See (1)
4 When do you test the assumptions: before running the analysis, after, or both?
What exactly do you mean by 'running the analysis' here?
(If you use residuals, how would you do it before doing the calculations?)
If you mean 'before/after doing the formal inference based on the model fit', I'd normally say 'notionally before', but in what actual way would the order make a difference?
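As a minimal sketch of that point (Python with numpy and statsmodels, made-up data; nothing here is from the original post): the residuals only exist once the model has been fit, so any residual-based check necessarily comes after the fitting calculations.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # three IVs (illustrative only)
y = 1 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=100)

fit = sm.OLS(y, sm.add_constant(X)).fit()     # the fitting calculations
resid = fit.resid                             # only available after the fit
fitted = fit.fittedvalues                     # used in the usual diagnostic plots
```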
5 What order do you do these things?
This question is confusing. The last part:
test for linearity; test for normal distribution; test for equal variances; run the multiple linear regression.
should have been right after the word 'things', like so:
5 What order do you do these things (check for linearity; check for normal distribution; check for equal variances; run the multiple linear regression)?
Again, if you use residuals for anything, how would you check (NB check, not test) those assumptions before calculating the residuals?
You can't check the assumption relating to conditional variance if linearity doesn't hold.
You can't check the assumption relating to normality if homoscedasticity doesn't hold.
Linearity is the basic assumption ('is my model for the mean appropriate?').
Variance is the next most important, and can't be checked until linearity is at least approximately satisfied.
Normality is the least important (as long as sample sizes aren't small, and unless you're producing prediction intervals, in which case it matters even at large sample sizes), and can't be checked unless the data are at least approximately homoscedastic.
Do you do any twice?
Only where it would make a difference to do so.
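Putting that ordering into a plot-based sketch (a self-contained Python illustration with made-up data; these are the usual graphical checks, not formal tests, and none of this code comes from the answer itself):

```python
import numpy as np
import scipy.stats as stats
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 1 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=200)
fit = sm.OLS(y, sm.add_constant(X)).fit()
resid, fitted = fit.resid, fit.fittedvalues

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

# 1) Linearity first: residuals vs fitted; look for systematic curvature.
axes[0].scatter(fitted, resid)
axes[0].axhline(0, color="grey")
axes[0].set_title("Residuals vs fitted")

# 2) Then equal variance: does the spread change with the fitted values?
axes[1].scatter(fitted, np.sqrt(np.abs(resid)))
axes[1].set_title("Scale-location")

# 3) Normality last: Q-Q plot of the residuals.
stats.probplot(resid, dist="norm", plot=axes[2])

plt.tight_layout()
plt.show()
```

If the first panel shows curvature, fix the model for the mean before reading anything into the second or third panel.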

Understanding the science behind your data is more important than the tests on the conditions. The tests on conditions include their own assumptions; are you going to test those as well? You could end up in an infinite loop if you always test the underlying assumptions. Further, many of the tests assume that the data being tested are iid, but residuals are not iid (and raw data are only iid if there is no relationship). Tools like standardizing and studentizing bring the residuals closer to being iid (if all the assumptions hold), but they will still not be exactly iid.
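As a sketch of the standardizing/studentizing step mentioned above (plain numpy, standard formulas, with made-up data of my own):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])   # intercept + 3 IVs
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)        # hat matrix
h = np.diag(H)                               # leverages
e = y - H @ y                                # raw residuals: unequal variances, correlated
s2 = e @ e / (n - X.shape[1])                # estimate of the error variance

# Internally studentized (standardized) residuals: e_i / sqrt(s2 * (1 - h_ii)).
# Each has roughly unit variance if the model holds, but they remain slightly
# correlated with one another: closer to iid than e, not exactly iid.
r = e / np.sqrt(s2 * (1 - h))
```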
Many of the assumptions/conditions for statistical procedures (especially the normal-theory ones) are robust under certain conditions: a moderate departure from the condition will not cause a very big change in the inference. It turns out that, in some cases (maybe all), the situations where the tests are robust are exactly the ones where you have enough power to detect departures from the assumptions that do not really matter. Conversely, when the test is not robust (and it is therefore important to know whether the assumptions hold), the tests on assumptions/conditions often don't have enough power to detect departures that would be important.
What is much more important is to understand the underlying science that produced your data. Are you comfortable with the assumptions, or do you believe they are likely to be violated? If so, by how much, and is your procedure robust to that level of violation? (Simulations can help with that last part.) Still look at residual plots; just rely on the science more than on exact cutoffs for p-values from tests on the conditions.
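For the simulation point at the end, here is the sort of thing that might be meant (entirely my own sketch, with arbitrary numbers): generate data with a known, deliberate violation (here, errors whose spread grows with x) and see how badly the usual 95% confidence interval for the slope suffers.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, reps, beta1 = 50, 2000, 2.0
covered = 0

for _ in range(reps):
    x = rng.uniform(0, 1, size=n)
    # heteroscedastic errors: the SD grows with x (a deliberate violation)
    y = 1.0 + beta1 * x + rng.normal(scale=0.5 + 1.5 * x, size=n)

    X = np.column_stack([np.ones(n), x])
    bhat = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ bhat
    s2 = resid @ resid / (n - 2)
    se_b1 = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
    tcrit = stats.t.ppf(0.975, df=n - 2)
    covered += (bhat[1] - tcrit * se_b1 <= beta1 <= bhat[1] + tcrit * se_b1)

print("empirical coverage of the nominal 95% CI:", covered / reps)
```

Comparing the printed coverage to 0.95 shows how much (or how little) this particular level of violation matters for the interval; changing the scale term lets you explore how much departure your inference can tolerate.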
