
I have an ordinal variable (a stress scale) treated as a continuous predictor in a multiple linear regression. I would like to verify the linearity assumption, and this is my residuals vs. fitted values plot:

[figure: residuals vs. fitted values plot]

and the residuals vs. stress plot:

[figure: residuals vs. stress plot]

Is there a pattern? I don't know if it's problematic...


1 Answer


The qualitative gestalt of your plots doesn't reveal any patterns. Take any of the vertical strips of points on the second plot, mentally turn it on its side, and un-stack the overlapping points that produce the darker shades of blue so that they no longer overlap. What do you see? A Gaussian distribution, right? You can prove this to yourself by plotting histograms.
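If it helps, here is a quick way to do that check in R. The object names `fit` (your fitted model), `dat` (your data frame), and `stress` (the predictor) are placeholders for whatever you called them in your own analysis:

```r
# One histogram of residuals per stress level (all names are hypothetical)
lev <- sort(unique(dat$stress))
par(mfrow = c(ceiling(length(lev) / 3), 3))   # arrange the histograms in a grid
for (s in lev) {
  hist(resid(fit)[dat$stress == s],
       main = paste("stress =", s), xlab = "Residual")
}
```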

This is more apparent in the more populated stress levels, simply because most subjects are not at the extremes, but the distribution of the residuals looks well centered from low stress levels to high, spreading equally above and below the purple line.

Remember that the Gauss-Markov theorem, under which OLS is the BLUE, calls for errors with zero mean and equal variance.

I would recommend this and this. I also recall that Glen_b has an awesome post on the topic somewhere on this site; I'll look for it and include it... OK, I may have confused it with another great answer he gave on QQ plots, but you can check this one.


There was a follow-up question:

If my regressor "stress" had not been linearly associated with my outcome, would I have seen it with my residuals vs. fitted values plot?

So I did what I usually do... go back to the drawing board... RStudio, that is.

I couldn't come up with anything too exciting, but still... Here we have a synthetic dataset where the dependent variable $y$ is, by design, linearly related to both $x$ and $z$, except that the independent variable $z$ was squared before generating $y$. The actual equation was $y = 5 + 75\,x + 5\,z + 50\,z^2$. And yes, I know that such a model is still linear despite the polynomial term, because it is linear in the parameters: the coefficients are constants, not functions of the variables. I hope it's OK to proceed with it for illustrative purposes.
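For concreteness, here is a minimal R sketch of how such a dataset could be generated. The sample size, the predictor distributions, and the noise standard deviation are my assumptions; only the generating equation comes from the text above.

```r
set.seed(1)                                   # reproducibility (assumed seed)
n   <- 500                                    # sample size (assumed)
x   <- rnorm(n)                               # first predictor
z   <- rnorm(n)                               # second predictor
eps <- rnorm(n, sd = 25)                      # Gaussian noise (sd is an assumption)
y   <- 5 + 75 * x + 5 * z + 50 * z^2 + eps    # the stated generating equation
dat <- data.frame(y = y, x = x, z = z)
```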

This is what the data look like before the regression:

[figure: the synthetic data before the regression]

Initially I fitted the model $\hat y =\hat \beta_0 + \hat \beta_1\times x + \hat\beta_2 \times z$. And these are some of the diagnostic plots:

[figure: residuals vs. fitted values, residuals vs. $x$, and residuals vs. $z$ for the first model]

On the overall residuals vs. fitted plot on the left, the residuals are centered at zero, but their spread tapers off to the right, suggesting heteroscedasticity. The middle plot of these residuals against the $x$ variable adds little information, but the final plot of residuals versus $z$ contains a wealth of information and strongly suggests a polynomial term in $z$.
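For readers who want to reproduce this, here is a sketch of that first fit and the three residual plots, using the hypothetical names from the data-generation sketch above (the plotting layout is my choice):

```r
fit1 <- lm(y ~ x + z, data = dat)             # misspecified: omits the z^2 term

par(mfrow = c(1, 3))
plot(fitted(fit1), resid(fit1),
     xlab = "Fitted values", ylab = "Residuals", main = "Residuals vs. fitted")
abline(h = 0, lty = 2)
plot(dat$x, resid(fit1),
     xlab = "x", ylab = "Residuals", main = "Residuals vs. x")
abline(h = 0, lty = 2)
plot(dat$z, resid(fit1),                      # the curvature shows up here
     xlab = "z", ylab = "Residuals", main = "Residuals vs. z")
abline(h = 0, lty = 2)
```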

So I ran the polynomial model $\hat y =\hat \beta_0 + \hat \beta_1\times x + \hat\beta_2 \times z + \hat \beta_3 \times z^2$, yielding much better diagnostic plots:

[figure: the same three diagnostic plots for the polynomial model]
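In R, the polynomial fit amounts to adding the squared term with `I()`; a minimal sketch, again using the assumed names from above:

```r
fit2 <- lm(y ~ x + z + I(z^2), data = dat)    # correctly specified model
summary(fit2)                                  # estimates should land near 5, 75, 5, 50

par(mfrow = c(1, 3))                           # same three residual plots as before
plot(fitted(fit2), resid(fit2), xlab = "Fitted values", ylab = "Residuals")
plot(dat$x, resid(fit2), xlab = "x", ylab = "Residuals")
plot(dat$z, resid(fit2), xlab = "z", ylab = "Residuals")   # curvature should be gone
```

With the quadratic term in place, the residual plots should be flat and evenly spread, mirroring the improvement described above.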

Does this answer your question? I don't know, but it certainly illustrates the value of plotting the residuals against the different variables in multiple regression.

  • Thank you Antoni for your answer! So the residuals vs. fitted values plot is used for homoscedasticity and not for linearity? – Emmanuel W May 08 '16 at 19:58
  • @Emmanuel W Sorry! I completely missed the question in your comment. My take is that your statement in the prior comment is unnecessarily restrictive. You can draw conclusions regarding the overall fit of an OLS approach based on the analysis of residuals vs. fitted values. I believe that in your particular question you were interested in just one of the regressors (stress), not the overall model. – Antoni Parellada May 08 '16 at 20:31
  • I don't want to try your patience but your answers are very helpful for me. If my regressor "stress" had not been linearly associated with my outcome, would I have seen it with my residuals vs. fitted values plot? Or maybe overall fit is possible without a linear association for each predictor? – Emmanuel W May 08 '16 at 20:47
  • @EmmanuelW OK. I gave it a shot. See what you think. If you have follow-up questions, please go ahead. Otherwise, please consider accepting the answer. – Antoni Parellada May 09 '16 at 00:40
  • Your example is very clear but it generates new questions: 1) Using a polynomial regression seems to correct heteroscedasticity for x and linearity for z. Could the pattern of the first purple plot (heteroscedasticity) suggest a problem of linearity for one of the variables, or are these two separate problems? 2) Could you explain your phrase "the model is linear even with a polynomial relationship, as long as the coefficients don't contain the variables as some function"? (maybe a translation problem for me). – Emmanuel W May 09 '16 at 10:24
  • @EmmanuelW Again, good questions. Take a look at the "raw" data I just posted, and notice how, when the x-axis is at, say, $-3$, trying to predict $y$ is going to be very difficult, because the linear plot is telling OLS to predict a negative value, while the quadratic curve of $z$ is saying exactly the opposite. On the other hand, when you move to $+3$ they both agree. That's why you see the heteroscedasticity on the plots below. Fixing the quadratic problem in $z$ resolves everything. – Antoni Parellada May 09 '16 at 11:39
  • As for your second question, you can check [this answer](http://stats.stackexchange.com/a/92087/67822). – Antoni Parellada May 09 '16 at 11:45
  • Thank you Antoni. So, in the end, heteroscedasticity is linked to linearity... but I suppose that's not always the case. Can we have linear associations between x and y as well as between z and y, but heteroscedasticity in the overall fit? – Emmanuel W May 09 '16 at 13:48
  • @EmmanuelW Your questions are very good, but I don't feel too comfortable with short dichotomous conclusions. They make me think, and they are wonderful... but the exceptions are always around the corner... – Antoni Parellada May 09 '16 at 14:07
  • @EmmanuelW Kindly, would you agree that we have squeezed some good ideas out of your question? I think it is time to wrap this one up, and I would appreciate it if you clicked on "accept". If you get better answers in the future you can always reverse it. – Antoni Parellada May 09 '16 at 14:29