1) A plot of residuals against $y$ will always show a positive correlation (unless the predictor is perfectly useless): since $y = \hat y + e$ and the residuals are uncorrelated with the fitted values, $\operatorname{Cov}(e, y) = \operatorname{Var}(e) > 0$. See here and here.
2) On the other hand, the appearance of downsloping lines in your last plot is a direct result of the fact that your $y$ is discrete - all the points sharing a given value of $y$ trace out their own line in that kind of plot (see the simulation sketch just after this list).
3) The fact that $y$ is discrete isn't necessarily a concern. It seems to me there is a related concern, though - years of education is effectively bounded, and that means a linear fit may be biased as you approach the bounds. There's a slight indication of that at the low end of your fit. If that small bias is enough for you to worry about, you'll need to do something about it. Boundedness also has a tendency to cause heteroskedasticity, but there's no strong indication of that here.
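To make points 1) and 2) concrete, here is a minimal simulation sketch in Python. The data are entirely made up as a stand-in for yours (the floor of 12, the coefficients and the noise level are all assumptions): the residuals are necessarily correlated with the observed $y$, and every distinct value of a discrete $y$ traces its own downsloping line when residuals are plotted against fitted values.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
# Discrete outcome loosely mimicking years of education, with a floor at 12
y = np.clip(np.round(14 + 1.5 * x + rng.normal(scale=2, size=n)), 12, 20)

# Ordinary least squares by hand
X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
fitted = X @ beta
resid = y - fitted

# 1) residuals correlate with y, because cov(e, y) = cov(e, yhat + e) = var(e) > 0
print("corr(resid, y)      =", np.corrcoef(resid, y)[0, 1])
print("corr(resid, fitted) =", np.corrcoef(resid, fitted)[0, 1])  # ~0 by construction

# 2) within each distinct value of y, resid = y - fitted is an exact line of
#    slope -1 in the fitted values -- those are the downsloping stripes
for val in np.unique(y)[:3]:
    mask = y == val
    slope = np.polyfit(fitted[mask], resid[mask], 1)[0]
    print(f"y = {val:.0f}: slope of resid vs fitted = {slope:.2f}")
```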
The bias is actually very small, so it may not be much of a problem:

The effective lower bound of 12 years of education produces this small bias: the linear model plows straight through the bound, while (as the loess curve shows) the local relationship flattens out as it approaches it.
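For a rough sense of how that plays out, here is a sketch along the same lines (again simulated stand-in data, not yours) comparing a straight-line fit with a loess-type smooth near a floor of 12:

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)
y = np.clip(np.round(14 + 1.5 * x + rng.normal(scale=2, size=n)), 12, 20)

# Straight line: keeps dropping right through the floor at the low end
slope, intercept = np.polyfit(x, y, 1)

# Loess-style smooth: flattens out as it approaches the floor
smooth = lowess(y, x, frac=0.3)  # returns (x, smoothed y) pairs sorted by x

# Compare the two fits where the floor bites hardest
for q in (0.01, 0.05, 0.50):
    xq = np.quantile(x, q)
    lin = intercept + slope * xq
    loe = np.interp(xq, smooth[:, 0], smooth[:, 1])
    print(f"x at {q:>4.0%} quantile: linear fit {lin:.2f}, loess {loe:.2f}")
```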
Your table illustrates something else you should expect to see - regression to the mean. The table gives you a misleading impression of the quality of your model because it confounds two different effects: the small boundary bias, which you may want to address, and regression to the mean, which you almost surely don't want to avoid.
Consider this linear model (which has no bias from any boundary):

Looking at rows and columns of your table is like taking horizontal and vertical slices in the plot. The regression should give you an approximately unbiased relationship in slices of the fitted values (the red $y=x$ line is near the middle of the green slice), but it won't (and should not!) give you an unbiased fit the other way (the blue slice). If you look at the blue slice, all the points are to the left of the line.
If you sliced my points up into squares like yours (sliced in the blue and green directions), my table would look a lot like yours. That's not a problem with the model - my model was the one that generated the data.
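Here is a small sketch of that slicing argument, using simulated data from a model that is genuinely linear (no boundary at all). Conditioning on a narrow band of fitted values is roughly unbiased, while conditioning on the same narrow band of observed $y$ pulls the fitted values back toward their overall mean - regression to the mean. The coefficients and the band are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20000
x = rng.normal(size=n)
y = 2 + x + rng.normal(size=n)   # the model that truly generates the data

X = np.column_stack([np.ones(n), x])
fitted = X @ np.linalg.lstsq(X, y, rcond=None)[0]

# "Green" slice: cases whose fitted value lies in a narrow band
g = (fitted > 2.4) & (fitted < 2.6)
print("fitted band: mean fitted =", fitted[g].mean(), " mean y =", y[g].mean())
# mean y is close to the mean fitted value -- roughly unbiased

# "Blue" slice: cases whose observed y lies in the same band
b = (y > 2.4) & (y < 2.6)
print("y band:      mean y =", y[b].mean(), " mean fitted =", fitted[b].mean())
# mean fitted is pulled back toward the overall mean of about 2 -- regression to the mean
```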
As for how to deal with the small bias you do have from the lower boundary -
You might try, for example, a logistic curve scaled to run between 12 and 18 (or some larger upper bound, if you can find an effective upper limit that applies more broadly than your sample; 20 may do better than 18, for example, and external information may help you choose), or any of a variety of other nonlinear models that incorporate a lower bound (there are many such functions in common use).
So that would involve using nonlinear least squares.
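As a hedged sketch of what that could look like: the bounds of 12 and 18, the simulated data, and the starting values below are all illustrative assumptions, and any other lower-bounded curve could be swapped in.

```python
import numpy as np
from scipy.optimize import curve_fit

def scaled_logistic(x, a, b, lower=12.0, upper=18.0):
    """Logistic curve rescaled to run between `lower` and `upper`."""
    return lower + (upper - lower) / (1.0 + np.exp(-(a + b * x)))

# Made-up stand-in data with a floor at 12 and a ceiling at 18
rng = np.random.default_rng(3)
x = rng.normal(size=1000)
y = np.clip(np.round(14 + 1.5 * x + rng.normal(scale=2, size=1000)), 12, 18)

# Nonlinear least squares; p0 gives rough starting values for (a, b)
params, _ = curve_fit(scaled_logistic, x, y, p0=[0.0, 1.0])
print("fitted a, b:", params)
print("prediction at very low x:", scaled_logistic(-5.0, *params))  # approaches the floor of 12
```

The point is that the curve, rather than the fitted line, carries the floor: predictions can never drop below 12 no matter how far out you extrapolate.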
Another possibility is to fit some smooth function - indeed, that's exactly what you already have with your loess curve. What's wrong with using that fit? Spline curves could also work, but you would probably want natural splines with only a few knots, all near the left end (you may even want to force the fit to be horizontal at the extreme left, but if you do that you might as well go to the nonlinear regression I started with).
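And a sketch of the spline route, again on made-up data: a natural cubic spline basis (patsy's `cr()`, usable inside statsmodels formulas) gives the fit room to bend near the floor while staying smooth elsewhere. The `df=4`, the grid of evaluation points, and the data below are illustrative assumptions; in practice you would place the knots near the low end, as described above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
x = rng.normal(size=2000)
educ = np.clip(np.round(14 + 1.5 * x + rng.normal(scale=2, size=2000)), 12, 20)
dat = pd.DataFrame({"x": x, "educ": educ})

# Natural cubic regression spline basis via patsy's cr(); compare with a straight line
spline_fit = smf.ols("educ ~ cr(x, df=4)", data=dat).fit()
linear_fit = smf.ols("educ ~ x", data=dat).fit()

# Predictions at the low end of x, where the floor at 12 matters most
grid = pd.DataFrame({"x": [-3.0, -2.0, -1.0, 0.0]})
print(pd.DataFrame({"x": grid["x"],
                    "spline": spline_fit.predict(grid),
                    "linear": linear_fit.predict(grid)}))
```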