
I am trying to predict age from a couple of variables using linear regression, but when I plot predicted age against real age, I can see that small values are significantly overestimated and big values are underestimated. If I flip the axes, so my predicted values are on the X axis, the regression line is as straight as it can be.

set.seed(1)
n <- 50
age <- (1:n) + 10                          # true ages: 11 to 60
variable1 <- age + 100 + rnorm(n, 0, 45)   # four noisy measurements of age
variable2 <- age + 100 + rnorm(n, 0, 45)
variable3 <- age + 100 + rnorm(n, 0, 45)
variable4 <- age + 100 + rnorm(n, 0, 45)
data <- data.frame(age, variable1, variable2, variable3, variable4)

fit_all <- lm(age ~ ., data=data)          # regress age on all four variables
predictions_all <- predict(fit_all, data)  # in-sample predictions (the fitted values)

par(mfrow=c(1,2))
plot(predictions_all ~ age, ylab='Predicted age', xlim=c(5,65), ylim=c(5,65), main='underestimated/overestimated age')
abline(coef(lm(predictions_all ~ age)), col='red')
plot(age ~ predictions_all, xlab='Predicted age', xlim=c(5,65), ylim=c(5,65), main='X-Y axes flipped')
abline(coef(lm(age ~ predictions_all)), col='red')

[Figure: predicted age vs. real age (left) and the same plot with the X and Y axes flipped (right); the red lines are the fitted regression lines]

I think I understand what is going on in the case of one variable: an average 10-year-old sample will have a value of a variable5 around 117; however, an average sample with a variable5 of 117 will be around 20 years old, hence the bias.
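A quick simulation makes this one-variable intuition concrete (a sketch only: variable5 is hypothetical and is generated the same way as the variables above, so the exact numbers differ from the ones I guessed):

# Sketch: regression to the mean with a single variable.
# 'variable5' is hypothetical, built with the same DGP as variable1..variable4.
set.seed(1)
N <- 1e5
sim_age <- runif(N, 11, 60)                  # ages over the same range
variable5 <- sim_age + 100 + rnorm(N, 0, 45)

# Mean measurement among samples that are about 11 years old ...
mean(variable5[abs(sim_age - 11) < 1])       # roughly 111
# ... versus mean age among samples whose measurement is near 111:
mean(sim_age[abs(variable5 - 111) < 5])      # well above 11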

I still can't get my head around this situation in the case of multiple variables, or what to do about it. There is a similar question here: Why are the residuals in this model so linearly skewed?, where the answer is basically "don't worry about it and plot residuals instead"; however, that does not solve my problem of a systematic bias in my predictions, which is what I care about.


rep_ho
  • I've changed the numbers and figures to make the problem more visible. You can see that the predicted age for a 10-year-old is around 30 and for a 60-year-old it is around 40 – rep_ho Aug 16 '18 at 14:46
  • Seems like regression to the mean, which is expected behavior and happens for good reasons. The idea is that since age is a response in your model, for the response to be extreme you need both an extreme prediction *and* to be lucky. When you plug in the same predictors you ignore the luck part so you end up with predictions that are shrunk towards the mean. Again, this is the right thing to do and does not indicate bias. – guy Aug 16 '18 at 15:00
  • but shouldn't luck and bad luck average themselves out, so that predictions for 10-year-olds average 10 years, etc.? – rep_ho Aug 16 '18 at 15:20
  • Things won't average out in the way you are thinking. Think of it this way: you take an exam and get 100% after recording X. If you took another exam you wouldn't predict 100% even with the same X, because in addition to having good attributes you also probably got lucky. Or Bob has a kid who happens to be 7 feet tall; you don't predict a second kid to also be 7 feet, because 7-foot people are rare, even though they have the same father. It's the same type of phenomenon occurring here, but the intuition is a bit harder to see (a numeric check follows these comments). – guy Aug 16 '18 at 15:34
  • Your eye deceives you. For various explanations see https://www.google.com/search?q=regression+towards+the+mean. – whuber Aug 16 '18 at 18:08
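The shrinkage described in these comments can be checked numerically (a sketch, reusing fit_all and predictions_all from the question's code). For OLS fitted values, regressing predictions on truth gives a slope equal to $R^2$, while the flipped regression has slope exactly 1, which is why the flipped plot looks straight:

# Regressing fitted values on the true response gives slope = R^2 ...
coef(lm(predictions_all ~ age))[2]
summary(fit_all)$r.squared           # matches the slope above
# ... while regressing the true response on the fitted values gives slope 1,
# because OLS residuals are orthogonal to the fitted values:
coef(lm(age ~ predictions_all))[2]   # exactly 1, up to rounding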

2 Answers

5

The short answer is that your variables are generated from age. When you then regress age on the variables, your model violates one of the assumptions of linear regression, called exogeneity. In a nutshell, this assumption requires that the errors be uncorrelated with the regressors. However, by construction your errors and your variables contain the same pseudo-random sequences, therefore they are correlated.

Here's how it happens. Your data generation process (DGP) is $$x_{it}=t+\varepsilon_{it},$$ where $x_{it}$ is a variable and $\varepsilon_{it}$ is its error (I drop the constant 100 for simplicity). We also know that the errors are independent and Gaussian: $\operatorname{cov}[\varepsilon_{it},\varepsilon_{jt}]=\sigma^2\delta_{ij}$. So, the only model that is consistent with this DGP is $$X_t=tB+E_t,$$ where $X_t,B,E_t$ are vectors.
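As a quick check of this direction against the question's data (a sketch; the dropped constant 100 reappears as the intercept), the regression that matches the DGP recovers a slope near 1:

# Regressing a variable on age matches the DGP: intercept near 100, slope
# near 1 on average (the estimate is noisy with n = 50 and an error sd of 45).
coef(lm(variable1 ~ age, data = data))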

However, you're trying to invert the problem, so to speak. Let's see why it leads to an issue. Consider a linear combination of your variables: $$\sum_ix_{it}\beta_i=t\sum_i\beta_i+\sum_i\beta_i\varepsilon_{it}$$ Re-arrange it as follows to get to a form that starts to look similar to your model: $$t=\sum_i\frac{\beta_i}{\sum_j\beta_j}x_{it}-\sum_i\frac{\beta_i}{\sum_j\beta_j}\varepsilon_{it}$$

If we introduce the scaled coefficients $\beta'_i=\frac{\beta_i}{\sum_j\beta_j}$ and substitute them into the equation above, we get the following: $$t=\sum_i\beta'_ix_{it}-\sum_i\beta'_i\varepsilon_{it}$$

Since a linear combination of Gaussians is itself Gaussian, we can introduce a new stochastic variable $\xi_t=-\sum_i\beta'_i\varepsilon_{it}$, with $\xi_t\sim\mathcal N\left(0,\sigma^2\sum_i\beta'^2_i\right)$, hence we have what looks like your regression model at first glance: $$t=\sum_i\beta'_ix_{it}+\xi_t$$

Despite the similarity, there is a big difference. For the linear regression model $y=X\beta+\epsilon$ to possess nice properties such as unbiasedness, it needs to satisfy certain conditions, called the Gauss-Markov conditions, including one called exogeneity: $E[\epsilon\mid X]=0$. This is exactly what you wanted to see, i.e. residuals that are centered at zero regardless of the level of the predictors.

Unfortunately, the dataset that you created violates this condition by design. Intuitively it's easy to see if you consider that the pseudo-random errors you used to generate the regressors are also the only source of randomness in your dataset, hence they also form the "errors" of the inverted model. Therefore, the errors in your model must be correlated with the regressors, which violates exogeneity. You observe this as a bias in the errors that depends on the level of the variables $X$.
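A small simulation makes the violation visible (a sketch with two variables and $\beta_1=\beta_2=1$, so $\beta'_i=1/2$ and the implied error is $\xi_t=-(\varepsilon_{1t}+\varepsilon_{2t})/2$):

# Sketch: the implied error of the inverted regression is correlated with
# the regressors by construction (two-variable version of the question's DGP).
set.seed(1)
n <- 1e4
t_age <- runif(n, 11, 60)
eps1 <- rnorm(n, 0, 45)
eps2 <- rnorm(n, 0, 45)
x1 <- t_age + 100 + eps1
x2 <- t_age + 100 + eps2
xi <- -(eps1 + eps2) / 2   # implied error of regressing t on (x1, x2)
cor(x1, xi)                # clearly negative, not zero: exogeneity fails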

Aksakal
  • Thanks, this is not completely clear to me. Are you saying that this happens because I have errors in the variables and not in the outcome? – rep_ho Aug 16 '18 at 14:58
  • Yes, you are trying to predict the predictor from the outcomes. I'm not going into further analysis of how exactly it should impact your model, but this setup effectively leads to misspecification – Aksakal Aug 16 '18 at 14:59
  • An easy way to understand the issue is to think about what the coefficients of your regression should be, given the way you created your sample dataset – Aksakal Aug 16 '18 at 15:04
  • Specifically this is an errors-in-variables model; see https://en.wikipedia.org/wiki/Errors-in-variables_models – Frank Aug 16 '18 at 15:05
  • I guess the coefficients should be around 1, plus an intercept. This is a small simulation, but I see the same behaviour in my real data too, where I'm trying to predict age with a bunch of probably noisy measurements. – rep_ho Aug 16 '18 at 15:07
  • @rep_ho I do not think this answers the question. If OP had simulated age to be Gaussian then his model would be correctly specified (it being a multivariate Gaussian) but the problem would persist. I think OP is just observing regression to the mean. – guy Aug 16 '18 at 15:07
  • OK, I get now why they are not 1 – rep_ho Aug 16 '18 at 15:27
  • run your code with different seeds and watch the coefficients – Aksakal Aug 16 '18 at 15:31
  • but can't all my errors in the variables be treated as one scalar error that is the sum of these partial errors? It's possible that I simulated them to be pretty collinear, but that shouldn't affect the predictions – rep_ho Aug 16 '18 at 15:55
  • no, see my update to the answer. You're trying to infer the predictor from the observed outcomes. It's a significantly different problem from the one you are regressing – Aksakal Aug 16 '18 at 17:17
  • I don't understand your edit. The variables are certainly not perfectly collinear. OP has a separate error term for each variable, and if we just impose that age is normally distributed marginally then OP's model is exactly correct. – guy Aug 16 '18 at 18:05
  • Again though, I don't see how this answers the question, because the model is not mis-specified and OP's problem persists if we just make the minor adjustment that age is normally distributed. From a statistical standpoint, the distinction between predictor/response would be irrelevant; the only difference is causal, because of how OP generated the data. – guy Aug 16 '18 at 18:27
  • @guy, I reformulated my explanation to show that conditional exogeneity is violated – Aksakal Aug 16 '18 at 21:33
  • @Aksakal the point is that, again, when the response is standard normal we can write $Y = (t,X) \sim N(\mu, \Sigma)$ for some mean vector $\mu$ and covariance $\Sigma$. It follows that the conditional distribution of $t$ can be expressed as $t = \mu_t + \Sigma_{tX} \Sigma_{XX}^{-1} (X - \mu_X) + \epsilon_t$ where $\epsilon_t \sim N(0, \sigma^2_t - \Sigma_{Xt}^{\top} \Sigma_{XX}^{-1} \Sigma_{Xt})$ and $\epsilon_t$ is *independent of $X$*. Hence OP's model is correct: the conditional distribution of $t$ is linear in $X$ with an independent error. – guy Aug 16 '18 at 22:03
  • @Aksakal I think all you are showing with your edit is that the coefficients $\Sigma_{XX}^{-1} \Sigma_{tX}$ are not given by $\beta_i / \sum_j \beta_j$. The misunderstanding is because you are thinking causally/structurally, as the data were generated (with $t$ as a cause). If you just think in terms of an association, when you have a multivariate Gaussian you can *always* decompose $t = \alpha + X^{\top}\beta + \epsilon$ where $\epsilon \perp X$; you just won't necessarily get the $\beta$ that your structural model is suggesting (see the sketch below). – guy Aug 16 '18 at 22:12
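A sketch of this comment's point (the Gaussian age distribution and the seed are assumptions): with age drawn from a normal distribution, the linear model for age given the variables is exactly the conditional mean, yet the shrinkage persists.

# Make age Gaussian so that (age, x1, x2) is jointly Gaussian and the
# regression of age on the variables is correctly specified.
set.seed(2)
n <- 1e4
g_age <- rnorm(n, 35, 15)
x1 <- g_age + 100 + rnorm(n, 0, 45)
x2 <- g_age + 100 + rnorm(n, 0, 45)
fit <- lm(g_age ~ x1 + x2)

# The coefficients match Sigma_XX^{-1} Sigma_Xt from the comment ...
X <- cbind(x1, x2)
solve(cov(X), cov(X, g_age))      # approx. equal to coef(fit)[-1]
# ... and predictions still shrink toward the mean:
coef(lm(fitted(fit) ~ g_age))[2]  # approx. R^2 < 1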
0

When I use your code I get a different result, because of the random generation of the data; see the image below. If you use set.seed(354) (putting various numbers in place of 354) you will get different pictures. The heteroskedasticity of the residuals shouldn't be something that persists if you were to repeat the exercise several times (see the sketch below).

[Figure: the same two plots generated with a different seed]
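A sketch of that suggestion, looping over seeds instead of re-running by hand (the seed range 1:20 is arbitrary): the slope of the predicted-vs-true regression can be collected for each repetition.

# Repeat the question's simulation across seeds and record the slope of
# predicted-vs-true age each time.
slopes <- sapply(1:20, function(s) {
  set.seed(s)
  age <- (1:50) + 10
  X <- replicate(4, age + 100 + rnorm(50, 0, 45))
  fit <- lm(age ~ X)
  coef(lm(fitted(fit) ~ age))[2]
})
summary(slopes)   # distribution of the slope across repetitions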

Chris Umphlett
  • Thanks for your answer. I don't care about the heteroscedasticity of the residuals in the second plot; I care about the slope of the regression line in the first plot, which indicates overestimation of age for younger samples and underestimation for older ones – rep_ho Aug 16 '18 at 14:56