
I read somewhere,

In a regression problem, if the relationship between each predictor variable and the criterion variable is nonlinear, then the predictions may systematically overestimate the actual values for one range of values on a predictor variable and underestimate them for another.

Does anyone have an explanation for this behavior?

Farhaneh Moradi
  • Nonlinearity **is** the explanation, assuming you fit a linear regression model. Just draw some nonlinear curve together with a straight line approximating it. – kjetil b halvorsen Nov 03 '15 at 18:39
  • is there a way to customize this range of values? – Farhaneh Moradi Nov 03 '15 at 19:19
  • Why do you want your algorithm to predict some wrong value? – kjetil b halvorsen Nov 03 '15 at 19:22
  • 1
  • Consider: I have a dataset whose target values range from -1 to 1. For test cases with a target value near 1, I want the predicted value to be larger than the real value in the test set. For example, if the real value in a test case is 0.9, I want my regression algorithm to predict 0.95. And for test cases with a target value near -1, I want the predicted value to be smaller than the real value. For example, if the real value in a test case is -0.9, I want my regression algorithm to predict -0.95. Is it possible? – Farhaneh Moradi Nov 03 '15 at 19:48
  • ?? I still do not understand why you want this. Also, I do not see connection with the question as posted. – kjetil b halvorsen Nov 03 '15 at 19:55
  • 1
  • The question seems perfectly clear to me. I don't see that the OP *wants* to do this; they just seem to want to *understand* it, which I think is perfectly reasonable. – gung - Reinstate Monica Nov 04 '15 at 00:49

1 Answer


Here is a simple example (coded in R). Hopefully the image makes clear how a nonlinear relationship (not model), when fit with a straight line, yields regions where the predicted values are systematically overestimated and other regions where they are underestimated.

set.seed(7439)                # this makes the example exactly reproducible
x = runif(50, min=0, max=15)  # X is uniformly distributed from 0 to 15
 # this is the true data generating process: 
y = 3.7 - 2.5*x + 0.56*x^2 - 0.028*x^3 + rnorm(50, mean=0, sd=.3)
model = lm(y~x)               # here I fit a model with a linear relationship

dev.new()  # open a new plotting device (windows() is Windows-only)
  plot(x, y)                       # this plots the data
  abline(coef(model), col="gray")  # this plots the model's predicted values

[Image: scatterplot of the simulated data with the fitted straight line (gray) overlaid]
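To address the follow-up question in the comments below (what happens when the model's functional form matches the relationship), here is a sketch extending the example above. It refits the same simulated data with a cubic term via `poly()`, which matches the true data-generating process, and compares the residuals of the two fits; the misspecified linear fit shows regions of systematic bias, while the correctly specified fit does not:

```r
set.seed(7439)                # same simulation as above
x = runif(50, min=0, max=15)
y = 3.7 - 2.5*x + 0.56*x^2 - 0.028*x^3 + rnorm(50, mean=0, sd=.3)

linear.model = lm(y ~ x)           # misspecified: straight line
cubic.model  = lm(y ~ poly(x, 3))  # matches the true functional form

# Residuals of the linear fit trace a curve: whole ranges of x are
# systematically over- or under-predicted.
plot(x, resid(linear.model)); abline(h=0, col="gray")
# Residuals of the cubic fit are centered on 0 everywhere: no region
# of systematic over- or under-estimation remains.
plot(x, resid(cubic.model)); abline(h=0, col="gray")
```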

gung - Reinstate Monica
  • 132,789
  • 81
  • 357
  • 650
  • thanks, but what happens when we have a nonlinear relationship and also model it with a nonlinear algorithm? – Farhaneh Moradi Nov 04 '15 at 05:11
  • 1
  • If the model's functional form is appropriate, you should be fine. – gung - Reinstate Monica Nov 04 '15 at 05:29
  • thanks, but what happens when we have a nonlinear relationship and also model it with a nonlinear algorithm? I saw overestimated and underestimated predictions at low and high target values, respectively, when using a nonlinear model too. – Farhaneh Moradi Nov 04 '15 at 05:32
  • This was just answered by @gung. If you pick the correct functional form of the nonlinear model, you should be fine. There shouldn't be a systematic underestimate or overestimate. – StatsStudent Nov 04 '15 at 05:52
  • We can go a bit further and investigate the goodness of fit. `summary(model)` gives `Multiple R-squared: 0.04939, Adjusted R-squared: 0.02959, F-statistic: 2.494 on 1 and 48 DF, p-value: 0.1208`. Observe the R-squared: it is very low. About 5% indicates that the model explains almost none of the variability of the response data around its mean. – Mohammad Kibria Apr 20 '20 at 04:45
  • @MohammadKibria, R-squared is not goodness of fit (see [Is $R^2$ useful or dangerous?](https://stats.stackexchange.com/q/13314/)). You can have the same situation as here w/ R-squared as high as you like. – gung - Reinstate Monica Apr 20 '20 at 05:52
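The point in the last comment (that a high R-squared does not rule out this kind of systematic bias) can be illustrated with a variant of the answer's example. This is a sketch: the coefficient `10*x` is an arbitrary choice that simply makes the linear trend dominate, so a straight-line fit captures most of the variance even though its functional form is still wrong:

```r
set.seed(7439)
x = runif(50, min=0, max=15)
# Same cubic shape as the answer, but with a strong linear trend added:
y = 3.7 + 10*x + 0.56*x^2 - 0.028*x^3 + rnorm(50, mean=0, sd=.3)
model = lm(y ~ x)

summary(model)$r.squared   # very high, despite the misspecified form
plot(x, resid(model))      # residuals still trace a systematic curve
abline(h=0, col="gray")
```

The R-squared here is close to 1, yet the residual plot shows the same regions of systematic over- and under-prediction as in the answer's low-R-squared example.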