
I read somewhere,

In a regression problem, if the relationship between each predictor variable and the criterion variable is nonlinear, then the predictions may systematically overestimate the actual values for one range of values on a predictor variable and underestimate them for another.

Does anyone have an explanation for this behavior?

Farhaneh Moradi
  • Nonlinearity **is** the explanation, assuming you fit a linear regression model. Just draw some nonlinear curve together with a straight line approximating it. – kjetil b halvorsen Nov 03 '15 at 18:39
  • is there a way to customize this range of values? – Farhaneh Moradi Nov 03 '15 at 19:19
  • Why do you want your algorithm to predict some wrong value? – kjetil b halvorsen Nov 03 '15 at 19:22
  • 1
  • Consider: I have a dataset whose target values range from -1 to 1. For test cases with a target value near 1, I want the predicted value to be larger than the real value in the test set. For example, if the real value in a test case is 0.9, I want my regression algorithm to predict 0.95. And for test cases with a target value near -1, I want the predicted value to be smaller than the real value. For example, if the real value in a test case is -0.9, I want my regression algorithm to predict -0.95. Is it possible? – Farhaneh Moradi Nov 03 '15 at 19:48
  • ?? I still do not understand why you want this. Also, I do not see connection with the question as posted. – kjetil b halvorsen Nov 03 '15 at 19:55
  • 1
  • The question seems perfectly clear to me. I don't see that the OP *wants* to do this; they just seem to want to *understand* it, which I think is perfectly reasonable. – gung - Reinstate Monica Nov 04 '15 at 00:49

1 Answer


Here is a simple example (coded in R). Hopefully the image makes clear how a nonlinear relationship (not model), when fit with a straight line, yields regions where the predicted values are systematically overestimated and other regions where they are underestimated.

set.seed(7439)                # this makes the example exactly reproducible
x = runif(50, min=0, max=15)  # X is uniformly distributed from 0 to 15
 # this is the true data generating process: 
y = 3.7 - 2.5*x + 0.56*x^2 - 0.028*x^3 + rnorm(50, mean=0, sd=.3)
model = lm(y~x)               # here I fit a model with a linear relationship

dev.new()  # open a new plotting device (windows() is Windows-only)
  plot(x, y)                       # this plots the data
  abline(coef(model), col="gray")  # this plots the model's predicted values

[Image: scatterplot of the simulated data with the fitted straight line (gray) overlaid]
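To address the follow-up question in the comments below (what happens when the model's functional form matches the relationship), here is a sketch extending the example above. It refits the same simulated data with a cubic term via `poly()`, which matches the true data-generating process, and compares the residuals of the two fits; the misspecified linear fit shows regions of systematic bias, while the correctly specified fit does not:

```r
set.seed(7439)                # same simulation as above
x = runif(50, min=0, max=15)
y = 3.7 - 2.5*x + 0.56*x^2 - 0.028*x^3 + rnorm(50, mean=0, sd=.3)

linear.model = lm(y ~ x)           # misspecified: straight line
cubic.model  = lm(y ~ poly(x, 3))  # matches the true functional form

# Residuals of the linear fit trace a curve: whole ranges of x are
# systematically over- or under-predicted.
plot(x, resid(linear.model)); abline(h=0, col="gray")
# Residuals of the cubic fit are centered on 0 everywhere: no region
# of systematic over- or under-estimation remains.
plot(x, resid(cubic.model)); abline(h=0, col="gray")
```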

gung - Reinstate Monica
  • 132,789
  • 81
  • 357
  • 650
  • thanks, but what happens when we have a nonlinear relationship and also model it with a nonlinear algorithm? – Farhaneh Moradi Nov 04 '15 at 05:11
  • 1
  • If the model's functional form is appropriate, you should be fine. – gung - Reinstate Monica Nov 04 '15 at 05:29
  • thanks, but what happens when we have a nonlinear relationship and also model it with a nonlinear algorithm? I saw overestimated and underestimated predictions at low and high target values, respectively, when using a nonlinear model too. – Farhaneh Moradi Nov 04 '15 at 05:32
  • This was just answered by @gung. If you pick the correct functional form of the nonlinear model, you should be fine. There shouldn't be a systematic underestimate or overestimate. – StatsStudent Nov 04 '15 at 05:52
  • We can go a bit further and investigate the goodness of fit. `summary(model)` gives `Multiple R-squared: 0.04939, Adjusted R-squared: 0.02959, F-statistic: 2.494 on 1 and 48 DF, p-value: 0.1208`. Observe the R-squared: it is very low. About 5% indicates that the model explains almost none of the variability of the response data around its mean. – Mohammad Kibria Apr 20 '20 at 04:45
  • @MohammadKibria, R-squared is not goodness of fit (see [Is $R^2$ useful or dangerous?](https://stats.stackexchange.com/q/13314/)). You can have the same situation as here w/ R-squared as high as you like. – gung - Reinstate Monica Apr 20 '20 at 05:52
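The point in the last comment (that a high R-squared does not rule out this kind of systematic bias) can be illustrated with a variant of the answer's example. This is a sketch: the coefficient `10*x` is an arbitrary choice that simply makes the linear trend dominate, so a straight-line fit captures most of the variance even though its functional form is still wrong:

```r
set.seed(7439)
x = runif(50, min=0, max=15)
# Same cubic shape as the answer, but with a strong linear trend added:
y = 3.7 + 10*x + 0.56*x^2 - 0.028*x^3 + rnorm(50, mean=0, sd=.3)
model = lm(y ~ x)

summary(model)$r.squared   # very high, despite the misspecified form
plot(x, resid(model))      # residuals still trace a systematic curve
abline(h=0, col="gray")
```

The R-squared here is close to 1, yet the residual plot shows the same regions of systematic over- and under-prediction as in the answer's low-R-squared example.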