
I'm using linear regression to predict a price, which is obviously positive. I have only one feature, gross_area. I standardized it (z-score) and got values like this:

array([[ 1.        , -0.48311432],
       [ 1.        ,  0.68052306],
       [ 1.        ,  2.1426852 ],
       [ 1.        , -1.17398593],
       [ 1.        , -0.16265712]])

Here the 1 is the constant for the intercept term. I then estimated the parameters (coefficients) and got this:

array([[ 31780004.85045217],
       [ 27347542.4693376 ]])

Here the first cell is the intercept and the second is the coefficient found for my feature gross_area.
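For context, this is roughly how I built the design matrix (a simplified sketch with made-up numbers):

import numpy as np

# a few made-up raw areas, purely for illustration
gross_area = np.array([80.0, 120.0, 250.0, 55.0, 95.0])

# z-score the single feature, then prepend a column of ones for the intercept
z = (gross_area - gross_area.mean()) / gross_area.std()
training = np.column_stack([np.ones_like(z), z])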

My problem is the following: when I take, for example, the fourth row and compute the matrix product $X\theta$ to get my prediction, I get this:

In [797]: np.dot(training[4], theta)
Out[797]: array([-325625.35640697])

That is, $31780004.85 + 27347542.47 \times (-1.17398593) \approx -325625.36$. This is totally wrong, since my dependent variable cannot be negative. It seems that because the standardization gives negative values for my feature, I end up with a negative predicted value for some rows. How is this possible, and how can I fix it? Thank you.

This is what I predicted, graphically (y = price, x = gross area):

[figure: scatter plot of price vs. gross area with the fitted regression line]

  • Can you elaborate on how price and area are related? The solution is either to use log price (though prediction gets tricky) or you need to use some sort of GLM, but the details will depend on what you're trying to model. – dimitriy Apr 08 '15 at 18:27
  • The Pearson correlation coefficient between gross_area and price is about 0.84392 – Marc Lamberti Apr 08 '15 at 18:37
  • I don't mean statistically. Why would price change with area? What is the mechanism or model? – dimitriy Apr 08 '15 at 18:48
  • Because as the gross area increases, the price increases as well. But I don't think I understand what you mean exactly :( – Marc Lamberti Apr 08 '15 at 18:53
  • I think the taking of z-scores is undesirable here. I'd be inclined to think more along the lines of GLMs, such as a gamma GLM with log-link (or perhaps a linear regression with a transformed response). – Glen_b Apr 08 '15 at 22:22
  • See the discussion in [this answer](http://stats.stackexchange.com/questions/109708/two-simple-questions-regarding-glm/109782#109782), [this one](http://stats.stackexchange.com/questions/136740/exponential-equation-fitting/136743#136743) and [this one](http://stats.stackexchange.com/questions/47870/exponent-for-non-linear-regression-in-r/108989#108989), and perhaps also [this](http://stats.stackexchange.com/questions/106395/linear-regression-makes-impossible-predictions/106404#106404). – Glen_b Apr 10 '15 at 09:42

1 Answer

Linear regression does not respect a bound at 0. It's linear, always and everywhere, so it may not be appropriate for values that are strictly positive but can be close to 0.

One way to manage this, particularly in the case of price, is to use the natural log of price.
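To see why this bites: with the coefficients from the question, the fitted line is $\hat{y} = 31780004.85 + 27347542.47\,z$, which crosses zero at $z = -31780004.85 / 27347542.47 \approx -1.162$, so any standardized area below that (such as the $-1.174$ in the question's fourth row) yields a negative prediction.

Here is a minimal numpy sketch of the log-price fit (the data is made up purely for illustration):

import numpy as np

# made-up data; substitute your own gross_area and price arrays
rng = np.random.default_rng(0)
gross_area = rng.uniform(50, 500, size=100)
price = 2000.0 * gross_area * rng.lognormal(0.0, 0.2, size=100)

# standardize the feature and add the intercept column, as in the question
z = (gross_area - gross_area.mean()) / gross_area.std()
X = np.column_stack([np.ones_like(z), z])

# ordinary least squares on log(price) instead of price
theta, *_ = np.linalg.lstsq(X, np.log(price), rcond=None)

# exponentiating the log-scale fit gives strictly positive predictions
# (these approximate conditional medians, not means; see the comments below)
predicted = np.exp(X @ theta)
assert (predicted > 0).all()

Because the fit is linear in $\log y$, the back-transformed curve $\exp(a + bz)$ is positive everywhere, and it also accommodates the curvature visible in the plot.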

RegressForward
  • You mean applying log(price) before computing the prediction? – Marc Lamberti Apr 08 '15 at 18:34
  • Yes, looks like @dimitriy has the same idea. – RegressForward Apr 08 '15 at 18:37
  • E[price | area] != exp(predicted log price | area), so prediction requires doing something like Duan smearing (a sketch appears after the comments below). – dimitriy Apr 08 '15 at 18:46
  • I actually applied a natural log to price and indeed I don't have the problem anymore, but I really want to understand why... thanks a lot. Is it related to the distribution of my dependent variable or something like that? – Marc Lamberti Apr 08 '15 at 18:51
  • There's no mystery here. A straight line $y = a + bx$ will predict negative $y$ for _some_ $x$ unless $a > 0$ and $b = 0$. That can bite within the range of your data, and is biting you. Using logarithmic scale seems natural not just because you reasonably prefer positive predictions, but also because curvature is evident too from your graph, and indeed uneven scatter. Whether you are better off with log transformation or a GLM with logarithmic link is a more subtle question. – Nick Cox Apr 08 '15 at 19:00
  • Thank you for your explanation. About your last sentence: I have seen in this post http://stats.stackexchange.com/questions/48594/purpose-of-the-link-function-in-generalized-linear-model that "linear regression assumes that the response variable is normally distributed." Is that one of the reasons to use the log scale, or is it unrelated (and it is just because of the shape of my data)? – Marc Lamberti Apr 08 '15 at 19:15
  • There are also many posts explaining why that statement (precisely that within "") is **wrong**. At most, there are some nice properties if the **error** distribution is normal. Suppose $y = a + bx + \epsilon$ and the error $\epsilon$ is uniformly distributed on $[-c, c]$. Is regression then invalid? Not so. Functional form is primary; marginal or conditional distribution is secondary. – Nick Cox Apr 08 '15 at 19:37
  • marcL -- There are three main problems with the model you fitted: (1) the relationship isn't linear; (2) the model you chose doesn't respect a known bound; (3) the spread isn't constant. The fact that the transformation would also make the conditional distribution less skew would be a bonus, rather than a requirement. (A normal assumption in regression comes in when you're trying to do inference - testing, confidence intervals - but other things can be done if you don't expect to have normality.) ... (ctd) – Glen_b Apr 08 '15 at 23:08
  • (ctd) ... However, if you take logs and want a *mean* prediction back on the original scale, you can't just exponentiate the mean on the log scale. There are several options there (though if I primarily wanted mean predictions I'd be inclined to fit a gamma GLM with log link; a very similar model but the calculation of original scale means is more direct). If I primarily wanted a *prediction interval*, the linear model in the logs is easier. – Glen_b Apr 08 '15 at 23:08