Linear Regression makes impossible predictions

Question

I have created a multiple linear regression model to predict prices and about 17% of the predicted prices have come out to be negative. Is there a way to correct for this error, or does it mean that the independent variables are not good predictors? The R^2 coefficient is 0.865.

A negative predicted price is not necessarily an error. We have no basis to expect perfect predictions. Although such a model should give one pause--clearly it does not fully or correctly capture the relationship between prices and the explanatory variables--it could still have great predictive value. (Nevertheless, I have to agree with @Glen_b's answer suggesting that you consider using a more realistic model.) — whuber, Jul 11 '14 at 12:51

Glen_b · Accepted Answer · 2014-07-11T05:38:23.067

It's not the predictors (IVs) that are the problem, but the analysis.

Indeed, the mere existence of an impossible part of the range of the response (you can't have negative prices) is a hint to consider something other than ordinary multiple regression (at least on the untransformed variable), since that model can clearly produce negative predictions.

There are a number of ways of dealing with non-negative variables, but two fairly simple approaches might be worth considering:

- modelling log-price; this is a common strategy with price-like variables in economics
- using generalized linear models (GLMs). A gamma-model with a log-link would be quite similar to modelling log-price, but the model would be for the expected price rather than expected log-price. This may have some advantages. If you need the relationship with the predictors to be linear in actual price, this can be done (identity link), but a log link for this sort of data would be more common.
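To make the first option concrete, here is a minimal sketch in plain Python. The data are made up for illustration, and the single-predictor `ols` helper is a hypothetical stand-in for whatever regression routine you actually use. It only shows that a straight-line fit to raw price can dip below zero, while a fit to log-price cannot once it is transformed back.

```python
import math

# Made-up illustrative data: one predictor x and a strictly positive price.
x = [1, 2, 3, 4, 5, 6, 7, 8]
price = [2.0, 3.0, 5.0, 8.0, 13.0, 22.0, 36.0, 60.0]

def ols(xs, ys):
    """Least-squares (intercept, slope) for a single predictor."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((xi - mx) ** 2 for xi in xs)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

# Regression on the raw price: nothing stops the fitted line going below zero.
a_raw, b_raw = ols(x, price)
raw_pred = [a_raw + b_raw * xi for xi in x]

# Regression on log-price: back-transformed predictions are exp(...), hence positive.
a_log, b_log = ols(x, [math.log(p) for p in price])
log_pred = [math.exp(a_log + b_log * xi) for xi in x]

print("smallest raw-scale prediction:", min(raw_pred))
print("smallest back-transformed prediction:", min(log_pred))
```

With these particular numbers the raw-scale fit gives a negative prediction at the low end of `x`, whereas every back-transformed prediction is positive by construction.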

As Bill mentions in comments, there's an issue with log-price when you transform back to original units: exponentiating the fitted value no longer gives a model for the expectation, but for the median. That said, if you assume normality on the log scale you can easily compute a maximum-likelihood (ML) or method-of-moments (MOM) estimate of the mean on the original scale.
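A small simulation may help show the size of the gap. This sketch assumes log-prices are exactly normal with made-up parameters (`mu = 3`, `sigma = 1`) and no predictors; in a regression, `m` would be the fitted log-scale value and `s2` the residual variance. The plug-in correction `exp(m + s2 / 2)` is the standard lognormal mean formula, not something specific to this thread.

```python
import math
import random
import statistics

random.seed(0)

# Assumed (made-up) log-scale parameters for the simulation.
mu, sigma = 3.0, 1.0
prices = [random.lognormvariate(mu, sigma) for _ in range(100_000)]

logs = [math.log(p) for p in prices]
m = statistics.fmean(logs)       # estimated mean of log-price
s2 = statistics.variance(logs)   # estimated variance of log-price

naive = math.exp(m)              # tracks the median of price, not the mean
corrected = math.exp(m + s2 / 2) # plug-in estimate of the mean under normality

sample_median = statistics.median(prices)
sample_mean = statistics.fmean(prices)

print("exp(m):", naive, "vs sample median:", sample_median)
print("exp(m + s2/2):", corrected, "vs sample mean:", sample_mean)
```

The naive back-transform lands near the sample median, and the corrected value lands near the sample mean, which is the distinction the paragraph above is drawing.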

(And if you can't assume normality on the log-scale you can still approximate the expectation on the original scale via a Taylor expansion.)
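One way to write that approximation, as a sketch rather than the only possible expansion: take $Z = \log Y$ with mean $\mu$ and variance $\sigma^2$ (symbols introduced here for illustration), and expand $e^{Z}$ to second order around $\mu$. The first-order term drops out because $\mathbb{E}[Z - \mu] = 0$.

```latex
\mathbb{E}[Y] = \mathbb{E}\!\left[e^{Z}\right]
\approx e^{\mu} + \tfrac{1}{2}\,e^{\mu}\,\sigma^{2}
= e^{\mu}\left(1 + \tfrac{\sigma^{2}}{2}\right),
\qquad Z = \log Y,\quad \mu = \mathbb{E}[Z],\quad \sigma^{2} = \operatorname{Var}(Z).
```

Under normality the exact value is $e^{\mu + \sigma^{2}/2} = e^{\mu}\left(1 + \sigma^{2}/2 + \sigma^{4}/8 + \dots\right)$, so the two agree to first order in $\sigma^{2}$ and the approximation is best when the log-scale variance is small.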

Prediction intervals for a new observation, however, transform just fine.
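The reason is that quantiles survive a monotone transformation, so you can simply exponentiate the endpoints. Here is a rough sketch under simplifying assumptions: the predicted log-price `mu` and residual standard deviation `sigma` are made-up values treated as known, so parameter uncertainty (which a real prediction interval would include) is ignored.

```python
import math
import random
from statistics import NormalDist

random.seed(1)

# Hypothetical log-scale prediction and residual SD, treated as known.
mu, sigma = 3.0, 0.5

# 95% interval on the log scale ...
z = NormalDist().inv_cdf(0.975)
lo_log, hi_log = mu - z * sigma, mu + z * sigma

# ... and the same interval on the price scale: just exponentiate the endpoints.
lo, hi = math.exp(lo_log), math.exp(hi_log)

# Check the coverage on the price scale by simulating new observations.
new_prices = [math.exp(random.gauss(mu, sigma)) for _ in range(100_000)]
coverage = sum(lo <= p <= hi for p in new_prices) / len(new_prices)

print("interval on price scale:", (lo, hi))
print("simulated coverage:", coverage)
```

The lower endpoint is positive and the simulated coverage stays close to the nominal 95%, with no correction term needed.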

Also, since OP is interested in prediction, he is going to want to look at posts on the re-transformation problem. — Bill, Jul 10 '14 at 17:17

score 2 · Answer 2 · edited Apr 13 '17 at 12:44

One option would be to fit a generalized linear model with a different reference distribution that is bounded at zero (and probably more appropriate than the normal distribution, depending on what kind of prices you're modeling). For example, the negative binomial distribution suits discrete distributions of counts; this could work if your data are whole numbers representing counts of currency units. The gamma distribution is a continuous alternative (with an otherwise similar shape) that may be more suitable, as it's much more commonly used for financial data (see "Real-life examples of common distributions").
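For the gamma option, here is a bare-bones sketch of how such a fit can be computed without any library, assuming a log link and a single predictor. The data are made up, and `ols` is a hypothetical helper. It uses iteratively reweighted least squares; for the gamma family with a log link the working weights are all equal to 1, so each step reduces to an ordinary least-squares fit to the working response `eta + (y - mu) / mu`. In practice you would use a GLM routine from a statistics package instead.

```python
import math

# Made-up illustrative data: one predictor x and a strictly positive price.
x = [1, 2, 3, 4, 5, 6, 7, 8]
price = [2.0, 3.0, 5.0, 8.0, 13.0, 22.0, 36.0, 60.0]

def ols(xs, ys):
    """Least-squares (intercept, slope) for a single predictor."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((xi - mx) ** 2 for xi in xs)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

# IRLS for a gamma GLM with log link, started from eta = log(price).
eta = [math.log(p) for p in price]
for _ in range(100):
    mu = [math.exp(e) for e in eta]
    # Working response; the working weights are constant for this family/link.
    z = [e + (p - m) / m for e, p, m in zip(eta, price, mu)]
    a, b = ols(x, z)
    new_eta = [a + b * xi for xi in x]
    converged = max(abs(n - e) for n, e in zip(new_eta, eta)) < 1e-10
    eta = new_eta
    if converged:
        break

fitted = [math.exp(e) for e in eta]  # expected price, always positive

# At the solution the gamma/log-link score equations should be (near) zero.
score_intercept = sum((p - m) / m for p, m in zip(price, fitted))
score_slope = sum(xi * (p - m) / m for xi, p, m in zip(x, price, fitted))

print("coefficients:", a, b)
print("fitted expected prices:", fitted)
```

Because the fitted values are `exp(a + b * x)`, they are positive for any `x`, and unlike the log-price regression they are estimates of the expected price itself.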

Negative predictions do not reflect badly on the utility of your predictors.

answered Jul 09 '14 at 22:44 by Nick Stauner (edited Apr 13 '17 at 12:44 by Community)

It doesn't strike me as valid to say that $12 is a count variable with 12 currency units as the realization. You probably want something like Gamma. – gung - Reinstate Monica Jul 09 '14 at 22:47

@gung: good to know. Is there a simple explanation for your reasoning, or somewhere I should look for a not-so-simple explanation? I see from [Greg Snow's answer here](http://stats.stackexchange.com/a/33797/32036) that, 'Conceptually [the negative binomial distribution] is the number of "failures" before k "successes".' That sure doesn't seem right, but why does it matter? (Good material for a new question...?) – Nick Stauner Jul 09 '14 at 22:51

I just don't think it's meaningful to call it a *count* in the sense that the negative binomial is a distribution of counts. Gamma is a continuous non-negative distribution and is often used for things like that. When you aren't right up against 0, it can look fairly normal, which can be a plus when people think the data ought to be normal. – gung - Reinstate Monica Jul 09 '14 at 22:54
@gung: But any real amount of currency (i.e., disallowing fractional values) isn't really continuous – $12.00 is exactly that – more zeros would be meaningless, or would at least require abstraction from the reality of physical currency (though maybe that's commonplace since the advent of credit). – Nick Stauner Jul 09 '14 at 22:59

All counts are discrete & non-negative, but not all non-negative discrete variables are counts. It may be that you can use the NB successfully in this situation, but it feels a bit like shoehorning to me. – gung - Reinstate Monica Jul 09 '14 at 23:10
True enough I'm sure, but at least with **physical** money, I don't see why one couldn't call it a count of discrete units. Still, that might very well be shoehorning (again, especially in the modern financial age), and I have no reason to recommend the NB over the gamma distribution. – Nick Stauner Jul 09 '14 at 23:16

12 surgical complications at a given hospital in May are 12 different discrete events; 12 accidents at a given intersection in May are 12 different discrete events; etc.; $12 for a given purchase is 1 event. Yes, you are counting out dollars, but I don't see it as really being the same ontologically. – gung - Reinstate Monica Jul 09 '14 at 23:30
A count can *only* take integer values. In this case, the $ amount may have been rounded to the nearest dollar, but it could be in between. Of course, if you get down to it, nearly every numeric variable would be a count, if you measured finely enough. I weigh 190 pounds. Oops, I meant 1,202,201,012,491 atoms (or whatever). – Peter Flom Jul 10 '14 at 00:39
I did say "if your data are whole numbers", and the question didn't say they're not (prices can be measured in cents, though this is probably rare). Either way, the buck stops at cents, which is a far cry from expressing pounds as quarks. Furthermore, I edited in the gamma distribution as a more popular alternative, and the comments provide the evidently popular rationale (which I still find a little unsatisfying). Given a downvote, the tangential focus on my initial example is starting to get a little silly. I'd remove it entirely if it wouldn't confuse the context of this comment thread... – Nick Stauner Jul 10 '14 at 00:48
One critical difference between counts and discretized units is that true counts are unit-free. With money, I should expect to get the same outcomes whether I measure in cents or dollars but I won't get that if I treat it as any of the usual models for a count (and indeed that's a big hint to look at the log scale). – Glen_b May 29 '19 at 23:23
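The unit argument in that last comment can be checked numerically. This is a minimal sketch with made-up prices; `dispersion_ratio` is a hypothetical helper name. Count models tie the variance to the mean (for a Poisson the variance equals the mean), so the variance-to-mean ratio is what they are sensitive to, and that ratio changes when you switch from dollars to cents. The spread of log-price does not.

```python
import math
import statistics

# Made-up prices, recorded once in dollars and once in cents.
dollars = [4.0, 7.5, 12.0, 19.99, 35.0, 60.0]
cents = [100 * d for d in dollars]

def dispersion_ratio(values):
    """Sample variance divided by sample mean (equal to 1 in expectation for Poisson counts)."""
    return statistics.variance(values) / statistics.fmean(values)

# The variance-to-mean ratio depends on the unit of measurement ...
ratio_dollars = dispersion_ratio(dollars)
ratio_cents = dispersion_ratio(cents)

# ... but on the log scale a change of units only adds a constant, so the spread is unchanged.
sd_log_dollars = statistics.stdev([math.log(d) for d in dollars])
sd_log_cents = statistics.stdev([math.log(c) for c in cents])

print("variance/mean in dollars:", ratio_dollars)
print("variance/mean in cents:", ratio_cents)
print("SD of log-price:", sd_log_dollars, sd_log_cents)
```

The ratio is 100 times larger in cents than in dollars, so a count model that happened to fit in one unit could not also fit in the other, while the log-scale standard deviation is the same either way.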
