
I was doing a Kaggle playground problem with the Ames house pricing dataset and found the distribution of sale prices to be heavily skewed. [Histogram of sale prices omitted.]

One tutorial points out that skewed data are bad for regression modeling and that one should "unskew" the data by taking the natural log, but it offers no justification for why this should be so. To me, the skewness is part of the data and should not be tampered with, not least to avoid overfitting.

Am I wrong on this? Can anyone explain the reason why unskewing is a valid practice and what effects it would have in terms of the error rate?

Chester Cheng

  • A general term here is transformation. I have added that to your tags and there are >1000 threads here, many of them relevant. In this case working with log price rather than price is attractive on many grounds and the fact that log of price is typically less skewed than price isn't even the most important. Indeed, generalised linear models would allow you to use logarithmic link so that you have it both ways, fit on log scale but get predictions on the original scale. – Nick Cox Aug 22 '17 at 11:21
  • I wouldn't assume familiarity with the Ames house pricing dataset. I would recommend stating what your goal is: is price a response or outcome you are trying to model or is it a predictor? (If you are coming from a machine learning background your terminology may differ.) – Nick Cox Aug 22 '17 at 11:23
  • What you're looking at here is the distribution of the data. The distributional assumptions of linear regression apply specifically to the distribution of the _errors_ in the data, that is the part not explained by the linear model. It is only problematic if those errors are skewed (or otherwise deviate from a Normal distribution). The data can have any distribution. See also [this question](https://stats.stackexchange.com/questions/247986/does-regression-work-on-data-that-isnt-normally-distributed/247988#247988). – Ruben van Bergen Aug 22 '17 at 11:27
  • @NickCox Thanks for the edit. The goal is to predict sale prices given a set of explanatory variables. I will go check out the generalized linear model you talked about – Chester Cheng Aug 22 '17 at 11:46
  • See also http://blog.stata.com/2011/08/22/use-poisson-rather-than-regress-tell-a-friend/ – Nick Cox Aug 22 '17 at 11:49
  • @NickCox do you want to make your comment into an answer? I think it's really close to an answer now. I could write an answer, but I don't want to step on any toes. – Peter Flom Aug 22 '17 at 12:07
  • @Peter Flom You're not stepping on my toes. Feel free to write an answer and use any or all of my comments. – Nick Cox Aug 22 '17 at 12:14
  • The issue with regression is really about the conditional distribution of the response, not the marginal distribution -- one might be skewed while the other is not. Often there's little point in considering the shape of the marginal distribution. But more important than the shape of the distribution is getting the relationship between the variables right; that's the main reason to transform (typically to achieve near-linearity); the second-best reason is to make the variance close to constant. – Glen_b Aug 22 '17 at 12:19
  • My comments are completely in line with Glen_b's; the standard econometric line, represented in these comments, that plain linear regression usually works better than you fear doesn't quite catch the point that it may not offer the best functional form. I'd add additivity to make (linearity, additivity, equal scatter) the three desirables of most importance. – Nick Cox Aug 22 '17 at 12:28
  • These issues are rather thoroughly discussed at https://stats.stackexchange.com/questions/298, which strikes me as being exactly the same question. – whuber Aug 22 '17 at 17:47

1 Answer


Nick Cox makes many good points in his comments. Let me put some of them (and some of my own) into answer format:

First, ordinary least squares regression makes no assumption that the dependent variable is normally distributed; it makes assumptions about the errors being normal, and the errors are estimated by the residuals. However, when the dependent variable is as skewed as yours is, the residuals usually will be too.
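The distinction between the response's distribution and the residuals' distribution can be illustrated with a small synthetic sketch (NumPy and SciPy assumed; the data are made up): here the response is strongly skewed, yet the residuals from a correctly specified linear fit are not.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)

# A skewed predictor (say, house size) with a linear relationship and
# normal errors: the *response* is skewed but the *residuals* are not
size = rng.lognormal(mean=0, sigma=0.8, size=2000)
price = 10 + 3 * size + rng.normal(0, 1, size=2000)

# Ordinary least squares via least-squares solve
X = np.column_stack([np.ones_like(size), size])
beta, *_ = np.linalg.lstsq(X, price, rcond=None)
resid = price - X @ beta

print(f"skew of price:     {skew(price):.2f}")   # clearly skewed
print(f"skew of residuals: {skew(resid):.2f}")   # close to zero
```

In this toy setup the marginal skewness of the response comes entirely from the skewed predictor, which is exactly the situation the comments describe.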

Second, the emphasis on transformation for statistical reasons that you find in many introductory books is because the book wants to show how a person can use OLS regression in different situations (and, unfortunately, it's true that some professors in non-statistics courses don't know about alternatives). In older books, it may also be because some of the alternative methods were too computer intensive to be usable.

Third, I think data should be transformed for substantive reasons, not statistical ones. Here, and for price data more generally, it often makes sense to take the log. Two reasons are that 1) People often think about prices in multiplicative terms rather than additive ones - the difference between \$2,000,000 and \$2,001,000 is really small. The difference between \$2,000 and \$2,100 is much bigger. 2) When you take logs, you can't get a negative predicted price.
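A minimal sketch of the log approach (synthetic data; NumPy assumed): fit OLS to log price, then exponentiate the fitted values, which by construction can never produce a negative predicted price.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical price data with multiplicative (percentage-style) errors
size = rng.uniform(50, 300, 500)              # e.g. square metres
price = 1000 * size * rng.lognormal(0, 0.2, 500)

# Fit OLS on log(price); exponentiating the fit cannot go negative
X = np.column_stack([np.ones_like(size), size])
beta, *_ = np.linalg.lstsq(X, np.log(price), rcond=None)
pred = np.exp(X @ beta)                       # back-transformed predictions
```

One caveat worth knowing: exponentiating the fitted log gives (roughly) the conditional median rather than the conditional mean when the log-scale errors are symmetric, so mean predictions on the dollar scale need a retransformation correction.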

Fourth, if you decide not to transform (for some reason) then there are methods that do not assume that the residuals are normal. Two prominent ones are quantile regression and robust regression.

Peter Flom
  • " However, when the dependent variable is as skewed as yours is, the residuals usually will be too." makes implicit assumptions. I suspect this might be misleading. – Roland Aug 22 '17 at 14:19
  • The constraint is on conditional density and not the marginal density, no? The marginal density can be skewed like here but does it imply the same for conditional? – user8463728 Aug 22 '17 at 14:39
  • @Roland Implicit assumptions--of course. Misleading? Perhaps not so much. I can recall only one kind of situation that has appeared among the tens of thousands of regression datasets on this site where the residuals might not be skewed despite strong skewness in the regressors: the response includes independent measurement error whose variance is much greater than the variance of the model error. BTW, these issues (and more) are covered much more thoroughly at https://stats.stackexchange.com/questions/298 – whuber Aug 22 '17 at 17:46