
All,

I am trying to create a regression model where the (continuous) outcome is multimodal:

[Image: histogram of the outcome (retail price)]

The outcome is the retail price of a certain product, and prices tend to fall around distinct amounts (750, 1000, 1250, 1500, etc.). There are, however, a few prices in between, so the prices are not strictly discrete.

I have run a linear model with satisfactory results, though the extra prices between the modes give me pause. I also tried binning the prices into a few groups representing the modes, and that works somewhat well.

Is there a better way to model this? And is there a preferred methodology for binning the outcome?

Thank you

Asked by PeteL; edited by Macro
  • Hey Pete, I added the plot for you. It may be a little bit more informative if you could lower the bin size (using, e.g. `breaks=50` in your call to `hist()` in `R`) so we can really see the shape of the distribution. Also, I agree with what Peter Flom says below, which is similar to what I said to you when we were talking about this earlier, so it may be helpful to see a histogram of the residuals of an ordinary linear model to see whether Peter's answer does solve your problem. – Macro Oct 19 '12 at 21:24
  • In line with Peter's point (below) that the residuals, not the outcome, need to be normally distributed, this may be helpful: [what-if-residuals-are-normally-distributed-but-y-is-not](http://stats.stackexchange.com/questions/12262/). – gung - Reinstate Monica Oct 19 '12 at 21:32

1 Answer


OLS regression does not assume that the dependent variable is normally distributed, nor even unimodal. It makes assumptions about the error term, as estimated by the residuals.
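A small simulation may make this concrete. The sketch below is not from the thread: it assumes a hypothetical predictor (`tiers`, a product tier from 0 to 3) that shifts the price in steps of 250 from a base of 750, with Gaussian noise of standard deviation 30. All of those numbers and names are assumptions chosen to echo the price points in the question. Under those assumptions the outcome is strongly multimodal, yet the residuals from a simple OLS fit are tightly clustered around zero.

```python
import random
import statistics

random.seed(0)

# Hypothetical data: a product tier (0..3) drives the price, plus Gaussian noise.
# The base (750), step (250), and noise SD (30) are assumptions for illustration.
tiers = [random.randrange(4) for _ in range(2000)]
prices = [750 + 250 * t + random.gauss(0, 30) for t in tiers]

# Closed-form OLS for a single predictor: slope = Sxy / Sxx.
mean_x = statistics.mean(tiers)
mean_y = statistics.mean(prices)
sxx = sum((x - mean_x) ** 2 for x in tiers)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(tiers, prices))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residuals estimate the error term; the OLS assumptions concern these, not y.
residuals = [y - (intercept + slope * x) for x, y in zip(tiers, prices)]

print("slope:", round(slope, 1), "intercept:", round(intercept, 1))
print("SD of prices:   ", round(statistics.stdev(prices), 1))
print("SD of residuals:", round(statistics.stdev(residuals), 1))
```

The spread of the prices here comes almost entirely from the predictor, so the multimodality of the outcome says little about the shape of the errors. Whether the same holds for real price data depends on whether the available predictors explain the clumping, which a histogram of the residuals would show.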

Many variables exhibit "clumping" at certain round numbers and this is not necessarily problematic for regular regression.

Categorizing, or binning, continuous data is very rarely a good idea. However, if there are very few prices between the round numbers, this may be a case where it does make sense. If you do this, you should no longer use OLS; use ordinal logistic regression (or some other ordinal model) instead.
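One way to check whether binning is defensible is to measure how many prices actually fall between the round numbers. The following is a minimal sketch under stated assumptions: the modes are taken to be multiples of 250, a price counts as "in between" if it is more than 50 away from its nearest mode, and the sample prices are made up. The helper names `nearest_mode` and `bin_prices` are hypothetical.

```python
import math

def nearest_mode(price, step=250):
    """Return the multiple of `step` closest to `price` (ties round up)."""
    return step * math.floor(price / step + 0.5)

def bin_prices(prices, step=250, tol=50):
    """Snap prices to their nearest mode and report the in-between share."""
    modes = [nearest_mode(p, step) for p in prices]
    # Prices farther than `tol` from their mode are the awkward cases.
    in_between = sum(abs(p - m) > tol for p, m in zip(prices, modes))
    levels = sorted(set(modes))
    # Integer codes preserve the ordering of the price levels.
    codes = [levels.index(m) for m in modes]
    return codes, levels, in_between / len(prices)

# Made-up prices for illustration only.
prices = [740, 760, 1005, 990, 1130, 1250, 1245, 1500, 1490, 880]
codes, levels, share = bin_prices(prices)
print("levels:", levels)
print("ordered codes:", codes)
print("share in between:", share)
```

The ordered codes are what an ordinal model would take as its response. If the in-between share is not small, binning discards real information and the continuous outcome is probably the better choice.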

Peter Flom
  • Peter, based on the description, it sounds like there is some inherent discreteness in the data - "prices tend to fall around distinct amounts (750, 1000, 1250, 1500, etc)" - which makes it less likely that the errors are approximately normal (unless there are only discrete predictors or something like that). If you did not want to categorize the data and still wanted to model this with a regression approach, what would you suggest? – Macro Oct 19 '12 at 21:27
  • Hi @Macro. I guess some sort of robust regression, although I can't say, without research, which type would be best. – Peter Flom Oct 19 '12 at 21:39
  • I think this is a good answer, but I also think that it neatly avoids the reality: we do not do linear regression for multimodal data (or at least it's not generally suggested). Can this be explained using the distribution of residuals? It would seem to me that the residuals are no longer normally distributed when we have multiple modes. – information_interchange May 10 '19 at 15:14