
I have a large data set (~17k data points) on which I would like to do a multiple regression. However, the explained variable is 0 for a large share of the data points (~6k).

In finding an appropriate model for this data set, would I be able to:

  • find the probability of the explained variable being positive using a logistic regression,
  • and then run a multiple regression using just the data points where the explained variable is positive?

When applying this model to new data, could I then use the first model to find the chance of a positive result and the second model to find the expected amount, and then multiply the chance by the expected amount to get an average value? (This won't be good for individual predictions, but across a large new dataset would it give a useful average value?)
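The multiplication described here can be written out as a short sketch. The symbols are assumptions introduced for illustration and do not appear in the original post: $Y$ is the explained variable, $x$ the predictors, $\hat{p}_i$ the fitted probability from the logistic regression and $\hat{m}_i$ the fitted conditional mean from the positive-only regression for new data point $i$. Because the response contributes nothing when it is zero, the law of total expectation gives

$$
\mathbb{E}[Y \mid x] \;=\; \Pr(Y > 0 \mid x)\,\mathbb{E}[Y \mid Y > 0,\, x] \;+\; \Pr(Y = 0 \mid x)\cdot 0 ,
$$

so the predicted average over $n$ new data points would be

$$
\frac{1}{n}\sum_{i=1}^{n} \hat{p}_i\,\hat{m}_i .
$$

This only holds if $\hat{m}_i$ really estimates a mean on the original scale of $Y$; if the second model is fitted on a transformed response (logs, for example), its predictions have to be converted back to a mean before multiplying.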

I found a couple of papers that seemed to use this approach, but they didn't go into the details of the model. Would there be a confidence interval formula for the average over a large new dataset as well?

Thanks!

Karel Macek
Tom Malkin
  • We need to know more about your problem. What are you trying to do? What is your research question? – StatsStudent Aug 04 '15 at 03:56
  • Thanks for replying StatsStudent. Unfortunately the details of the problem are confidential - I was more after whether the above approach is valid. It seems to under-estimate on new data quite a bit. How would the model be valid or invalid based on the problem? If you can't answer based on only the above info that's a shame but thanks for trying to help! :) – Tom Malkin Aug 04 '15 at 04:02
  • It sounds like you're asking about zero-inflated or hurdle models, which would be good search terms. There's some discussion of the two [here](http://stats.stackexchange.com/questions/81457/what-is-the-difference-between-zero-inflated-and-hurdle-distributions-models), but in your application the non-zero part is continuous rather than discrete (as is the case for that question). If the response data are positive when they're not zero, you may want a zero-inflated gamma or a zero-inflated lognormal, say. – Glen_b Aug 04 '15 at 04:03
  • Ah, I hadn't heard of those terms before Glen_b. Thanks for answering, I'll give those a read! – Tom Malkin Aug 04 '15 at 04:05
  • I was thinking the same thing as @Glen_b, but it would be impossible to tell if those would be appropriate approaches with such scant details of what the OP is actually trying to do. – StatsStudent Aug 04 '15 at 06:22
  • @StatsStudent I agree that we can't be sure if such things are suitable without more details – Glen_b Aug 04 '15 at 06:32
  • @Harlekuin -- do you want an answer along those lines, or would you prefer to clarify the question a little more? – Glen_b Aug 04 '15 at 06:33
  • @Glen_b To be honest guys, I think the question came from a lack of knowledge. I'm now learning about two-part tests with models for zero-inflated gamma and zero-inflated lognormal distributions. (Thanks Glen.) It seems I was on the right track given my semicontinuous dataset: the logistic regression was correct for the first part, but for the conditional positive responses I was running just a run-of-the-mill multiple linear regression, which wasn't appropriate. – Tom Malkin Aug 04 '15 at 06:59
  • @Glen_b The likelihood function for the 2 part test does seem to be just a product of the probability function and the regression on the conditional data set; however, I'm now looking at which positive-data-only regression model would be most appropriate for the second test. In any case, thanks a lot for your help guys! – Tom Malkin Aug 04 '15 at 06:59
  • Many questions here come from a lack of knowledge -- that's rather the point of the site! On the model for the positive part, I'd suggest starting with a GLM, such as a gamma GLM. So that this isn't left unanswered I'll edit some of my comments into one. – Glen_b Aug 04 '15 at 11:03

1 Answer


It sounds like you're asking about zero-inflated or hurdle models.

(These would also be good search terms.)

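There's some discussion of the two [here](http://stats.stackexchange.com/questions/81457/what-is-the-difference-between-zero-inflated-and-hurdle-distributions-models), but in your application the non-zero part is continuous rather than discrete (as it is in that question).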

If the response data are positive when they're not zero, you may want a zero-inflated gamma or a zero-inflated lognormal, say.

A hurdle model for continuous data is similar to a zero-inflated model. The main distinction is that the term 'hurdle model' tends to be used for the case where different predictors are used for identifying the 0/non-0 part than for modelling the positive part.

For a zero-inflated gamma it's common to model the positive part using a GLM. For a zero-inflated lognormal it's more typical to model the logs of the positive part (e.g. with a regression model).
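As a rough illustration of the lognormal variant, here is a minimal sketch using only the Python standard library. Everything in it is an assumption made for the example: the data are simulated, there is a single predictor `x`, and the helper names `fit_logistic`, `fit_ols` and `predict_mean` are hypothetical. It fits a logistic regression for zero versus positive, fits a least-squares line to the logs of the positive values, and combines the two. One detail worth seeing in numbers: exponentiating the fitted log-mean alone gives the conditional median, so under the lognormal assumption the mean needs the extra factor `exp(s2 / 2)`, where `s2` is the residual variance on the log scale.

```python
import math
import random
import statistics


def fit_logistic(xs, zs, iters=25):
    """Newton-Raphson fit of P(z=1 | x) = sigmoid(a + b*x), one predictor."""
    a = b = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, z in zip(xs, zs):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            w = p * (1.0 - p)
            g0 += z - p            # gradient w.r.t. intercept
            g1 += (z - p) * x      # gradient w.r.t. slope
            h00 += w               # entries of X'WX
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


def fit_ols(xs, ys):
    """Least-squares line y = c + d*x; also returns the residual variance."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    d = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    c = my - d * mx
    s2 = sum((y - (c + d * x)) ** 2 for x, y in zip(xs, ys)) / (len(xs) - 2)
    return c, d, s2


# Simulated semicontinuous data (an assumption, not the asker's data):
# zero with some probability, otherwise lognormal.
random.seed(0)
xs = [random.uniform(-1, 1) for _ in range(5000)]
ys = []
for x in xs:
    p_pos = 1.0 / (1.0 + math.exp(-(0.5 + 1.0 * x)))
    if random.random() < p_pos:
        ys.append(math.exp(1.0 + 0.5 * x + random.gauss(0, 0.8)))
    else:
        ys.append(0.0)

# Part 1: zero vs positive. Part 2: regression on logs of the positives only.
a, b = fit_logistic(xs, [1 if y > 0 else 0 for y in ys])
pos_x = [x for x, y in zip(xs, ys) if y > 0]
pos_logy = [math.log(y) for y in ys if y > 0]
c, d, s2 = fit_ols(pos_x, pos_logy)


def predict_mean(x, correct=True):
    """Estimated E[Y | x] = P(Y > 0 | x) * E[Y | Y > 0, x]."""
    p = 1.0 / (1.0 + math.exp(-(a + b * x)))
    log_mean = c + d * x
    return p * math.exp(log_mean + (s2 / 2 if correct else 0.0))


observed = statistics.fmean(ys)
naive = statistics.fmean([predict_mean(x, correct=False) for x in xs])
corrected = statistics.fmean([predict_mean(x) for x in xs])
print(f"observed mean {observed:.3f}, naive {naive:.3f}, corrected {corrected:.3f}")
```

In this simulation the uncorrected average falls noticeably below the observed average, while the corrected one lands close to it. The correction relies on the lognormal assumption holding; a gamma GLM with a log link models the mean directly and so avoids the back-transformation step.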

Glen_b