2

Suppose I have a set of predictors and a non-negative integer response variable (a number of events). Every observation is repeated a few times, meaning that each combination of predictor values occurs more than once. I need to predict the average number of events for every possible combination of predictor values. I combined all observations with the same predictor values into one and assigned it the average number of events across those observations.

Next, I built four different models: OLS, OLS with a transformed response variable, a hurdle Gamma GLM and, I don't know why, a Poisson GLM. Surprisingly, Poisson was the best one. Since this is for my final qualification thesis, I need some theoretical basis, but I can't find one; I've always thought that Poisson regression assumes integer data. I hope somebody can help.

Evgenii Nikitin

2 Answers

2

Take a look at the references in this answer for why a robust Poisson model can be applied to non-integer data.

You can also motivate it in your case by saying you're modeling a rate per covariate duplicate, as in this question with time. On the other hand, I don't really see a need to aggregate. The Poisson model gives you the expected value conditional on the covariates, so it's OK to have duplicates with different outcomes but the same covariates.
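To make the first point concrete, here is a sketch of the standard argument. The notation is assumed, not taken from the post: $y_i$ is the (possibly averaged) outcome, $x_i$ the covariate vector, and $\beta$ the coefficients of a log-link model. The Poisson estimate solves the score equations

$$\sum_i \left( y_i - \exp(x_i^\top \beta) \right) x_i = 0 ,$$

and these have expectation zero whenever $E[y_i \mid x_i] = \exp(x_i^\top \beta)$, whether or not $y_i$ is an integer. Only the conditional mean has to be specified correctly for the coefficients; the Poisson variance assumption is what robust (sandwich) standard errors relax.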

dimitriy
  • First of all, thanks for the references, I'll take a look at them right now. As for why I need to aggregate: if I try to predict every observation's outcome separately, the average error is obviously higher, because there are some effects for duplicates that I don't consider in my model. Since I don't need predictions for every outcome, I don't want to report higher errors. I may be wrong here, so please correct me if I am. – Evgenii Nikitin May 06 '14 at 20:25
  • Indeed, I'm modeling a rate per duplicate, but I can't find anything about that in the quoted question. Isn't it about an exposure variable? – Evgenii Nikitin May 06 '14 at 20:49
  • Your outcome is the *total* number for a set of duplicates. The exposure variable is the number of duplicates. – dimitriy May 06 '14 at 21:03
  • I am not sure I follow your logic about the duplicates. Let me think some more about this. – dimitriy May 06 '14 at 21:05
  • Oh, that's a nice idea. I'll try it right now. Thanks. – Evgenii Nikitin May 06 '14 at 21:06
  • @EvgeniiNikitin Aggregating (averaging) your counts probably does lower your error - but it does so erroneously since you're no longer taking into consideration the variability in response for a fixed set of $X$ – Affine May 06 '14 at 21:08
  • Well, I'll try to clarify it a bit. There is a TV channel that has some time slots for broadcasting direct response ads. People watch TV, call and buy stuff like blenders. My goal is to find an optimal schedule for the following week. The schedule must be the same for the whole week. I formulated this problem as a MILP, but I need to find the coefficients of my objective function; they represent the expected profit from broadcasting the ad for a given product at a given time. Since I construct a weekly schedule, I don't need to predict an expected number of orders for every ad – Evgenii Nikitin May 06 '14 at 21:12
  • which would be inaccurate. Instead, I want to predict an average number of orders for the whole week, which is easier to do since it's more stable. I hope that clarifies it a bit. – Evgenii Nikitin May 06 '14 at 21:13
  • @Affine It's true, but I don't need to take this variability into account. My goal is to predict an average number of orders for the whole week, like I explained above. – Evgenii Nikitin May 06 '14 at 21:14
  • Well, I just realized that I could predict not the average number but the total for the week, just like Dimitriy suggested, and my problems with non-integer variables would disappear. I'll try it; if anyone has any other suggestions, that would be nice. – Evgenii Nikitin May 06 '14 at 21:24
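The options discussed in these comments can be compared in a small sketch. It fits a one-predictor Poisson log-linear model three ways: on the raw duplicated counts, on group totals with `log(n)` as an offset (the exposure idea), and on group averages weighted by the number of duplicates. The data, the helper `fit_poisson`, and the single predictor are all made up for illustration; the fit is a plain Newton-Raphson loop, not any particular library's routine.

```python
import math

def fit_poisson(x, y, offset=None, weights=None, iters=50):
    """Newton-Raphson for log(mu) = b0 + b1*x + offset (hypothetical helper)."""
    n = len(x)
    offset = offset or [0.0] * n
    weights = weights or [1.0] * n
    # Start from the intercept-only fit.
    b0 = math.log(sum(w * yi for w, yi in zip(weights, y))
                  / sum(w * math.exp(o) for w, o in zip(weights, offset)))
    b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi, oi, wi in zip(x, y, offset, weights):
            mu = math.exp(b0 + b1 * xi + oi)
            r = wi * (yi - mu)          # weighted residual (score term)
            g0 += r
            g1 += r * xi
            h00 += wi * mu              # information matrix entries
            h01 += wi * mu * xi
            h11 += wi * mu * xi * xi
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < 1e-12:
            break
    return b0, b1

# Made-up data: predictor value -> counts observed for its duplicates.
raw = {0.0: [1, 0, 2], 1.0: [2, 3, 1, 4], 2.0: [5, 3, 6]}

# (a) raw counts, one row per duplicate
x_raw = [x for x, ys in raw.items() for _ in ys]
y_raw = [y for ys in raw.values() for y in ys]
fit_raw = fit_poisson(x_raw, y_raw)

# (b) group totals with offset log(number of duplicates)
xs = list(raw)
sizes = [len(raw[x]) for x in xs]
totals = [sum(raw[x]) for x in xs]
fit_total = fit_poisson(xs, totals, offset=[math.log(n) for n in sizes])

# (c) group averages (non-integer), weighted by number of duplicates
means = [t / n for t, n in zip(totals, sizes)]
fit_avg = fit_poisson(xs, means, weights=[float(n) for n in sizes])

print(fit_raw, fit_total, fit_avg)
```

All three fits solve the same estimating equations, so the coefficients should agree up to floating-point error. What differs between the approaches is the standard errors and any error metric computed per row, not the fitted means.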
1

It seems your response variable is a non-negative integer, which matches the support of the Poisson distribution. So your question doesn't really match the title of the post (I'm confused).

As for theoretical justification, mostly what you'd have to do is show that the events (which have integer counts) follow a Poisson process, which has a few simple properties.

  • Well, my variable was a non-negative integer before I combined observations and replaced the number of events with an AVERAGE number of events, which is obviously non-integer in most cases. I don't need to predict the number of events in every particular case, since there are some random effects that I can't consider in my model; I just need an average number for a certain set of conditions (predictors). – Evgenii Nikitin May 06 '14 at 20:14
  • Sounds like your averaging gives you an estimator for the mean parameter of a Poisson distribution. That parameter is real-valued, even though the realized value of the random variable is integer-valued. It's like a coin flip: the mean outcome is 0.5 (real-valued), but the only possible outcomes are {0,1}. When you average all the 0's and 1's you get an estimator for that underlying 0.5. – TheBigAmbiguous May 06 '14 at 22:37
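The point in the last comment can be checked with a short simulation. This is only an illustrative sketch: the mean `lam = 2.3`, the sample size, the seed, and the helper `poisson_draw` (Knuth's multiplication method, adequate for small means) are all assumptions, not anything from the thread.

```python
import math
import random

def poisson_draw(lam, rng):
    """Draw one Poisson(lam) value using Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

rng = random.Random(0)   # fixed seed so the run is repeatable
lam = 2.3                # assumed true mean, real-valued
draws = [poisson_draw(lam, rng) for _ in range(20000)]

sample_mean = sum(draws) / len(draws)
print("every draw is an integer:", all(isinstance(d, int) for d in draws))
print("sample mean:", sample_mean)
```

Each draw is an integer, while the sample mean is a real number that should land close to `lam`, which is the quantity a Poisson regression models.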