
I am trying to predict the number of events happening within the next time window, given the current values of some input variables. I am trying to pick a good family of distributions to describe this data. Given the nature of the problem, the Poisson distribution seems like a natural choice.

When I plot histograms of data for different values of input variables, I see that there's a lot of mass at 0 (around 90-95% if I don't condition on anything), and the distribution for non-zero values looks like an exponential: the probability mass function is gradually decreasing. That already makes the Poisson distribution assumption questionable.

Furthermore, for the Poisson distribution the mean equals the variance. In my case, however, if I plot the mean and variance for different values of the conditioning variables, I see a linear relation between them: variance = const * mean, where const is very high. Hence, again, the Poisson distribution does not seem to be a good choice to fit this data. Which family of distributions would you suggest?

kjetil b halvorsen
Ulysses
  • Would the gamma distribution work for this? It has mean = $k\theta$ and variance = $k\theta^2$, i.e. variance = mean * $\theta$, where $\theta$ is the scale parameter and $k$ is the shape. – Eumenedies Jun 01 '17 at 09:17
  • Your question is not clear. For one thing, the exponential distribution **is for positive data**. Next, see https://en.wikipedia.org/wiki/Exponential_distribution: for the exponential distribution the mean is **not** equal to the variance. Maybe you have some other distribution in mind? It would be better if you reformulated the question, explaining your practical situation and what your variable measures, without using statistical jargon. – kjetil b halvorsen Jun 01 '17 at 09:18
  • @kjetilbhalvorsen: hope it is better now – Ulysses Jun 01 '17 at 10:07
  • Do you have count data, that is, are all the observations nonnegative integers? With so many zero counts, you could look into zero-inflated models (search this site). With variance = const * mean, the Poisson distribution will not fit, but Poisson regression can still be used, with corrections for overdispersion. See for example https://stats.stackexchange.com/questions/20826/poisson-or-quasi-poisson-in-a-regression-with-count-data-and-overdispersion – kjetil b halvorsen Jun 01 '17 at 10:57
  • @kjetilbhalvorsen: it is count data indeed. Essentially, I'm trying to predict the total size of packages arriving within the next time window of fixed length, so it's like a Poisson process with random positive integer increments. – Ulysses Jun 01 '17 at 12:08
  • Then I would start out with a (possibly zero-inflated) overdispersed Poisson regression. See https://stats.stackexchange.com/questions/45262/zero-inflated-count-models-in-r-what-is-the-real-advantage – kjetil b halvorsen Jun 01 '17 at 12:12

1 Answer


You have count data, that is, the observations are non-negative integers, which can be modelled as some kind of counting process. The high proportion of zeros can be taken care of by some kind of zero-inflated model; see for example "Zero-inflated count models in R: what is the real advantage?" (https://stats.stackexchange.com/questions/45262/zero-inflated-count-models-in-r-what-is-the-real-advantage).
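To make the zero-inflation idea concrete, here is a minimal sketch of a zero-inflated Poisson (ZIP) probability mass function using only the Python standard library. The parameter values `PI_ZERO = 0.9` and `LAM = 1.5` are illustrative assumptions, not estimates from the question's data; they are chosen only so that the zero mass lands in the 90-95% range described there.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability of observing k events with rate lam."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def zip_pmf(k, pi_zero, lam):
    """Zero-inflated Poisson: with probability pi_zero the count is a
    structural zero, otherwise it is drawn from Poisson(lam)."""
    base = (1.0 - pi_zero) * poisson_pmf(k, lam)
    return pi_zero + base if k == 0 else base

def moments(pmf, kmax=200):
    """Mean and variance of a pmf, summed numerically over 0..kmax-1."""
    mean = sum(k * pmf(k) for k in range(kmax))
    var = sum((k - mean) ** 2 * pmf(k) for k in range(kmax))
    return mean, var

# Illustrative (assumed) parameters, not fitted to any real data.
PI_ZERO, LAM = 0.9, 1.5

zip_mean, zip_var = moments(lambda k: zip_pmf(k, PI_ZERO, LAM))
print("ZIP P(0):                  ", zip_pmf(0, PI_ZERO, LAM))
print("Poisson P(0), same mean:   ", poisson_pmf(0, zip_mean))
print("ZIP variance / mean:       ", zip_var / zip_mean)
```

For a ZIP model the mean is $(1-\pi)\lambda$ and the variance is $(1-\pi)\lambda(1+\pi\lambda)$, so zero inflation by itself already produces a variance larger than the mean, and a plain Poisson with the same mean puts less mass at zero.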

You also say that $\text{variance}=\text{const} \cdot \text{mean}$ with const much larger than one. That situation is handled well by Poisson regression with a correction for overdispersion (quasi-Poisson). For any GLM (generalized linear model) the mean-variance relationship is fundamental to efficient estimation, since it determines the weights used in the IRLS (iteratively reweighted least squares, see "Can you give a simple intuitive explanation of IRLS method to find the MLE of a GLM?") estimation algorithm. But the weights are determined by the relative values of the variance, not the absolute values, so proportionality to the mean, not equality, is sufficient for the Poisson model to give optimal weights. The standard errors from the plain Poisson fit will then be wrong (too small by a factor of $\sqrt{\text{const}}$), which is what the overdispersion correction fixes. There are a lot of posts here about Poisson regression and overdispersion, so just search the site.
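A minimal sketch of this overdispersion correction, in pure Python, is below. Everything about the data is an assumption made for illustration: counts are simulated as a Poisson number of arrivals, each with a hypothetical package size drawn uniformly from 1 to 5, and there is a single covariate `x`. The fit uses Fisher scoring, which for the Poisson model with log link is equivalent to IRLS; the dispersion is estimated as the Pearson chi-square statistic divided by the residual degrees of freedom.

```python
import math
import random

def sample_poisson(rng, lam):
    """Knuth's multiplication method; adequate for small lam."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def simulate(n=4000, seed=0, b0=-3.5, b1=1.0):
    """Simulated data (assumed setup): total package size per window."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = rng.random()
        arrivals = sample_poisson(rng, math.exp(b0 + b1 * x))
        ys.append(sum(rng.randint(1, 5) for _ in range(arrivals)))
        xs.append(x)
    return xs, ys

def fit_quasi_poisson(xs, ys, max_iter=50, tol=1e-10):
    """Poisson regression log(mu) = b0 + b1*x by Fisher scoring,
    followed by a Pearson-based dispersion estimate."""
    n = len(ys)
    b0, b1 = math.log(sum(ys) / n), 0.0
    for _ in range(max_iter):
        mus = [math.exp(b0 + b1 * x) for x in xs]
        # Score vector; note that it does not involve the dispersion.
        u0 = sum(y - m for y, m in zip(ys, mus))
        u1 = sum(x * (y - m) for x, y, m in zip(xs, ys, mus))
        # Fisher information (2x2, symmetric).
        i00 = sum(mus)
        i01 = sum(m * x for x, m in zip(xs, mus))
        i11 = sum(m * x * x for x, m in zip(xs, mus))
        det = i00 * i11 - i01 * i01
        d0 = (i11 * u0 - i01 * u1) / det
        d1 = (i00 * u1 - i01 * u0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    mus = [math.exp(b0 + b1 * x) for x in xs]
    # Pearson chi-square over residual degrees of freedom (2 parameters).
    phi = sum((y - m) ** 2 / m for y, m in zip(ys, mus)) / (n - 2)
    se_naive = (math.sqrt(i11 / det), math.sqrt(i00 / det))
    se_quasi = tuple(s * math.sqrt(phi) for s in se_naive)
    return (b0, b1), phi, se_naive, se_quasi

xs, ys = simulate()
beta, phi, se_naive, se_quasi = fit_quasi_poisson(xs, ys)
print("fraction of zeros:", sum(1 for y in ys if y == 0) / len(ys))
print("coefficients:     ", beta)
print("dispersion:       ", phi)
print("naive SEs:        ", se_naive)
print("corrected SEs:    ", se_quasi)
```

The coefficient estimates are the ordinary Poisson ones; only the standard errors change, inflated by the square root of the estimated dispersion. In practice one would use an existing GLM implementation rather than hand-rolled code like this.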

Another possibility is to use the negative binomial family of distributions. That is also covered by a lot of posts; see for instance "Difference between binomial, negative binomial and Poisson regression".
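As a rough illustration of why the negative binomial can cope with both the excess zeros and the extra variance, the sketch below moment-matches a negative binomial to a given mean and variance and compares its zero probability with that of a Poisson with the same mean. The values `MEAN = 0.15` and `VAR = 0.55` are assumed for illustration only and do not come from the question's data.

```python
import math

def nb_params(mean, var):
    """Moment-match the size r and success probability p of a negative
    binomial with mean r(1-p)/p and variance r(1-p)/p**2."""
    if var <= mean:
        raise ValueError("negative binomial needs variance > mean")
    return mean * mean / (var - mean), mean / var

def nb_pmf(k, r, p):
    """Negative binomial pmf for real-valued size r, computed in logs."""
    log_pmf = (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
               + r * math.log(p) + k * math.log(1.0 - p))
    return math.exp(log_pmf)

# Illustrative (assumed) moments, not taken from real data.
MEAN, VAR = 0.15, 0.55

r, p = nb_params(MEAN, VAR)
nb_zero = nb_pmf(0, r, p)
pois_zero = math.exp(-MEAN)
print("NB size r, prob p:      ", r, p)
print("NB P(0):                ", nb_zero)
print("Poisson P(0), same mean:", pois_zero)
```

Note that in the usual negative binomial regression parametrization the variance is mean + mean²/r, which is quadratic rather than proportional to the mean, so it is worth checking against the data which mean-variance relationship fits better.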

kjetil b halvorsen