What to do when count data does not fit a Poisson distribution

Question

I'm a PhD stats student. I'm working with a data set of count data. It's counts of users who are involved in an n-way real time chat conversation. The # of users range from 1 to 6 and there are approx 300 pieces of data in the set.

My initial motivation was to understand if the data would fit a Poisson distribution, my thinking being if a good fit was found i could use this result for further inference.

To cut a long story short, I attempted to fit the data and the fit fails at a 0.05 significance level. Thus I can reject my hypothesis (that a Poisson distribution can be used to approximate the data set).

When i look at a density plot, i believe the reason why there is such a poor fit is that there are "too many recorded values for 2 users. A Poisson distribution would fit better with less values in this bin. However as i compiled the data myself I have no reason to believe there are outliers (i.e. conversations with 2 users which would be assigned to a higher or lower bin)

users <- c(1, 2, 2, 1, 1, 1, 1, 2, 2, 3, 2, 2, 2, 1, 1, 1, 2, 2, 1, 1, 4, 3, 3, 3, 1,
        2, 1, 1, 2, 4, 3, 2, 2, 1, 2, 3, 2, 2, 1, 1, 1, 2, 2, 1, 1, 1, 2, 2, 1, 3,
        2, 1, 2, 3, 2, 1, 2, 1, 2, 1, 1, 3, 1, 1, 1, 2, 2, 2, 3, 1, 2, 1, 2, 4, 4,
        3, 2, 2, 3, 4, 3, 3, 3, 1, 2, 4, 2, 3, 3, 2, 4, 3, 1, 2, 4, 1, 2, 2, 2, 1,
        1, 1, 2, 3, 2, 4, 5, 2, 2, 4, 2, 2, 3, 3, 3, 2, 2, 3, 1, 3, 1, 1, 1, 2, 3,
        6, 3, 3, 4, 2, 2, 2, 3, 1, 1, 1, 2, 2, 3, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2,
        3, 3, 3, 1, 1, 2, 1, 2, 2, 2, 2, 2, 2, 4, 3, 3, 2, 1, 2, 4, 1, 2, 1, 2, 2,
        2, 3, 2, 2, 2, 2, 2, 3, 2, 2, 1, 1, 3, 1, 2, 1, 2, 3, 4, 2, 4, 3, 2, 2, 1,
        4, 2, 2, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 3, 3, 1, 1, 2, 1, 2, 1, 3, 3, 3, 3,
        4, 6, 6, 5, 5, 2, 2, 3, 3, 3, 2, 3, 3, 4, 2, 3, 1, 3, 3, 1, 3, 2, 1, 3, 3,
        2, 1, 3, 1, 3, 2, 1, 1, 1, 1, 3, 1, 3, 4, 1, 4, 1, 3, 2, 3, 6, 2, 2, 3, 2,
        1, 2, 2, 2, 2, 2, 1, 2, 3, 2, 2, 4, 2, 2, 2, 3, 2, 2, 5, 3, 2, 2, 3, 2, 2,
        2, 5, 2, 1, 4, 1, 2, 2, 6, 1, 3, 2)


tu.fit <- goodfit(users,type="poisson", method = "MinChisq")
summary(tu.fit)
 Goodness-of-fit test for poisson distribution

             X^2 df     P(> X^2)
Pearson 69.37891  5 1.379945e-13

At undergrad level statistics I was taught that count data can be modeled by a Poisson distribution, but they never taught what to do when count data do not fit.

I'm not tied to the premise that my count data should fit Poisson (or any other distribution for that matter). However, I wanted to explore the efficacy of whether I should transform my dataset and fit to another discrete distribution or try some other approach (KDE) instead. Or should I simply conclude that my data is not a good fit for Poisson (or any other distribution) and leave it at that?

(1) Zero as a value would appear *a priori* to be impossible, thereby ruling out the Poisson as a model. But why is $1$ a possible value? What does it mean for just one person to be involved in a one-way "conversation"? (2) What do you hope to accomplish by fitting any distribution to this univariate dataset? How would that help with "future inference"? Inference about what, exactly? (3) Have you noticed that `users-1` **is** consistent with a Poisson distribution? — whuber, Jun 18 '17 at 17:34
(1) Is a possible value in the following instance: A user sends a single line message (for what ever reason) No-one responds to the message / engages in further discussion. In that case this is a single user "group chat" (2) If a Poisson fit, where possible i can ask questions such at what is the probability the next chat will contain exact 2,3,4,5 users etc. (3) I'll admit that I did not spot users-1 was consistent with a Poisson Distribution. This seems interesting, I'll need to contemplate this further — Jonathan Dunne, Jun 18 '17 at 18:43
I guess you can try negative binomial distribution or something else discussed in this related post https://stats.stackexchange.com/questions/67385/what-is-the-appropriate-model-for-underdispersed-count-data — wonghang, Jun 19 '17 at 01:09
So i had a look at the dispersion rate and with a vale of 0.36 calculated using the dispersion test function from the AER package. Looking at this from the perspective of an under dispersed dataset, some comments in section 67385 discussed the remove of specific bin with high frequency. Other comments suggest some form of hurdle re-mediation. whubers comments that `users-1` when fitted is a good fit for a Poisson distribution (which is a form of hurdling). I suspect a common sense approach needs to be applied when dealing with under-dispersed data. — Jonathan Dunne, Jun 20 '17 at 10:58
@whuber i'd be happy to accept your comments as an answer to this question. — Jonathan Dunne, Jun 20 '17 at 10:59

score 8 · Accepted Answer · answered Jun 20 '17 at 13:40

Let's consider, somewhat simplistically, the natural history of a conversation:

One person initiates a conversation by sending a message out into the ether.
People respond. Each new (unique) respondent adds one to the count.
Responses to any message are random: whether an individual responds depends on whether
- They are aware of the message
- Currently have an opportunity to respond
- Are interested in responding.
Compared to the number of people who could receive messages, the number of messages initiated is relatively low. Thus
- Almost all individuals will be responding to one, or a small manageable numbers, of messages at any time.

Characteristics (3) and (4) suggest a Poisson distribution could be a good model for the number of people responding to any message at any time: that is, the count minus one. What we do not know, and might not be safe assuming is whether all messages have approximately the same Poisson parameter or whether those parameters vary appreciably.

A good starting point, then, would be to test whether the counts minus one fit a Poisson distribution. Alternatively, they might fit some overdispersed distribution comprised of a mixture of Poissons.

The maximum likelihood estimate of the Poisson parameter $\lambda$ is the mean of the counts (minus one), equal to $1.20$. (It is important to use the ML estimate for this calculation rather than the "MinChisq" estimate computed by vcd::goodfit: see https://stats.stackexchange.com/a/17148/919.) Multiplying the Poisson probabilities by the total number of users gives the expected numbers of user counts. Here they are compared to the actual counts:

          0   1  2  3 4 5
Expected 94 113 68 27 8 2
Actual   85 127 68 22 5 5

The fit looks close. It can be measured with the chi-squared statistic, $$\chi^2 = \frac{(85-94)^2}{94} + \frac{(127-113)^2}{113} + \cdots + \frac{(5-2)^2}{2} = 9.61.$$

The six terms in this sum measure the individual count discrepancies. They are

     0    1    2    3    4    5 
  0.88 1.79 0.00 0.93 1.18 4.82

Values close to $1$ signify good agreement. Only the last value, $4.82$, is large. That is due to the small expected value of $2$ for a count of $5$. Typically, expected values less than $5$ are believed to lead to some unreliability in the traditional $\chi^2$ test: here, we should take the $\chi^2$ statistic to perhaps be a little inflated due to the small expected number of six-way conversations.

Nevertheless, this $\chi^2$ statistic is not terribly high: under the hypothesized unvarying Poisson distribution, this statistic would approximately follow a $\chi^2(5)$ distribution. That distribution tells us a value this high occurs almost nine percent of the time. We conclude there is little evidence of a departure from a constant Poisson distribution.

Incidentally, a plot of the data--in the sequence given--does hint at a variation in the counts. On average they increase a little from beginning to end, as the Lowess smooth on this plot suggests:

Thus, the chi-squared test of Poisson distribution should not be the last word: it should only be considered the beginning of a more detailed analysis.

Here is the R code used to perform the calculations and create the figure.

counts <- table(users-1)
mu <- mean(users-1)
expected <- dpois(as.numeric(names(counts)), mu) * length(users)
x <- (counts - expected)^2 / expected
print(round(x, 2)) # Terms in the chi-squared statistic
print(rbind(Expected = round(expected, 0), Actual=counts)) # Compare expected to actual

library(ggplot2)
X <- data.frame(Index=1:length(users), Count=users)
g <- ggplot(X, aes(Index, Count)) + geom_smooth(size=2) + geom_point(size=2, alpha=1/2)
print(g)

@Antoni The line of reasoning is reflected in my first comment to the question. In between writing parts (1) and (2) of that comment, a model like the one outlined in this answer occurred to me, so I quickly ran the chi-squared test against the data and included part (3) of the comment. — whuber, Jun 20 '17 at 16:34

What to do when count data does not fit a Poisson distribution

1 Answers1

Linked