
I have a dataset of roughly 2,500 people. One of the variables is the number of days on which each person shared different types of content on a social media platform. The most frequent value is 0, the means are low, and the variance is about double the mean. The range is also extremely wide. I would like to test whether a number of variables (mostly nominal and ordinal) have an effect on the number of days people share content. I checked whether my data fit a Poisson distribution, but they do not. Can someone please recommend a test I could use for this overdispersed, zero-inflated data? Or a transformation?

EDIT: I was actually not looking to build a regression model, but to test for significant differences (similar to a t-test). I've tried recoding the shared days into a binary variable (0 days shared vs. 1 or more days shared) but did not see any differences.

Thanks, Geo

Geo
  • The Poisson assumption relates to the conditional distribution, not the marginal distribution. The marginal distribution is a mixture of conditional distributions, so it would typically be expected to have variance higher than the mean *even if the Poisson model were exactly correct*. Normally you can't reasonably assess the suitability of the Poisson until *after* you've fitted your covariates. (This point is discussed in dozens of answers on site.) Consider even a single binary predictor, $x$. $Y$ may be Poisson($\lambda_0$) when $x=0$ and Poisson($\lambda_1$) when $x=1$, ... (ctd) – Glen_b Jul 12 '17 at 02:13
  • (ctd) ... with both subsets having sample mean close to sample variance, but when you smoosh all the $y$ values together the variance will be larger (perhaps much larger) than the mean. I just did an example in R where $\lambda_0$ and $\lambda_1$ were about 5.49 and 14.9, and the sample means and variances were close to those values within each subgroup ($x=0$: (5.6, 5.4); $x=1$: (15.0, 15.1)) but not when you ignored the conditioning on $x$ (11.2, 32.6), where the variance was nearly triple the mean. That's the kind of thing you expect to see when the model is *exactly* correct. – Glen_b Jul 12 '17 at 02:24
  • Potential duplicates: https://stats.stackexchange.com/questions/183801/count-data-that-does-not-follow-poisson-distribution and https://stats.stackexchange.com/questions/51367/statistical-model-from-distributions; see also https://stats.stackexchange.com/questions/194350/what-type-of-data-would-have-non-normal-errors, which has relevant comments (many more posts exist which make similar points) – Glen_b Jul 12 '17 at 02:28
  • If you can see that the dispersion is clearly too large conditionally (which you can assess from GLM output), then you may have something to think about along these lines (such as perhaps a zero-inflated model) – Glen_b Jul 12 '17 at 02:40
  • The only way to test the variables you describe would be in the context of a model you built. So the threads on the models for this type of situation are the answer here. This remains a duplicate. – gung - Reinstate Monica Jul 12 '17 at 16:41
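
The marginal-versus-conditional point in the comments above can be checked with a small simulation. For a binary $x$ with $P(x=1)=p$, the law of total variance gives $\mathrm{Var}(Y) = E[\lambda_x] + p(1-p)(\lambda_1-\lambda_0)^2$, which exceeds the mean whenever $\lambda_0 \ne \lambda_1$. The sketch below is not the R code from the comment: it is a stdlib-only Python illustration that reuses the quoted rates (5.49 and 14.9) and assumes equal group sizes, a fixed seed, and hypothetical helper names (`rpois`, `simulate`, `pearson_dispersion`). The dispersion figure is the Pearson chi-square divided by the residual degrees of freedom, for a Poisson model whose fitted value is simply each group's mean.

```python
import math
import random
from statistics import mean, variance


def rpois(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def simulate(n_per_group=5000, lam0=5.49, lam1=14.9, seed=1):
    """Simulate counts for x=0 and x=1 (equal group sizes are an assumption)."""
    rng = random.Random(seed)
    y0 = [rpois(lam0, rng) for _ in range(n_per_group)]
    y1 = [rpois(lam1, rng) for _ in range(n_per_group)]
    return y0, y1


def pearson_dispersion(groups):
    """Pearson chi-square / residual df, using each group's mean as the fitted value."""
    chi2, n = 0.0, 0
    for ys in groups:
        mu = mean(ys)
        chi2 += sum((y - mu) ** 2 / mu for y in ys)
        n += len(ys)
    return chi2 / (n - len(groups))


y0, y1 = simulate()
pooled = y0 + y1

for label, ys in [("x=0", y0), ("x=1", y1), ("pooled", pooled)]:
    print(f"{label:>6}: mean={mean(ys):.2f}  variance={variance(ys):.2f}")

# Conditional on x the dispersion should sit near 1; ignoring x it should not.
print(f"dispersion given x:    {pearson_dispersion([y0, y1]):.2f}")
print(f"dispersion ignoring x: {pearson_dispersion([pooled]):.2f}")
```

With real data, the same comparison would use the residual dispersion from a fitted Poisson GLM with all covariates included, rather than the raw mean and variance of the counts.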

0 Answers