Poisson regression and discretization

Question

I have data which is a natural fit for Poisson regression, but I'm not sure how to correctly discretize the data into bins in a canonical or "best" way. One intuition is that I should use "small" units of time, such that each bin contains either zero or one event, but I'd like to know if there is any mathematical justification for this. Here is a very simple example to illustrate my confusion about binning the event data.

Consider the following data and two possible Poisson arrival models (the numerical values for the models are the expected arrivals, "lambda"):

t = 0, data = 56, model_1 = 54, model_2 = 40

t = 1, data = 40, model_1 = 38, model_2 = 56

t = 2, data = 24, model_1 = 26, model_2 = 10

t = 3, data = 8, model_1 = 10, model_2 = 26

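A few simple calculations show that, writing "model_a > model_b" when model_a gives the higher Poisson likelihood on the data:

1. model_1 > model_2 when everything is aggregated into one interval

2. model_2 > model_1 when aggregated into two intervals

3. model_1 > model_2 when kept as the original four intervals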
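The three comparisons above can be reproduced with a short script. This is only a sketch: it assumes that ">" means "higher Poisson log-likelihood on the data" and that aggregation means summing adjacent time bins (for both the data and the model rates). The helper names `aggregate` and `poisson_loglik` are made up for illustration.

```python
import math

data = [56, 40, 24, 8]
model_1 = [54, 38, 26, 10]
model_2 = [40, 56, 10, 26]

def aggregate(values, n_bins):
    """Sum consecutive values into n_bins bins of equal width."""
    width = len(values) // n_bins
    return [sum(values[i:i + width]) for i in range(0, len(values), width)]

def poisson_loglik(counts, rates):
    """Full Poisson log-likelihood: sum of y*log(lam) - lam - log(y!)."""
    return sum(y * math.log(lam) - lam - math.lgamma(y + 1)
               for y, lam in zip(counts, rates))

results = {}
for n_bins in (1, 2, 4):
    y = aggregate(data, n_bins)
    ll_1 = poisson_loglik(y, aggregate(model_1, n_bins))
    ll_2 = poisson_loglik(y, aggregate(model_2, n_bins))
    results[n_bins] = (ll_1, ll_2)
    winner = "model_1" if ll_1 > ll_2 else "model_2"
    print(f"{n_bins} bin(s): ll_1={ll_1:.3f}  ll_2={ll_2:.3f}  -> {winner}")
```

With one or two bins the two log-likelihoods are very close, so the ranking there turns on small differences.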

I think one could construct examples of arbitrary depth of switching in this fashion.

Questions:

What is the proper way to compare two Poisson regression models on a test set? What is the canonical way to discretize the event process for said comparison? What is the mathematical justification of said "canonical way" (if it exists)? Is the discretization whereby each bin contains either zero or one event in any way canonical or preferred, and if so, what is the mathematical reasoning?

If you have to discretize it, it's not a natural fit for Poisson regression, which assumes integer responses. — jbowman, Dec 28 '19 at 04:25

kjetil b halvorsen · Answer 1 · 2019-12-29T00:03:34.333

Let's look first at a simpler situation without covariables $x_i$. Say we have a Poisson process with constant rate $\lambda$ per hour. If we count the number of events per hour for $n$ hours, we get $Y_1, \dotsc, Y_n$, independently distributed as $\mathcal{Pois}(\lambda=e^\theta)$, where the rate is parametrized as in a Poisson regression with log link function. The likelihood function is then $$ L(\theta)=\prod_i e^{-e^\theta} \frac {(e^\theta)^{y_i}}{y_i!} $$ and taking logarithms and leaving out terms not depending on $\theta$ gives the loglikelihood function $$\ell(\theta)=-ne^\theta + \theta n \bar{y}.$$

Now suppose we count over shorter time intervals, dividing each hour into $p$ equal intervals. That gives $np$ independent counts $Y^*_{11}, \dotsc, Y^*_{1p}, \dotsc, Y^*_{np}$, each distributed $\mathcal{Pois}(\lambda/p=e^{\theta-\log p})$. Clearly $\sum_{j=1}^p y^*_{ij}=y_i$. The likelihood based on the $y^*_{ij}$ is $$ L^*(\theta)=\prod_i \prod_j e^{-e^{\theta-\log p}} \frac{(e^{\theta-\log p})^{y^*_{ij}}}{y^*_{ij}!} $$

Taking logarithms, the exponential terms sum to $-np \cdot e^{\theta}/p = -ne^\theta$ and the remaining terms give $(\theta-\log p)\sum_{i,j} y^*_{ij} = \theta n \bar{y} - n\bar{y}\log p$, plus factorial terms. Dropping all terms not depending on the parameter $\theta$ (see What does "likelihood is only defined up to a multiplicative constant of proportionality" mean in practice?), we are led to the same loglikelihood function as before, $\ell^*(\theta)=\ell(\theta)$. So how you bin the $n$ hours into counting subintervals does not matter.
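A small numerical check of this argument, as a sketch: the sub-hourly counts below are made up for illustration ($n=3$ hours, $p=4$ subintervals each), and the two full log-likelihoods are evaluated on a grid of $\theta$ values. If the argument holds, their difference should be the same constant at every $\theta$.

```python
import math

# Hypothetical sub-hourly counts: n = 3 hours, each split into p = 4 intervals.
fine = [[0, 2, 1, 0], [1, 1, 0, 3], [0, 0, 2, 1]]
p = len(fine[0])
hourly = [sum(row) for row in fine]  # y_i = sum_j y*_ij

def loglik_hourly(theta):
    """Full log-likelihood of the hourly counts, rate exp(theta) per hour."""
    lam = math.exp(theta)
    return sum(-lam + y * theta - math.lgamma(y + 1) for y in hourly)

def loglik_fine(theta):
    """Full log-likelihood of the sub-hourly counts, rate exp(theta)/p each."""
    log_rate = theta - math.log(p)
    return sum(-math.exp(log_rate) + y * log_rate - math.lgamma(y + 1)
               for row in fine for y in row)

thetas = [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
gaps = [loglik_fine(t) - loglik_hourly(t) for t in thetas]

# The gap does not vary with theta (up to floating-point error),
# so both binnings give the same likelihood as a function of theta.
print(max(gaps) - min(gaps))
```

Because the gap is constant in $\theta$, both binnings lead to the same maximum likelihood estimate and the same likelihood ratios between parameter values.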

Now, do this again with covariables, binning in such a way that the covariables are constant over the counting intervals. The result will be the same: the binning does not matter.
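A sketch of that step, under assumptions not spelled out above: a log link with coefficient vector $\beta$, interval $i$ having covariable vector $x_i$ that is constant over the interval, and each interval split into $p$ equal subintervals. Then $Y_i \sim \mathcal{Pois}(e^{x_i^T\beta})$ and $Y^*_{ij} \sim \mathcal{Pois}(e^{x_i^T\beta-\log p})$ with $\sum_{j=1}^p y^*_{ij}=y_i$, so up to terms not depending on $\beta$,

$$ \ell^*(\beta)=\sum_i \sum_{j=1}^p \left[ -\frac{e^{x_i^T\beta}}{p} + y^*_{ij}\,(x_i^T\beta-\log p) \right] = \sum_i \left[ -e^{x_i^T\beta} + y_i\, x_i^T\beta \right] - \log p \sum_i y_i = \ell(\beta) + \text{const}. $$

The $-\log p$ term plays the role of an offset for the interval length, and it only shifts the loglikelihood by a constant.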
