Using empirical Bayesian estimation (Gamma-Poisson) to analyze high arrival counts (n ~= 5000)

Question

Here's a problem I'm currently working on, as well as the empirical Bayesian approach I'm using. I'd like to make sure my approach is grounded in solid statistical theory.

I have a set of entities $e=e_1,e_2,...,e_N$, as well as arrival counts at different time periods $t$ for each entity $e_i$, denoted by $y_{e_i, t}$. Here is a histogram of these arrival counts for all entities across all time periods in my data set.

The pink line is $x=5000$. Note that there are no entries in the first bin -- this is because my dataset omits all entities whose arrival counts fall below a certain threshold (for simplicity let's say 3000 arrivals). The median for this data also falls around 5000.

I am interested in identifying entities in this dataset whose recent arrival counts are accelerating rapidly within a recent time window. Here is an example of an entity whose arrival counts have accelerated, peaked, and subsequently dropped off.

For this chart, I'd want to highlight this entity around the second x-tick, where its counts increase from 5000 to about 15000.

I believe empirical Bayesian estimation using a Gamma-Poisson model will work well for this problem. It's best if I walk through my algorithm:

For each entity $e_i$, use $Gamma(k=5000, \theta=1)$ as a prior distribution for arrival counts within a set of time periods $T = t_a, ... t_b$. Remember that 5000 is the empirical median for all arrival counts.
Observe arrival counts $y_{e_i, t}$ for $t \in T$. I propose that $y_{e_i, t} \sim Poisson(\lambda_{e_i, T})$; that is, arrival counts are generated by a Poisson that is stationary over $T$.
By conjugacy we can obtain the posterior of arrival counts for $e_i$ over $T$. It is $$p(\lambda_{e_i, T}~|~y_{e_i, t_a},...y_{e_i, t_b}) \sim Gamma(k + \sum_{t \in T} y_{e_i, t}, \frac{\theta}{|T|\theta + 1})$$ where $|T|$ is the number of time periods.
I then observe an arrival count $y_{e_i,t_{b+1}}$. This is the next arrival count for entity ${e_i}$ after the time period $T$.
Compute a z-score for this arrival count using the posterior distribution. Call this $z_{e_i,T}$.

We can then sort entities according to their z-scores. Entities with the highest z-scores have deviated the most from their estimated posterior; I argue these entities have the fastest-accelerating arrival counts.

Here's a list of questions I'd like to answer:

First, and most importantly: have I made any glaring mistakes?
Should I model arrival counts using a different distribution? Would you use $Gamma(5000, 1)$ as a prior?
Is there an easier approach that incorporates recent observations and uses prior knowledge about arrival counts?

score 3 · Answer 1 · answered Aug 25 '16 at 06:53

As a first observation, your z-scores are not going to give you what you want. A large z-score tells you that the new arrival count is anomalously large, not that the arrival curve is accelerating.

Secondly, I would strongly advise you start with a simpler approach. There are a few possibilities for this 'simpler approach', but here's the one I'd recommend:

Smooth your arrival curve using EWMA with some small half-life, then take the diff. This smoothed-diff is a one-sided way to approximate the derivative at some timescale.
Smooth-diff again to get the second derivative.
Rank the second derivative values cross-sectionally.

Simple approaches like these are great because they take a few minutes to implement and they get you 70% of the performance of a more complex model. You might not end up using it in production, but a) it gives you something to fall back on if all else fails and b) it allows you to build out the rest of your data-processing pipeline.

Thanks for your response! I appreciate it. We've had our own "70% performant" approach for a while now, and the entire data pipeline has been built for months at this point. I'm looking for an evaluation of the statistical soundness of my approach. If it doesn't measure arrival acceleration, I'd like to edit it to better model that. — achompas, Aug 25 '16 at 16:00

Using empirical Bayesian estimation (Gamma-Poisson) to analyze high arrival counts (n ~= 5000)

1 Answers1