
I am trying to move from a current parametric Poisson regression to a non-parametric count regression and would appreciate views on the best way to do this.

Current state of analysis

I have a time series of 8,000 items × 4 years of data = 32,000 data points. Let's call these $\{Y_{i,t}\}$. Each $Y_{i,t}$ is a count, so I have modelled the data with Poisson regression.

I have one key regressor of interest (let's call it $K$) and about 50 other variables (let's call these $\mathbf{x}$).

Currently, I model $\lambda_{i,t}$, the log of the expected count, as $\lambda_{i,t} = \log \mathbb{E}[Y_{i,t}] = C_0 + C_K K_{i,t} + \mathbf{C_x}\cdot\mathbf{x}_{i,t} + \mathbf{C_{Kx}} \cdot (K_{i,t}\,\mathbf{x}_{i,t})$, where $C_0$ is an intercept.

That is, a linear term in my key variable, linear terms in my other variables, and linear interaction terms, with some set of coefficients that I am trying to estimate.
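In R, this is essentially what a call like the following fits (a sketch: `df`, `y`, `K`, `x1`, and `x2` are placeholder names, and only two of the ~50 regressors are shown):

```r
## Poisson GLM with log link: linear main effects plus K:x interactions.
## K * (x1 + x2) expands to K + x1 + x2 + K:x1 + K:x2.
fit <- glm(y ~ K * (x1 + x2),
           family = poisson(link = "log"),
           data = df)
```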

At present, I assume that:

  • The relationship with the regressors is a linear main effect plus a linear interaction, and
  • The coefficients are constant over time

I require that the model be smooth because I am ultimately interested in $\partial \lambda /\partial K$.
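Whichever model I end up with, I expect to recover this derivative by finite-differencing the fitted linear predictor, roughly as follows (a sketch: `fit` stands for any fitted model whose `predict` method returns the log-mean via `type = "link"`, and the step size `eps` is arbitrary):

```r
## Central finite difference of lambda = log E[Y] with respect to K,
## evaluated at each row of the data frame `df`.
dlambda_dK <- function(fit, df, eps = 1e-4) {
  up   <- transform(df, K = K + eps)
  down <- transform(df, K = K - eps)
  (predict(fit, newdata = up,   type = "link") -
     predict(fit, newdata = down, type = "link")) / (2 * eps)
}
```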

What I am trying to do

I want to modify the analysis in order to:

  • Allow for non-linearities (both in the main effects and the interactions), and
  • Allow for potential time variation in the coefficients

What I think my options are

As far as I can see, there are a few ways to go here. These include:

  1. Stay within a parametric framework, but add polynomial terms to the main effects and interactions (e.g. $x^2$ terms) and allow different constants for different time periods
  2. Stay within a parametric framework, but add polynomial terms to the main effects and interactions (e.g. $x^2$ terms) and include time as a variable in the regression
  3. Add polynomial terms (as in the first option) but use a Bayesian hierarchical model to link the coefficients over time (e.g. through a random walk); the extra structure imposed on the model would help to regularise it.
  4. Use splines to model the main effects plus interactions. Potentially I could stay within the safety of R's glm function (e.g. via the ns function from the splines package); see the first sketch after this list.
  5. A variant on 4: use a generalised additive model package in R (e.g. mgcv, or the original gam package) to construct a smooth function of my variables; see the second sketch after this list.
  6. Do something more sophisticated and use e.g. a neural network to model $\lambda$. This would potentially increase the number of coefficients to be estimated significantly, but (as I understand it) only the terms $K_{i,t}$, $\mathbf{x}_{i,t}$ and $K_{i,t}\mathbf{x}_{i,t}$ would need to be included as inputs, with the neural network figuring out the non-linearities. This would come with the downside of having to determine the right network structure, e.g. through cross-validation.
  7. Do something even more sophisticated and use a Gaussian process to model $\lambda$. This has the advantage that the number of parameters would be small (assuming use of just a squared-exponential kernel), but the model would be extremely computationally demanding given $N = 32{,}000$ and potentially 50 other variables. This would require one of the various GP approximation methods, and may be infeasible.
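To make options 4 and 5 concrete, here is roughly what I have in mind. First, option 4 with natural cubic splines inside glm (a sketch: only two regressors are shown, and the basis dimensions `df = 4` are arbitrary placeholders):

```r
library(splines)

## Option 4: spline main effects, with interactions formed by crossing
## the spline basis of K with the spline bases of each x.
fit4 <- glm(y ~ ns(K, df = 4) * (ns(x1, df = 4) + ns(x2, df = 4)),
            family = poisson(link = "log"),
            data = df)
```

And option 5 using mgcv: smooth main effects, pure tensor-product interactions via ti(), and a smooth in time to allow for time variation. Again a sketch with placeholder names and only two regressors:

```r
library(mgcv)

## Option 5: s() gives smooth main effects, ti() gives the pure
## K-by-x interaction (excluding the main effects, so terms are not
## double-counted), and s(t) allows a smooth trend over time
## (k = 4 because there are only 4 yearly time points).
fit5 <- gam(y ~ s(K) + s(x1) + s(x2) +
              ti(K, x1) + ti(K, x2) +
              s(t, k = 4),
            family = poisson(link = "log"),
            data = df,
            method = "REML")
```

One attraction of option 5 for my purposes is that the penalised fit is smooth by construction, so $\partial \lambda / \partial K$ is well defined everywhere.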

My concern with staying parametric is the potential combinatorial explosion of parameters: e.g. there will be terms in $K$, $K\cdot x_i$, $K^2$, $K^2 \cdot x_i$, $K \cdot x_i^2$, $K^2 \cdot x_i^2$, etc. So non-parametric methods are appealing. The question is which is the right one, given that: a) I have time series data, and b) I have ~32,000 data points to fit based on ~50 regressors.

Grateful for any perspectives on what the best approach might be. Is there anything here that I have missed that is worth exploring?

I am currently thinking that #5 (generalised additive models) might be the way to go.

