3

I have a data set in which the response (dependent) variable is a non-negative integer and exhibits over-dispersion, i.e. its variance is greater than its mean. The figure is shown below.

enter image description here

Methods such as Poisson and negative binomial (NB) regression exist to model this kind of data. However, I have been searching for machine learning or regression tree methods for panel count data that could serve as alternatives to Poisson or NB regression. There are regression tree methods for panel data, such as REEMtree: Regression trees with random effects for longitudinal (panel) data by Sela and Simonoff. But this procedure combines a linear mixed-effects model with non-parametric tree building based on the CART methodology of Breiman et al., and I am not sure whether this is the right method for modeling count data.

My query is, does there exist any machine learning or regression tree procedure to model count/over-dispersed data?
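To make the over-dispersion condition concrete, here is a minimal sketch that computes the variance-to-mean ratio of a count sample. The function name `dispersion_index` and the counts are made up for illustration and are not the data from the figure; a ratio well above 1 is what "variance greater than the mean" amounts to.

```python
import statistics


def dispersion_index(counts):
    """Return the sample variance divided by the sample mean.

    A Poisson variable has a ratio close to 1; values well above 1
    indicate over-dispersion.
    """
    mean = statistics.mean(counts)
    if mean == 0:
        raise ValueError("dispersion index is undefined when the mean is 0")
    return statistics.variance(counts) / mean


# Hypothetical counts, not the data shown in the figure.
example_counts = [0, 0, 1, 2, 3, 5, 8, 21, 40]
ratio = dispersion_index(example_counts)
print(f"variance / mean = {ratio:.2f}")
```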

user3571389
  • What is your goal with this model? – dimitriy Nov 15 '16 at 18:27
  • So this is my dependent variable, and I have a set of independent variables and plan to use Poisson or negative binomial regression. The dependent variable is the number of days a discharge was delayed. I have used a machine learning technique, namely regression trees, in the past. A good thing about regression trees is that they can show the interactions among the independent variables. They can also be used as a very powerful exploratory analysis tool. – user3571389 Nov 15 '16 at 18:45
  • But what do you want to accomplish? Predict the next count outcome $y_{t+1}$ given past outcomes $y_t, y_{t-1}, \dots, y_0$ and the Xs, predict *all* outcomes given just the Xs, figure out the DGP, etc.? – dimitriy Nov 15 '16 at 18:49
  • Using NB regression, I want to find which explanatory variables are significant and thereby identify the variables of interest. The reason I want to use ML techniques such as trees is simply for exploratory purposes, which is what trees are meant for. I do not wish to use trees for forecasting, predictive accuracy, or cross-validation. – user3571389 Nov 15 '16 at 19:05

1 Answer

5

The glmertree package on R-Forge (https://R-Forge.R-project.org/R/?group_id=261) extends the REEM tree approach in two directions: First, the response variable can come from the exponential family (including Gaussian and Poisson distributions among others). Second, the tree can be employed to not only learn a segment-wise constant mean but also include segment-specific regression slopes. Our working paper introducing the method is available from RePEc at http://EconPapers.RePEc.org/RePEc:inn:wpaper:2015-10
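To illustrate what a "segment-wise constant mean" for a Poisson response looks like, here is a toy sketch that picks a single split by minimizing the Poisson deviance of a constant-mean fit in each segment. This is not the glmertree algorithm (it has no random effects and does an exhaustive deviance search on one variable); the function names and data are hypothetical.

```python
import math


def poisson_deviance(counts):
    """Poisson deviance of counts around their own mean (constant-mean fit)."""
    if not counts:
        return 0.0
    mu = sum(counts) / len(counts)
    if mu == 0:
        return 0.0  # all counts are zero, so the fit is exact
    deviance = 0.0
    for y in counts:
        # The term y * log(y / mu) is taken as 0 when y == 0.
        log_term = y * math.log(y / mu) if y > 0 else 0.0
        deviance += 2.0 * (log_term - (y - mu))
    return deviance


def best_split(x, y):
    """Return (threshold, deviance) for the best split of the form x <= threshold.

    The threshold is None if no split improves on the unsplit deviance.
    """
    best_threshold, best_deviance = None, poisson_deviance(y)
    for t in sorted(set(x))[:-1]:  # the largest x would leave the right side empty
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        total = poisson_deviance(left) + poisson_deviance(right)
        if total < best_deviance:
            best_threshold, best_deviance = t, total
    return best_threshold, best_deviance


# Hypothetical data: low counts for x <= 4, high counts afterwards.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 1, 0, 2, 9, 11, 8, 12]
threshold, deviance = best_split(x, y)
print(threshold, round(deviance, 3))
```

On this toy data the chosen threshold is 4, which separates the low-count segment from the high-count one.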

Having said that, the spread in your data is so large that I strongly doubt you will need a count data regression. I would start out by using log(y) as the response (or log(y + 0.5) if there are zeros) and try to build a flexible model for a continuous response.
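The transformation suggested above can be sketched as follows. The helper name `log_response` and the example values are hypothetical; the 0.5 offset is applied only when the data contain zeros, as described.

```python
import math


def log_response(y, offset=0.5):
    """Return log(y), or log(y + offset) if any value in y is zero."""
    if any(v < 0 for v in y):
        raise ValueError("counts must be non-negative")
    shift = offset if any(v == 0 for v in y) else 0.0
    return [math.log(v + shift) for v in y]


# Hypothetical counts spanning several orders of magnitude.
counts = [0, 3, 40, 500, 12000]
transformed = log_response(counts)
print([round(v, 2) for v in transformed])
```

The transformed values can then be used as a continuous response in any flexible regression method, including an ordinary regression tree.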

Achim Zeileis