I want to run gradient boosting regression on a dataset whose rows are not independent. Specifically, the rows are clustered, and you could consider the clustering variable to be a random effect.

  1. What is the effect of ignoring the random effect, i.e. simply running the classifier on the target and the other features?
  2. What open source packages are available that can account for clustered data for gradient boosting?
  3. Any caveats to using the procedures from 2?

Edit: I saw How can I include random effects into a randomForest. I will now restrict my question to GBMs.

    So basically time series / repeated measures, this is not going to fly with RF, which does not take into account the nature of the data when bootstrapping data randomly for each tree. Choose a different model. – user2974951 Jul 19 '19 at 08:16
  • Is there a specific reason for not using a mixed effects model ? – Robert Long Jul 19 '19 at 09:16
  • @RobertLong I suppose predictive power of tree ensembles is usually better than linear models. Plus, another property of the tree models that I need for my particular problem is that they don't interpolate between points, which linear models do. – ved Jul 19 '19 at 16:26
  • Why do you think interpolation is impossible for tree-based models? – mkt Jul 19 '19 at 16:33
  • @mkt tree models are locally constant in the "box" defined by cut points on the features. Two "boxes" that are next to each other can have totally different values. – ved Jul 19 '19 at 16:42
  • There is no *linear* interpolation, but the target variable's value in an intermediate box is predicted - it is just assigned the value of the adjacent box. I don't think we disagree, though we are characterizing them differently. – mkt Jul 19 '19 at 17:06
  • @ved That's a very dangerous supposition. How many predictor variables do you have? How many observations per cluster, and how many clusters? – Robert Long Jul 19 '19 at 18:53
  • @RobertLong 3 predictor variables. Around 500 obs per cluster. – ved Jul 22 '19 at 15:12
  • @mkt I don't quite follow -- how is the adjacent box's value used for a box? – ved Jul 22 '19 at 15:14
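The point raised in the comments -- that tree models are locally constant and do not linearly interpolate between training points -- can be seen in a minimal sketch (illustrative only, not from the thread; uses scikit-learn's `DecisionTreeRegressor`):

```python
# A regression tree is piecewise constant: between two training x-values it
# predicts the value of whichever "box" (leaf) the query point falls into.
# There is no linear blend of the two neighboring values.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[0.0], [1.0]])
y = np.array([0.0, 10.0])
tree = DecisionTreeRegressor().fit(X, y)

grid = np.linspace(0, 1, 5).reshape(-1, 1)
preds = tree.predict(grid)
print(preds)  # a step function: only the values 0 and 10 appear, never 5
```

Every grid point between the two training points is assigned one of the two leaf values; an intermediate value such as 5 never occurs, which is the sense in which the two commenters' characterizations agree.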

1 Answer
Gradient boosting with random effects was developed by Patrick Miller and described in Miller, McArtor, & Lubke (2017). Unfortunately that reference is just an abstract of a poster, but there appears to be a related arXiv paper here. It refers to an R package, metboost, which doesn't exist. Instead, there is the package mvtboost, written by Miller. The CRAN version of mvtboost doesn't have this functionality, but the GitHub version does, through the function metb. This only works for continuous outcomes, and there doesn't seem to be any plan to add binary outcomes as a feature.
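On question 1 (what happens if you simply ignore the clustering), a small simulation can illustrate the trade-off. The sketch below is illustrative only -- it uses scikit-learn's `GradientBoostingRegressor` rather than metb, and the data-generating process (random intercepts per cluster) is my own assumption. It compares a model that ignores the cluster to one that receives the cluster ID as an extra feature:

```python
# Simulate clustered data with a random intercept per cluster, then compare
# gradient boosting that ignores the clustering to gradient boosting that is
# given the cluster ID as a feature. Test rows come from clusters also seen
# in training, so the ID-aware model can exploit the cluster intercepts.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_clusters, n_per = 50, 100
cluster = np.repeat(np.arange(n_clusters), n_per)
u = rng.normal(0.0, 2.0, n_clusters)                  # random intercepts
X = rng.normal(size=(n_clusters * n_per, 3))
y = X[:, 0] + np.sin(X[:, 1]) + u[cluster] + rng.normal(0.0, 0.5, len(cluster))

X_id = np.column_stack([X, cluster])                  # cluster ID appended

Xtr, Xte, ytr, yte = train_test_split(X_id, y, random_state=0)
gbm_ignore = GradientBoostingRegressor(random_state=0).fit(Xtr[:, :3], ytr)
gbm_with_id = GradientBoostingRegressor(random_state=0).fit(Xtr, ytr)

print("R^2 ignoring clusters:", gbm_ignore.score(Xte[:, :3], yte))
print("R^2 with cluster ID:  ", gbm_with_id.score(Xte, yte))
```

When the cluster variance is large relative to the residual noise, the model that ignores the clustering leaves the between-cluster variance unexplained. Note that treating the cluster ID as an ordinary feature is not the same as a random-effects treatment: it provides no shrinkage and no way to predict for unseen clusters, which is part of what packages like metb aim to address.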
