
I have many measurements for multiple individuals, but I'm not sure how to account for that repeated-measures structure when running a random forest model.

Is there a way to account for the underlying structure of longitudinal data when using a random forest model?

Is this even necessary? It seems to me that it should be.

I would especially like to be able to perform this in R.

theforestecologist
  • Note: I kept this short and simple to see if I could finally attract some responses to a question. If someone desires more info or extension of this question, please comment vs. downvoting. Again, it's not short due to lack of prior research, but because I want people to actually respond to it... :p – theforestecologist Nov 09 '16 at 21:19
  • Can you elaborate on what your goal is with this analysis? – dimitriy Nov 09 '16 at 21:59
  • My goal is to produce a predictive model. The model would predict tree height from tree diameter, given the tree's species and plot location. Each tree is sampled multiple times across decades, so measurements are clustered within individuals. – theforestecologist Nov 09 '16 at 22:41
  • Do you want to predict the latest height? – dimitriy Nov 10 '16 at 00:01
  • @DimitriyV.Masterov I want to predict heights of trees that did not receive a height measurement. So I'll build and test my model with trees that DID receive both diameter and height measurements. I have ~90k samples to train/test a model. – theforestecologist Nov 10 '16 at 01:16
  • Let me try another way. If you have longitudinal measurements for the tree that you want to predict, you can use some of the past ones as predictors/features in your model to predict the current height. If you don't have previous data for the trees you want to predict, then I am not sure how you can use the longitudinal data. – dimitriy Nov 10 '16 at 01:26
  • 1 of 2: Ah. So the trees that I'm predicting height (HT) for never received a HT measure at all in their history in the project. They only had their diameters (D) measured. So I'm using trees that received D and HT measurements to build a model I can plug the no-HT trees' D values into to estimate their heights. – theforestecologist Nov 10 '16 at 01:37
  • 2 of 2: The longitudinal aspect of the data is not important; rather, it's an existing structure in the data I believe I should account for. Specifically, an individual's growth (HT~D relationship) is probably slightly unique to that individual, so it would seem inappropriate to split up any given individual in a splitting process. This seems especially worrisome if some entries of a given tree were used as in-bag samples and the rest of that tree's measurements were used as out-of-bag samples. I'd think that'd be some sort of independence violation or something... – theforestecologist Nov 10 '16 at 01:37
  • Why insist on using random forests with time series at all? There is a deep literature in statistics on multiple imputation in time series, not to mention the multitude of existing methods for time series modeling and prediction. Using RFs ignores that history while, in effect, rebuilding it with a blunter instrument. Just because you have a hammer (RFs), not everything is a nail. – Mike Hunter Jan 13 '17 at 14:30
  • @DJohnson could you provide info on some of the literature you're thinking of? I've seen some myself, but I'm interested in knowing which sources you'd recommend. – theforestecologist Jan 13 '17 at 15:45
  • Ok...the literature on multiple imputation probably starts with Little and Rubin's excellent book, *Statistical Analysis with Missing Data.* There, they develop the now canonical notions of MAR, MCAR, etc. More recently, Paul Allison's highly readable Sage book, *Multiple Imputation for Missing Data*, has a good review of the literature up through the time it was published. Later still, Sorjamaa's *Methodologies for Time Series Prediction and Missing Value Imputation* comes recommended, but I am not familiar with it. – Mike Hunter Jan 13 '17 at 16:02

2 Answers


A previous post discusses including mixed effects for clustered/longitudinal data:

How can I include random effects into a randomForest

Here is a good reference for decision tree implementations in R: http://statistical-research.com/a-brief-tour-of-the-trees-and-forests/

Also, you may want to review these slides http://www2.ims.nus.edu.sg/Programs/014swclass/files/denis.pdf

Jon

You could try the following packages in R:

  • REEMtree: not a random forest but a single-tree model that accounts for differences between individuals over time through so-called random (mixed) effects; several such trees could possibly be combined into an ensemble, or

  • glmertree: similar approaches that can account for segment-wise constant means, which could be adapted to account for individual-specific growth patterns (see here).

Or you could simply include age as a variable in your model, to account for at least that part of each tree's individual characteristics.
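
Whichever model is chosen, the leakage concern raised in the question's comments (some measurements of a tree used for fitting, the rest for evaluation) can be handled at the validation step by holding out whole trees instead of individual rows. Below is a minimal sketch of that idea in Python using only the standard library. The function name `grouped_train_test_split`, the `tree_id` key, and the toy data are hypothetical and exist only for illustration; the same idea should carry over to R by sampling tree IDs rather than rows.

```python
import random
from collections import defaultdict


def grouped_train_test_split(rows, group_key, test_fraction=0.25, seed=0):
    """Split rows so that all measurements of one individual stay together."""
    # Collect the row indices belonging to each individual (e.g. each tree).
    by_group = defaultdict(list)
    for i, row in enumerate(rows):
        by_group[row[group_key]].append(i)

    # Shuffle the individuals, not the rows.
    groups = sorted(by_group)
    random.Random(seed).shuffle(groups)

    # Hold out a fraction of the individuals (at least one).
    n_test = max(1, round(test_fraction * len(groups)))
    test_groups = set(groups[:n_test])

    train = [rows[i] for g in groups if g not in test_groups for i in by_group[g]]
    test = [rows[i] for g in groups if g in test_groups for i in by_group[g]]
    return train, test


# Hypothetical toy data: 4 trees with 3 repeat measurements each.
rows = [
    {"tree_id": t, "diameter": 10.0 + 2 * t + k, "height": 8.0 + t + 0.5 * k}
    for t in range(4)
    for k in range(3)
]

train, test = grouped_train_test_split(rows, "tree_id", test_fraction=0.25, seed=1)
train_ids = {r["tree_id"] for r in train}
test_ids = {r["tree_id"] for r in test}

# No tree contributes measurements to both sides of the split.
print(train_ids.isdisjoint(test_ids))
```

Note that this only governs the held-out evaluation. A forest's built-in out-of-bag error typically still resamples individual rows, so with repeated measures it would likely remain optimistic; the tree-wise held-out error is probably the safer estimate.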

nils
  • Can you put some more flesh on this? If the links go dead, the answer will cease to be helpful. – mdewey Jan 13 '17 at 14:09
  • There are also papers on the packages: REEMtree (http://www.springerlink.com/content/ng44781g47736260/) and glmertree (http://econpapers.repec.org/paper/innwpaper/2015-10.htm) – nils Jan 13 '17 at 14:26