
I am planning to build a model (stage 1) to impute missing data in a variable that will then be used as a predictor in a subsequent model (stage 2). The overall goal of the project is to obtain the best possible stage 2 model for interpolation.

This is a standard framework in temperature and air pollution modelling. The first stage fills the gaps in a satellite predictor (e.g. MODIS land surface temperature), and the second stage links measurements from a ground monitoring network (ambient temperature recorded with a thermometer) to the satellite predictor and additional variables.

I am using a random forest approach for both stages. If I understand correctly, collinearity between predictors is not a big issue for random forests (see "Should one be concerned about multi-collinearity when using non-linear models?").

Since the predictors in stage 1 might also be useful for stage 2, I am wondering whether stage 2 could use both the stage 1 predictors and the stage 1 product together as predictors. It's worth mentioning that I have many more instances in stage 1 than in stage 2. Using a simple model with fewer predictors in stage 1 and then the bulk of the predictors in stage 2 is far less computationally intensive, but in the end you use less information to fit the models than if you used the bulk of the predictors in stage 1 (where you have more instances) or simply repeated the predictors in both stages.
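To make the setup concrete, here is a minimal sketch on synthetic data, assuming scikit-learn's `RandomForestRegressor` and NumPy. The three stage 1 predictors, the synthetic relationships, the 30% missingness rate and all variable names are illustrative assumptions, not the real data or pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# --- Stage 1: many instances (e.g. all grid cells) ---
# Three hypothetical predictors (think elevation, NDVI, day of year).
n1 = 2000
X1 = rng.normal(size=(n1, 3))
lst_true = 2.0 * X1[:, 0] - 1.0 * X1[:, 1] + rng.normal(scale=0.3, size=n1)

# Assume roughly 30% of the satellite values are missing (e.g. cloud cover).
observed = rng.random(n1) > 0.3
lst_obs = np.where(observed, lst_true, np.nan)

# Fit stage 1 on the observed cells only, then fill the gaps.
stage1 = RandomForestRegressor(n_estimators=100, random_state=0)
stage1.fit(X1[observed], lst_obs[observed])

lst_filled = lst_obs.copy()
lst_filled[~observed] = stage1.predict(X1[~observed])

# --- Stage 2: far fewer instances (cells with a ground station) ---
n2 = 200
station_idx = rng.choice(n1, size=n2, replace=False)
air_temp = (
    0.8 * lst_true[station_idx]
    + 0.5 * X1[station_idx, 2]
    + rng.normal(scale=0.3, size=n2)
)

# Stage 2 features: the stage 1 predictors repeated, plus the stage 1 product.
X2 = np.column_stack([X1[station_idx], lst_filled[station_idx]])
stage2 = RandomForestRegressor(n_estimators=100, random_state=0)
stage2.fit(X2, air_temp)

# Interpolate air temperature over every grid cell.
pred = stage2.predict(np.column_stack([X1, lst_filled]))
```

Whether repeating the predictors actually helps could be checked by refitting stage 2 with only `lst_filled` as a feature and comparing cross-validated error between the two variants.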

The stage 1 product could be an intermediate endpoint, or a mediator between the stage 1 predictors and the stage 2 product, but not all of the effect of the stage 1 predictors has to be mediated. Since I am building prediction models, I should be using all available useful information, and since random forests are robust to collinearity, this shouldn't be an issue.

I am working with big data (more than 1 TB), so I am not sure whether I can (or want to) run a structural equation model. I am just wondering whether repeating the same predictors in both stages is a valid approach.
