
Suppose I have the following data:

     y x1        x2        x3 ...
  1: 1  0 0.3742991 0.8801190 ...
  2: 8  1 0.5952571 0.5570877 ...
  3: 3  1 0.7512366 3.0847152 ...
  4: 1  1 0.7222142 3.3359335 ...
  5: 9  0 0.4699963 4.9957369 ...
  ---
10T: 6  1 0.3581322 8.4544518 ...

I am trying to estimate $E[y|x]$. Suppose I am using a linear regression model. To make the model fitting fit into memory, I am inclined to build two models: one for $x_1 = 0$ and another for $x_1 = 1$. I am trying to understand the implications and assumptions of this decision as well as possible.

Note: In reality I have lots of data and a multi-class variable that seems suitable for model splitting.
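
To make the setup concrete, below is a minimal sketch of the split I have in mind, on simulated data (the data-generating process and the helper `predict_split` are purely illustrative):

    # Minimal sketch of the split, on simulated data (names are illustrative).
    set.seed(1)
    n  <- 1e4
    x1 <- rbinom(n, 1, 0.4)
    x2 <- runif(n)
    x3 <- runif(n, 0, 10)
    y  <- 1 + 2 * x1 + 3 * x2 - 0.5 * x3 + x1 * x3 + rnorm(n)
    d  <- data.frame(y, x1, x2, x3)

    # One model per value of x1, each fit on a subset that fits into memory.
    fit0 <- lm(y ~ x2 + x3, data = d[d$x1 == 0, ])
    fit1 <- lm(y ~ x2 + x3, data = d[d$x1 == 1, ])

    # Prediction: route each observation to the model matching its x1.
    predict_split <- function(newdata) {
      ifelse(newdata$x1 == 0,
             predict(fit0, newdata),
             predict(fit1, newdata))
    }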

- Brainstorm 1 -

To understand the implications and assumptions, I am inclined to think in terms of (un)directed graphs and causality, but I am not sure whether I am on the right track.

- Brainstorm 2 -

Using iterated expectation: $E[y|x] = E[y|x, x_1 = 0] \, P(x_1 = 0 \mid x) + E[y|x, x_1 = 1] \, P(x_1 = 1 \mid x)$, where $x$ denotes the remaining covariates (note that $P(x_1)$ must be conditioned on $x$ unless $x_1$ is independent of the other covariates). The expectations on the right-hand side are estimated with the two models, and $P(x_1 \mid x)$ is estimated from the data. So the two models are as good as one model? Hmm... I still feel like I am missing... a lot. What about the extra degrees of freedom when using two models instead of one?
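
One way to make the "extra degrees of freedom" concrete: fitting the two separate OLS models is algebraically equivalent to fitting a single model in which $x_1$ interacts with every term, including the intercept. A sketch, continuing the simulated data above:

    # Two separate fits are equivalent to one fully interacted fit:
    # x1 * (x2 + x3) expands to x1 + x2 + x3 + x1:x2 + x1:x3,
    # i.e. a separate intercept and slope for each value of x1.
    fit_full <- lm(y ~ x1 * (x2 + x3), data = d)

    # The fitted values coincide (up to numerical error) with the split models:
    p_full  <- predict(fit_full, d)
    p_split <- predict_split(d)
    all.equal(unname(p_full), unname(p_split))  # TRUE

The point estimates coincide; what differs is that the single model pools the residual variance across both groups (one $\hat{\sigma}^2$ instead of two), which changes the standard errors.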

- Brainstorm 3 -

Data can be seen as hierarchical (cf. Bar's comments below), with the top-layer classes determined by $x_1$. Then one can model the data either with one multilevel regression (MLR) model or with two OLS models. This question has been asked and answered before here. In short, the MLR is more parsimonious and easier to interpret, and its estimators are biased but less variable (i.e. more precise) than the OLS ones.
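
For comparison, a sketch of the multilevel alternative with `lme4::lmer`, where $x_1$ plays the role of the grouping factor (with only two levels this is mostly illustrative; partial pooling pays off with more groups, as with the real multi-class variable):

    library(lme4)

    # x1 as a grouping factor (in the real data this would be the
    # multi-class variable).
    d$g <- factor(d$x1)

    # Random intercept and a random slope for x3 per group: the group-specific
    # coefficients are shrunk toward the overall mean (partial pooling),
    # trading some bias for lower variance, i.e. the MLR/OLS trade-off above.
    fit_mlr <- lmer(y ~ x2 + x3 + (1 + x3 | g), data = d)

    # Group-specific coefficients, comparable to fit0/fit1 above:
    coef(fit_mlr)$g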

  • That sounds like you would be doing a de facto hierarchical model. You need to be careful about the interactions between your variables as well. If scalability is an issue, I'd suggest keeping your data as is and moving to a scalable model. You can use SGD for a linear model; there are many scalable implementations out there. – Bar Oct 26 '17 at 15:23
  • @Bar, stochastic gradient descent and the [`sgd`](https://cran.r-project.org/web/packages/sgd/README.html) package could be an interesting approach. I am quite sure $y$ depends on most classes of $x_1$. Dummy-coding the _real multi-class_ $x_1$ variable would require sparse matrices to keep it in memory. Anyhow, I would still like to explore the model splitting: it is computationally more expensive, but very lightweight on memory and extendable. By hierarchical model, are you alluding to regression trees? Which would mean that I am manually "coding" the top layer in the tree structure? Interesting. – Davor Josipovic Oct 26 '17 at 16:24
  • Not necessarily regression trees, there are hierarchical linear models as well: https://en.wikipedia.org/wiki/Multilevel_model. The `lmer` function in the lme4 R package can be used to train those, but it doesn't scale well: https://www.r-bloggers.com/hierarchical-linear-models-and-lmer/ In your case I'd rather use xgboost though, unless interpretability is important. – Bar Oct 26 '17 at 16:57
  • @Bar, thanks for the useful links. I'll look into it. – Davor Josipovic Oct 26 '17 at 17:05

0 Answers