Suppose the following data:
y x1 x2 x3 ...
1: 1 0 0.3742991 0.8801190 ...
2: 8 1 0.5952571 0.5570877 ...
3: 3 1 0.7512366 3.0847152 ...
4: 1 1 0.7222142 3.3359335 ...
5: 9 0 0.4699963 4.9957369 ...
---
10T: 6 1 0.3581322 8.4544518 ...
I am trying to estimate $E[y|x]$. Suppose I am using a linear regression model. To make the model fitting fit into memory, I am inclined to build 2 models: one for $x_1 = 0$ and another for $x_1 = 1$. I am trying to understand the implications and assumptions of this decision as well as possible.
Note: In reality I have lots of data and a multi-class variable that seems suitable for model splitting.
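A minimal sketch of the split-by-$x_1$ idea, assuming (hypothetically) a single continuous predictor $x_2$ and a handful of made-up rows shaped like the data above; each subset gets its own simple OLS fit:

```python
# Sketch: fit one simple OLS line per value of x1 (hypothetical data).
from statistics import mean

def ols_fit(xs, ys):
    """Least-squares line y = a + b*x (simple regression formulas)."""
    mx, my = mean(xs), mean(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

# Hypothetical rows: (y, x1, x2)
rows = [(1, 0, 0.37), (8, 1, 0.60), (3, 1, 0.75),
        (1, 1, 0.72), (9, 0, 0.47), (6, 1, 0.36)]

# One model per x1 group -- each fits comfortably in memory on its own.
models = {}
for x1 in (0, 1):
    sub = [(x2, y) for y, g, x2 in rows if g == x1]
    xs, ys = zip(*sub)
    models[x1] = ols_fit(xs, ys)

def predict(x1, x2):
    a, b = models[x1]
    return a + b * x2
```

At prediction time the value of $x_1$ simply routes the query to the corresponding model.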
- Brainstorm 1 -
To understand the implications and assumptions I am inclined to think in terms of (un)directed graphs and causality, but I am not sure whether I am on the right track.
- Brainstorm 2 -
Using iterated expectation: $E[y|x] = E[y|x,x_1=0]\,P(x_1=0 \mid x) + E[y|x,x_1=1]\,P(x_1=1 \mid x)$, where $x$ denotes the covariates other than $x_1$. The expected values on the right-hand side are estimated with the 2 models, and $P(x_1 \mid x)$ is estimated from the data (reducing to the marginal $P(x_1)$ if $x_1$ is independent of the other covariates). So the two models are as good as one model? Hmm... I still feel I am missing... a lot. What about the extra degrees of freedom when using two models instead of one?
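The combination in Brainstorm 2 can be sketched as follows, assuming the two per-group models are simple lines $(a_0, b_0)$ and $(a_1, b_1)$ and the weights are the marginal frequencies $P(x_1)$ as suggested above (all numbers hypothetical):

```python
# Sketch of the iterated-expectation combination: average the two
# per-group predictions with weights P(x1 = 0) and P(x1 = 1),
# estimated here as sample frequencies (hypothetical: 4 of 6 rows have x1 = 1).
p1 = 4 / 6
p0 = 1 - p1

def e_y_given_x2(x2, m0, m1):
    """E[y | x2] marginalized over x1.

    m0, m1 are (intercept, slope) pairs for the x1 = 0 and x1 = 1 models.
    Uses marginal weights, i.e. assumes x1 independent of x2.
    """
    a0, b0 = m0
    a1, b1 = m1
    return p0 * (a0 + b0 * x2) + p1 * (a1 + b1 * x2)
```

This is only needed when $x_1$ is unknown at prediction time; when $x_1$ is observed, the corresponding single model is used directly.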
- Brainstorm 3 -
Data can be seen as hierarchical (cf. comments Bar) with the top-level classes determined by $x_1$. One can then model the data using a single multilevel regression (MLR) model or using two separate OLS models. This question has been asked and answered before here. In short, the MLR is more parsimonious, easier to interpret, and has biased but less variable (i.e. more precise) estimators than the separate OLS fits.
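On the degrees-of-freedom question: two separate OLS fits are algebraically the same function as a single model with full $x_1$ interactions (no pooling). A minimal check of that identity, with hypothetical per-group coefficients:

```python
# Two per-group lines (a0 + b0*x2 for x1 = 0, a1 + b1*x2 for x1 = 1)
# equal the single fully interacted model
#   y = a + b*x2 + c*x1 + d*(x1*x2)
# with a = a0, b = b0, c = a1 - a0, d = b1 - b0.
a0, b0 = 2.0, 0.5    # x1 = 0 group (assumed coefficients)
a1, b1 = -1.0, 3.0   # x1 = 1 group (assumed coefficients)

a, b, c, d = a0, b0, a1 - a0, b1 - b0

def split_pred(x1, x2):
    return (a0 + b0 * x2) if x1 == 0 else (a1 + b1 * x2)

def interacted_pred(x1, x2):
    return a + b * x2 + c * x1 + d * x1 * x2

# The two parameterizations give identical predictions everywhere.
for x1 in (0, 1):
    for x2 in (0.0, 0.4, 2.5):
        assert abs(split_pred(x1, x2) - interacted_pred(x1, x2)) < 1e-12
```

So the "extra" degrees of freedom of two models correspond exactly to the interaction terms; the MLR sits between this unpooled extreme and a single fully pooled model.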