Linear model with binary variable VS create two linear models

Question

Let's say, there is a variable sex in the data set. I could either:

Build one model on the whole data and encode the sex into 0:female 1:male, or:
Build two models. Split the data into two sets and use a separate model for sex=female and sex=male.

Is either approach preferable?

If so, what about a variable with three categories?

score 1 · Answer 1 · answered Jun 14 '19 at 05:10

For various reasons, unless there are reasons prohibiting you from doing so you should always use option (1), a single model. Consider for example:

Diagnostics of residuals are easier to perform with larger $n$ and distributional assumptions might in fact be closer to the empirical conditional distribution when $n$ grows;
If your groups share a common slope of another variable, the number of degrees of freedom to estimate that slope is now larger, and your estimate will be better;
If your groups do not share a common slope, you can estimate an interaction with the categorical variable;
If (2) is the case and you are using your model for confirmation, you have now not only split your total sample size, reducing your power, but also inflated the type I error rate by doubling the number of tests performed.

For categorical variables with more than 2 categories, this is even more important, as you would otherwise be splitting your data set into smaller and smaller subsets.

score 0 · Answer 2 · answered Jun 14 '19 at 06:29

I will try to explain it visually

First Label Dataset ( Label = Orange )

WE can see the decision boundary learned by our model

Second Label Dataset ( Label = Blue)
Test Dataset ( Label = Blue,Orange, Model= Orange )
Inference

We can see that if we use the model learned by the First Label Dataset and test it on actual data ( in this case full data ), there are so many Blue data points that will be incorrectly labeled as Orange. So generally it is not a wise idea to create separate models However, there are exceptions when we would like to evaluate the probability of a new point to lie in either of the two labels. In this case we allow models to learn the patterns of data representing a single label

Linear model with binary variable VS create two linear models

2 Answers2

Linked