
Suppose the data set contains a variable sex. I could either:

  1. Build one model on the whole data set and encode sex as 0 = female, 1 = male, or:
  2. Build two models: split the data into two sets and fit a separate model for sex = female and for sex = male.

Is either approach preferable?

If so, what about a variable with three categories?

Frans Rodenburg
Zichu Lee

2 Answers


Unless there are reasons prohibiting you from doing so, you should always use option (1), a single model, for several reasons. Consider, for example:

  1. Diagnostics of residuals are easier to perform with larger $n$ and distributional assumptions might in fact be closer to the empirical conditional distribution when $n$ grows;
  2. If your groups share a common slope of another variable, the number of degrees of freedom to estimate that slope is now larger, and your estimate will be better;
  3. If your groups do not share a common slope, you can estimate an interaction with the categorical variable;
  4. If (2) is the case and you are using your model for confirmation, fitting separate models not only splits your total sample size, reducing your power, but also inflates the type I error rate by doubling the number of tests performed.

For categorical variables with more than 2 categories, this is even more important, as you would otherwise be splitting your data set into smaller and smaller subsets.
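
To make points (2) and (3) concrete, here is a minimal sketch on simulated data. The data, the coefficients, and names such as `ols` are made up for illustration, and only NumPy is assumed. It fits a single model with a common slope, a single model with a sex-by-x interaction, and two separate models. For ordinary least squares, the interaction model reproduces the separate fits exactly, so splitting the data gains nothing in terms of the fitted lines:

```python
import numpy as np

# Simulated data (illustrative assumption): y depends on sex and on x
rng = np.random.default_rng(0)
n = 200
sex = rng.integers(0, 2, size=n)   # 0 = female, 1 = male
x = rng.normal(size=n)
y = 1.0 + 0.5 * sex + 2.0 * x + rng.normal(scale=1.0, size=n)


def ols(X, y):
    """Least-squares coefficients for the design matrix X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


ones = np.ones(n)

# Option (1), common slope: intercept, sex offset, one slope for x
beta_common = ols(np.column_stack([ones, sex, x]), y)

# Option (1), group-specific slopes: add the sex * x interaction
beta_inter = ols(np.column_stack([ones, sex, x, sex * x]), y)

# Option (2): split the data and fit one model per group
f, m = sex == 0, sex == 1
beta_f = ols(np.column_stack([ones[f], x[f]]), y[f])
beta_m = ols(np.column_stack([ones[m], x[m]]), y[m])

# Intercept and slope per group, recovered from the interaction model
female_from_inter = beta_inter[[0, 2]]
male_from_inter = beta_inter[[0, 2]] + beta_inter[[1, 3]]

print("common-slope model :", beta_common)
print("female, separate   :", beta_f, "| from interaction:", female_from_inter)
print("male, separate     :", beta_m, "| from interaction:", male_from_inter)
```

The common-slope model estimates the slope of `x` from all `n` observations, which is the advantage described in point (2). For a variable with three categories, the same idea applies with two dummy columns (and, if needed, their interactions with `x`) instead of one.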

Frans Rodenburg

I will try to explain it visually.

  • First Label Dataset (Label = Orange): [image] We can see the decision boundary learned by our model.

  • Second Label Dataset (Label = Blue): [image]

  • Test Dataset (Labels = Blue, Orange; Model = Orange): [image]

  • Inference

We can see that if we use the model learned from the First Label Dataset and test it on the actual data (in this case, the full data), many Blue data points will be incorrectly labeled as Orange. So it is generally not a wise idea to create separate models. However, there are exceptions, such as when we would like to evaluate the probability that a new point belongs to either of the two labels. In that case, we let each model learn the patterns of the data representing a single label.
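
One possible reading of that exception is a class-conditional (generative) model: fit one density per label, then combine the densities with Bayes' rule to get the probability of each label for a new point. The sketch below is only an illustration under that assumption. The one-dimensional Gaussian data and the helper names (`fit_gaussian`, `gaussian_pdf`, `posterior_blue`) are hypothetical and not taken from the images above:

```python
import numpy as np

# Simulated one-dimensional data for the two labels (illustrative assumption)
rng = np.random.default_rng(1)
orange = rng.normal(loc=-2.0, scale=1.0, size=300)
blue = rng.normal(loc=2.0, scale=1.0, size=300)


def fit_gaussian(sample):
    """Fit a Gaussian to the points of a single label: (mean, sd)."""
    return sample.mean(), sample.std(ddof=1)


def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))


# One model per label, each fitted only on that label's data
mu_o, sd_o = fit_gaussian(orange)
mu_b, sd_b = fit_gaussian(blue)

# Prior label probabilities, taken from the label frequencies
prior_o = orange.size / (orange.size + blue.size)
prior_b = 1.0 - prior_o


def posterior_blue(x):
    """P(label = Blue | x) via Bayes' rule over the two per-label models."""
    num = prior_b * gaussian_pdf(x, mu_b, sd_b)
    den = num + prior_o * gaussian_pdf(x, mu_o, sd_o)
    return num / den


for point in (-2.0, 0.0, 2.0):
    print(point, posterior_blue(point))
```

Note that the per-label models are only useful together: each one is never asked to classify on its own, which is what goes wrong in the Orange-only example above.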

Anant Gupta