
I am working on a fleet management system and need to predict user requests; I use a linear model for this:

$r = \beta X$

I am thinking about dividing the samples in $X$ into groups and fitting a dedicated model for each group.

For example, suppose $X$ contains two features, $x_{7}$ and $x_{8}$, both Gaussian with mean 0. I am thinking about dividing the training samples into two groups: samples with $x_{7} x_{8} > 0$ belong to one group, and samples with $x_{7} x_{8} < 0$ belong to the other.
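Concretely, a minimal sketch of the split-and-fit scheme I have in mind (synthetic stand-in data; the dimensions and column indices are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the fleet data: zero-mean Gaussian features;
# columns 6 and 7 (0-indexed) play the role of x7 and x8.
n, p = 1000, 10
X = rng.standard_normal((n, p))
r = X @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n)

# Split the training samples on the sign of x7 * x8.
mask = X[:, 6] * X[:, 7] > 0

model_pos = LinearRegression().fit(X[mask], r[mask])
model_neg = LinearRegression().fit(X[~mask], r[~mask])

def predict(X_new):
    """Route each sample to the model matching its x7*x8 sign."""
    m = X_new[:, 6] * X_new[:, 7] > 0
    out = np.empty(len(X_new))
    out[m] = model_pos.predict(X_new[m])
    out[~m] = model_neg.predict(X_new[~m])
    return out
```

At prediction time the same sign rule decides which model handles each new sample, so the split criterion must be computable from the features alone.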

From a business-logic perspective, such a division makes sense, and in practice the two separate models show better out-of-sample (OOS) performance. However, I am not sure whether this makes sense from a statistical perspective, or what the consequences of introducing such a dependency are.

Can anyone give any insights here?

Thanks a lot!

user152503
  • It appears as if you are attempting to have two models, each with their own set of predictors, X, because of the difference in the nature of the samples. If so, you will just have to understand in your use of the models that the models are different. I would compare results of your "sub-groups" method with keeping all the data in the training set to see if dividing is even necessary. – a.powell Apr 03 '17 at 02:26

2 Answers


Yes, modeling the two subgroups separately can make sense if the independent variables are drawn from a mixture of two distributions under different conditions. So I would first try to cluster the training set (the independent variables together with the dependent variable) to check whether modeling two or more subgroups is actually necessary.
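One way to carry out that check is to cluster the predictors and response jointly, then fit a linear model per cluster and compare the coefficients. A rough sketch (the synthetic two-regime data and all settings are made up for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)

# Synthetic mixture: two regimes with different slopes on x2,
# well separated in x1 so the clustering can find them.
n = 400
regime = rng.integers(0, 2, n)
X = rng.standard_normal((n, 2))
X[:, 0] += np.where(regime == 1, 3.0, -3.0)
y = np.where(regime == 1, 2.0, -1.0) * X[:, 1] + 0.1 * rng.standard_normal(n)

# Cluster predictors and response jointly.
Z = StandardScaler().fit_transform(np.column_stack([X, y]))
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)

# Fit a linear model per cluster; clearly different coefficients
# suggest that separate subgroup models are worth considering.
for k in (0, 1):
    m = labels == k
    print(k, np.round(LinearRegression().fit(X[m], y[m]).coef_, 2))
```

If the per-cluster coefficients are essentially the same, a single model over all the data is probably sufficient.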

wolfe

It depends on your specific dataset and problem. Dividing the data into subgroups and fitting a different model for each may be useful or harmful. Consider the following simple training set as an example (only the signs of the values are shown):

x1  x2  x3  y
-   -   -   -
-   -   +   +
+   +   +   +
+   +   -   -
+   -   -   +
-   +   +   -
+   -   +   -
-   +   -   +

Here, it seems useful to divide the training set based on $x_1x_2$, since the $x_1x_2>0$ samples behave differently from the $x_1x_2<0$ samples: in the first case, $x_3$ and $y$ have the same sign, whereas in the second case they have opposite signs. But note that such a division can make your system more vulnerable to input noise. For example, at test time, if the sign of $x_1$ is flipped by noise, the sample is routed to the wrong model and the correct one is completely ignored.
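This is easy to simulate. In a noise-free version of the table's pattern, $y = \operatorname{sign}(x_1 x_2)\, x_3$, a single linear model sees essentially nothing, while the two split models fit each subgroup exactly (the data below is synthetic, generated just to mirror the table):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 2000
X = rng.standard_normal((n, 3))
y = np.sign(X[:, 0] * X[:, 1]) * X[:, 2]  # the pattern from the table

# One global linear model: the interaction is invisible, R^2 near 0.
r2_global = LinearRegression().fit(X, y).score(X, y)

# Two models split on sign(x1*x2): each subgroup is exactly +/- x3.
mask = X[:, 0] * X[:, 1] > 0
r2_pos = LinearRegression().fit(X[mask], y[mask]).score(X[mask], y[mask])
r2_neg = LinearRegression().fit(X[~mask], y[~mask]).score(X[~mask], y[~mask])

print(r2_global)       # close to 0
print(r2_pos, r2_neg)  # both essentially 1
```

So when a real interaction of this kind exists, splitting recovers structure that no single linear model in $x_1, x_2, x_3$ can express.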

But if no such genuinely different patterns exist across the subgroups, dividing the data can be harmful, because reducing the number of training samples per model invites overfitting: spurious patterns may appear in each subgroup simply because of the limited number of samples in it.

Hossein