Is it better to build N models for each category of data?

Question

I'm new to data science and I'm working on a challenge with some friends. I have a data set of 80 features and around 4000 rows.

The data is split into 180 categories (A, B, C, D, etc.). At first I applied XGBoost directly to the whole training set and got an RMSE of 0.11, without any advanced feature engineering.

Then I had the idea to fit a decision tree regressor for each category of the data, so I ended up with around 180 models in a dict. On the test set, I look at the category name and load the corresponding model. Using only one variable this way, I got an RMSE of 0.095, which is quite good given that I was using a single basic feature that was strongly correlated with the target $y$.
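For illustration, the per-category setup described above can be sketched as follows. This is only a minimal sketch under assumptions: to stay dependency-free it swaps the decision tree regressor for a one-variable least-squares line, and the names `fit_line`, `fit_per_category`, `predict`, the `min_rows` threshold, and the toy data are all hypothetical. It also adds a fallback to a single global model for categories that were not seen in training, which the description above does not mention.

```python
from collections import defaultdict


def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x on one feature.

    Returns (intercept, slope). If x is constant, predicts the mean of y.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return mean_y, 0.0
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_per_category(rows, min_rows=2):
    """rows: iterable of (category, x, y) tuples.

    Returns (models, global_model): a dict with one model per category that
    has at least `min_rows` rows, plus one model fitted on all rows.
    """
    grouped = defaultdict(lambda: ([], []))
    all_x, all_y = [], []
    for cat, x, y in rows:
        grouped[cat][0].append(x)
        grouped[cat][1].append(y)
        all_x.append(x)
        all_y.append(y)
    models = {
        cat: fit_line(xs, ys)
        for cat, (xs, ys) in grouped.items()
        if len(xs) >= min_rows
    }
    return models, fit_line(all_x, all_y)


def predict(models, global_model, cat, x):
    """Look up the category's model; fall back to the global model if missing."""
    intercept, slope = models.get(cat, global_model)
    return intercept + slope * x


# Toy data (made up): category A follows y = 2x, category B follows y = 4 + x.
train = [
    ("A", 1.0, 2.0), ("A", 2.0, 4.0), ("A", 3.0, 6.0),
    ("B", 1.0, 5.0), ("B", 2.0, 6.0), ("B", 3.0, 7.0),
]
models, global_model = fit_per_category(train)

pred_a = predict(models, global_model, "A", 0.0)       # uses A's own model
pred_b = predict(models, global_model, "B", 0.0)       # uses B's own model
pred_unseen = predict(models, global_model, "C", 2.0)  # falls back to global
```

With only 10 to 20 rows per category, each of these small models is fitted on very little data, which is the concern raised in the next paragraph.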

However, I'm wondering whether this is the best approach. Is it common to build a model for each data category? This way I have around 180 models, and each category taken separately has only 10 to 20 rows of data, which is clearly very little to fit a separate model on.

Mainly I don't know if the following strategies would be good to try:

Use the whole training set with a single model such as XGBoost, but improve the features (work more on the feature engineering aspect).
Use a clustering algorithm to create clusters of similar categories and fit a model for each cluster.

Which approach is generally preferred in similar regression problems, where each row belongs to one of several categories?

Haitao Du · Answer 1 · 2020-05-24T04:57:10.387

I have a very similar question here: "When to use mixed effect model?"

Depending on what you want to do and the amount of data available, you may or may not want to build a model for each class or each "group of classes". (But with 180 classes and 4000 rows, I would say there is not enough data to build that many models.)

Building one model for each class (or group of classes) leads to a very complex overall model (high variance) that may work well on the training data but not on the test data.
If you have enough data and only care about accuracy (not interpretability), building one model for each class is also OK (say, if you have a couple of thousand rows for each class).
The intuition is that when there is not much data, we need to take advantage of the commonalities between classes, so a "mixed effect" model may be better.
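To make the "commonalities between classes" idea concrete, here is a rough sketch of partial pooling, the mechanism behind a random-intercept mixed effect model $y_{ij} = \mu + u_j + e_{ij}$. Each class mean is pulled toward the overall mean, and classes with fewer rows are pulled harder. The assumptions are loud ones: the function name `partial_pooling_means` is hypothetical, the two variance components are treated as known inputs (a real mixed model estimates them from the data), the plain grand mean stands in for the fitted $\mu$, and the toy numbers are made up.

```python
from collections import defaultdict


def partial_pooling_means(rows, within_var, between_var):
    """rows: iterable of (category, y) tuples.

    Returns (grand_mean, shrunk), where shrunk maps each category to
        w * category_mean + (1 - w) * grand_mean,
        w = n / (n + within_var / between_var).

    within_var:  assumed noise variance inside a category
    between_var: assumed variance of the category effects
    """
    groups = defaultdict(list)
    for cat, y in rows:
        groups[cat].append(y)

    all_y = [y for ys in groups.values() for y in ys]
    grand_mean = sum(all_y) / len(all_y)

    ratio = within_var / between_var
    shrunk = {}
    for cat, ys in groups.items():
        n = len(ys)
        weight = n / (n + ratio)  # approaches 1 as the category gets more rows
        shrunk[cat] = weight * (sum(ys) / n) + (1 - weight) * grand_mean
    return grand_mean, shrunk


# Toy data (made up): A has four rows, B has only one.
rows = [("A", 10.0), ("A", 10.0), ("A", 10.0), ("A", 10.0), ("B", 0.0)]
grand_mean, shrunk = partial_pooling_means(rows, within_var=1.0, between_var=1.0)
# A keeps most of its own mean (weight 4/5); B is pulled halfway to the
# grand mean (weight 1/2) because a single row says little about the class.
```

A model per class corresponds to weight 1 everywhere (no pooling), and a single model that ignores the class corresponds to weight 0 (complete pooling). The mixed effect model sits in between, which is why it tends to suit cases with many classes and few rows per class. In practice the variances would be estimated by a library, for example `MixedLM` in statsmodels.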
