I'm new to data science and I'm working on a challenge with some friends. I have a data set of 80 features and around 4,000 rows.

The data is split into 180 categories (A, B, C, D, etc.). At first I applied XGBoost directly to the whole training set and got an RMSE of 0.11, without any advanced feature engineering.

Then I had the idea to fit a decision tree regressor for each category of the data, so I ended up with around 180 models in a dict. On the test set, I look at the category name and load the corresponding model. I used only one variable this way and got an RMSE of 0.095, which is quite good given that I was only using one basic feature that is strongly correlated with the target $y$.
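A minimal sketch of this per-category dictionary setup follows. It is illustrative only: a one-feature least-squares line stands in for the decision tree regressor to keep the example dependency-free, and the names `fit_line`, `fit_per_category`, `predict` and the toy data are hypothetical. It also adds a global fallback model for categories not seen in training, which the description above does not mention.

```python
from collections import defaultdict


def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x.

    Falls back to a constant (the mean of y) when x has no spread.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return mean_y, 0.0
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_per_category(rows):
    """rows: iterable of (category, x, y) tuples.

    Returns a dict mapping category -> (intercept, slope), plus one
    model fitted on all rows to use as a fallback.
    """
    grouped = defaultdict(lambda: ([], []))
    all_x, all_y = [], []
    for cat, x, y in rows:
        grouped[cat][0].append(x)
        grouped[cat][1].append(y)
        all_x.append(x)
        all_y.append(y)
    models = {cat: fit_line(xs, ys) for cat, (xs, ys) in grouped.items()}
    return models, fit_line(all_x, all_y)


def predict(models, fallback, cat, x):
    """Look up the model for this category; use the global one if unseen."""
    intercept, slope = models.get(cat, fallback)
    return intercept + slope * x


# Toy data (made up): two categories with different relationships.
train = [("A", 1.0, 2.0), ("A", 2.0, 4.0), ("A", 3.0, 6.0),
         ("B", 1.0, 5.0), ("B", 2.0, 6.0), ("B", 3.0, 7.0)]
models, fallback = fit_per_category(train)
```

With only 10 to 20 rows per category, each of these small models is fitted on very little data, so comparing them against the single global model on held-out rows is worth doing before trusting the lower RMSE.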

I'm wondering, however, whether this is the best approach. Is it common to build a model for each data category? This way I have around 180 models, and each category taken separately has only 10 to 20 rows of data, which is clearly not enough for a flexible model.

Mainly I don't know if the following strategies would be good to try:

  • Use the whole training set and a single model like XGBoost, but improve the features (work more on the feature engineering).
  • Use a clustering algorithm to create clusters of similar categories and fit a model for each cluster.

Which approach is generally preferred in regression problems like this, where each row belongs to one of many categories?

Dandly

1 Answer

I asked a very similar question here: "When to use mixed effect model?"


Depending on what you want to do and how much data you have, you may or may not want to build a model for each class or each group of classes. (With 180 classes and only 4,000 rows, though, I would say there is not enough data to build that many models.)

  • Building one model for each class (or group of classes) leads to a very complicated overall model (high variance) that may work well on the training data but not on the test data.

  • If you have enough data (say, a couple of thousand rows for each class) and only care about accuracy, not interpretability, building one model for each class is also OK.

  • The intuition is that when data is limited, we need to take advantage of the commonalities between classes, so a 'mixed effect' model may be better.
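To make the "commonalities between classes" intuition concrete, here is a minimal sketch of partial pooling, which is the idea behind a random-intercept mixed model. It is not a full mixed-effects fit: each class mean is simply pulled toward the global mean, and classes with fewer rows are pulled harder. The function name `shrunken_means` and the parameter `prior_strength` are hypothetical; in a real mixed model the amount of shrinkage would be estimated from the within-class and between-class variances rather than set by hand.

```python
from collections import defaultdict


def shrunken_means(rows, prior_strength=5.0):
    """Partially pooled class means.

    rows: iterable of (category, y) tuples.
    prior_strength: hypothetical tuning knob; it acts like that many
        pseudo-observations sitting at the global mean. 0 gives the
        plain per-class mean, a very large value gives the global mean.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    total, n = 0.0, 0
    for cat, y in rows:
        sums[cat] += y
        counts[cat] += 1
        total += y
        n += 1
    global_mean = total / n
    estimates = {
        cat: (sums[cat] + prior_strength * global_mean)
             / (counts[cat] + prior_strength)
        for cat in sums
    }
    return estimates, global_mean


# Toy data (made up): class "A" has a single extreme row, "B" has nine rows.
rows = [("A", 10.0)] + [("B", 0.0)] * 9
estimates, global_mean = shrunken_means(rows, prior_strength=1.0)
```

In this toy data the single-row class "A" moves a long way from its raw mean of 10 toward the global mean of 1, while the nine-row class "B" barely moves. This is the behaviour you want when many classes have only 10 to 20 rows.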

Haitao Du