I'm new to data science and I'm working on a challenge with some friends. I have a data set with 80 features and around 4,000 rows.
The data is split into about 180 categories (A, B, C, D, etc.). At first I tried applying XGBoost directly on the whole training set and got an RMSE of 0.11, without doing any advanced feature engineering.
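For reference, this is roughly what my baseline looks like (simplified; the file and column names like `category` and `target` are just placeholders):

```python
import pandas as pd
import xgboost as xgb
from sklearn.metrics import mean_squared_error

# placeholder file and column names -- adjust to the actual data set
train = pd.read_csv("train.csv")
X = pd.get_dummies(train.drop(columns=["target"]), columns=["category"])
y = train["target"]

# plain XGBoost on the whole training set, no special feature engineering
model = xgb.XGBRegressor(n_estimators=500, learning_rate=0.05)
model.fit(X, y)

preds = model.predict(X)
print("train RMSE:", mean_squared_error(y, preds) ** 0.5)
```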
Then I had the idea of fitting a decision tree regressor for each category, so I ended up with around 180 models stored in a dict. For each test row, I look up the category name and load the corresponding model. With this approach I used only one variable and got an RMSE of 0.095, which is quite good considering I was only using one basic feature that is strongly correlated with the target $y$.
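A simplified version of what I did (again, `category`, `feature` and `target` are placeholder names):

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# one decision tree per category, keyed by category name
models = {}
for cat, group in train.groupby("category"):
    m = DecisionTreeRegressor()
    m.fit(group[["feature"]].values, group["target"])  # single strongly correlated feature
    models[cat] = m

# at prediction time, look up the model that matches the row's category
test["prediction"] = test.apply(
    lambda row: models[row["category"]].predict([[row["feature"]]])[0], axis=1
)
```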
I'm wondering, however, if this is the best approach. Is it common to build a separate model for each category? This way I have around 180 models, and each category taken on its own only has 10-20 rows of data, which is clearly not enough.
Mainly, I don't know whether the following strategies would be worth trying:
- Use the whole training set with just one model like XGBoost, but improve the features (work more on the feature engineering side).
- Use a clustering algorithm to group similar categories into clusters and fit a model for each cluster (rough sketch below).
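For the second option, this is roughly what I have in mind. Clustering categories by the mean of their features is just one possible way to define "similar categories", and the number of clusters is arbitrary here:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeRegressor

# rough sketch; column names are placeholders
train = pd.read_csv("train.csv")
feature_cols = [c for c in train.columns if c not in ("category", "target")]

# describe each category by the mean of its features, then cluster the categories
cat_profiles = train.groupby("category")[feature_cols].mean()
kmeans = KMeans(n_clusters=20, random_state=0).fit(cat_profiles)
cat_to_cluster = dict(zip(cat_profiles.index, kmeans.labels_))

# fit one model per cluster instead of one per category
train["cluster"] = train["category"].map(cat_to_cluster)
cluster_models = {}
for cl, group in train.groupby("cluster"):
    m = DecisionTreeRegressor()
    m.fit(group[feature_cols], group["target"])
    cluster_models[cl] = m
```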
Which approach is generally preferred in regression problems like this, where each row belongs to one of many categories?