I have a regression problem. Two input features describe a category and subcategory. For illustrative explanation, let's consider we speak about city and district.
Some more details about the regression problem: 39000 observations, 104 cities, each 1-17 districts. 5 additional demographical variables (age, salary, gender, marital status, education level). Trying to predict the number of children of a given person (from given city, district, age, salary, ....). In some districts, we have only 1 record
The question: Is there any specific method how to represent the nested categories for machine learning?
Important comments:
- Plain application of one-hot encoder to city-district pairs will not work as lots of combinations are very rare in the data.
- Still, not willing to ignore the information about districts completely.
- If doing just logistic regression, a hierarchical Bayesian model could help. However, what about xgboost or neural networks?
Non-specific attempts so far:
- To do one-hot encoding for both higher (city) and lower hierarchies (city-district) and then apply standard feature selection methods.
- To combine the above with explicit filtration of very rare combinations (say at least 5 examples).