Hierarchical Categories as Input Features

Question

I have a regression problem in which two input features describe a category and a subcategory. For illustration, think of them as city and district.

Some more details about the regression problem: 39000 observations and 104 cities, each with 1-17 districts. There are 5 additional demographic variables (age, salary, gender, marital status, education level). The goal is to predict the number of children of a given person (from the given city, district, age, salary, ...). Some districts have only 1 record.

The question: Is there a specific method for representing nested categories in machine learning?

Important comments:

Plain one-hot encoding of city-district pairs will not work, because many combinations are very rare in the data.
Still, I am not willing to ignore the district information completely.
If I were just doing logistic regression, a hierarchical Bayesian model could help. But what about xgboost or neural networks?

Non-specific attempts so far:

One-hot encoding both the higher level (city) and the lower level (city-district), then applying standard feature selection methods.
Combining the above with explicit filtering of very rare combinations (say, keeping only those with at least 5 examples).
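
The two attempts above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the asker's actual code: the helper names `build_vocab` and `encode` and the toy rows are made up, and the threshold of 5 is the one mentioned above. Every city gets its own indicator column, while a city-district pair gets a column only if it occurs often enough, so a rare district falls back to its city indicator alone.

```python
from collections import Counter

MIN_COUNT = 5  # rarity threshold from the question ("at least 5 examples")


def build_vocab(rows, min_count=MIN_COUNT):
    """Return ordered columns: every city, plus the frequent city-district pairs."""
    cities = sorted({city for city, _ in rows})
    pair_counts = Counter(rows)
    pairs = sorted(p for p, n in pair_counts.items() if n >= min_count)
    return [("city", c) for c in cities] + [("pair", p) for p in pairs]


def encode(row, vocab):
    """One-hot encode a (city, district) row; a rare pair keeps only its city bit."""
    city, _ = row
    return [
        1 if (kind == "city" and key == city) or (kind == "pair" and key == row) else 0
        for kind, key in vocab
    ]


# Toy data (made up): pair ("A", "x") is frequent, ("A", "y") appears only once.
rows = [("A", "x")] * 6 + [("A", "y")] + [("B", "z")] * 5
vocab = build_vocab(rows)

frequent = encode(("A", "x"), vocab)  # city bit and pair bit are both set
rare = encode(("A", "y"), vocab)      # only the city bit is set
```

The resulting vectors could then be passed to any learner. The sketch only shows the encoding, not the feature selection step.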

Can you tell us some more details? Many cities with many districts within each one? You could look at multilevel models, maybe mixed models. — kjetil b halvorsen, Sep 29 '20 at 17:50
Yes, many cities, each with multiple districts. The district is specified by city name and district name. — Karel Macek, Sep 29 '20 at 19:18
How many observations in total? How many cities? How many other variables, and what are they? What is the **ultimate** goal of the modelling? — kjetil b halvorsen, Sep 29 '20 at 20:26
39000 observations, 104 cities, each 1-17 districts. 5 additional demographical variables (age, salary, gender, marital status, education level). Trying to predict the number of children. In some districts, we have only 1 record. — Karel Macek, Sep 30 '20 at 09:21
Can you please add that info as an edit to the post? Not everybody reads comments ... but I would go, as a start, for a mixed effects model with districts, nested in cities, as random intercepts (since it does not seem that the districts are interesting in themselves). That takes care of the rare combinations, so there is no need to filter them. With your number of observations, stay away from feature selection! — kjetil b halvorsen, Sep 30 '20 at 13:30

score 0 · Answer 1 · answered Sep 30 '20 at 15:04

As a start, I would go for a random effects model with districts nested within cities, with a random intercept for each district. Then the few districts with only one or very few observations should not be a problem. With 39000 observations and only a few covariates, I would just use them all and avoid feature selection; see for instance "Why is variable selection necessary?". Model continuous variables such as salary and age with splines, and include relevant interactions!
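
To see why sparse districts stop being a problem, it may help to look at the partial pooling that random intercepts imply. The sketch below is a simplified, standard-library illustration and not a fitted mixed model: the pseudo-counts `K_DISTRICT` and `K_CITY` are hypothetical constants, whereas a real mixed model would estimate the amount of shrinkage from the variance components. The function names and the toy records are also made up. Each district mean is pulled toward its city's estimate, and each city mean toward the global mean, with less pull the more observations a group has.

```python
from collections import defaultdict
from statistics import mean

K_DISTRICT = 10.0  # hypothetical pseudo-count: pull of a district toward its city
K_CITY = 10.0      # hypothetical pseudo-count: pull of a city toward the global mean


def shrink(group_mean, n, parent_mean, k):
    """Weighted average of group and parent means; a small n leans on the parent."""
    return (n * group_mean + k * parent_mean) / (n + k)


def pooled_district_means(records):
    """records: list of (city, district, y). Returns {(city, district): estimate}."""
    by_city, by_district = defaultdict(list), defaultdict(list)
    for city, district, y in records:
        by_city[city].append(y)
        by_district[(city, district)].append(y)
    global_mean = mean(y for _, _, y in records)
    city_est = {
        c: shrink(mean(ys), len(ys), global_mean, K_CITY) for c, ys in by_city.items()
    }
    return {
        (c, d): shrink(mean(ys), len(ys), city_est[c], K_DISTRICT)
        for (c, d), ys in by_district.items()
    }


# Toy data (made up): district "x" has 40 records, district "y" has a single outlier.
records = [("A", "x", 1.0)] * 40 + [("A", "y", 5.0)]
estimates = pooled_district_means(records)
```

In this toy example the single-record district ends up close to its city's level instead of at its one raw value, while the well-populated district barely moves.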

You seem to want a flexible model, such as a neural network? I'm not sure how much sense that makes with 39000 observations; start with flexible linear mixed models and evaluate the results. But there is an R package, Buddle, for neural networks with random effects, and Google gives a lot of hits.
