Neural Nets: One-hot variable overwhelming continuous?

Question

I have raw data with about 20 columns (20 features). Ten of them are continuous and 10 are categorical. Some of the categorical columns can have around 50 different values (U.S. states). After I pre-process the data, the 10 continuous columns become 10 prepared columns and the 10 categorical columns become about 200 one-hot encoded variables. I am concerned that if I put all of these 200+10=210 features into the neural net, the 200 one-hot features (from the 10 categorical columns) will totally dominate the 10 continuous features.

Perhaps one method would be to "group" columns together or something. Is this a valid concern and is there any standard way of dealing with this issue?

(I am using Keras, although I don't think it matters much.)

Have you considered using two (or more) sequential models and then merging them? Each model has inputs that better match the data as it comes (as opposed to mashing it up like a sausage). The targets are the same, but you make two sets of training data, and each is fed independently during fitting. Directly after the merge comes your final output layer, so that final layer decides which model works best for particular samples. From keras.io: https://keras.io/getting-started/sequential-model-guide/ — photox, Mar 08 '17 at 12:03
I tried this and the val_loss of the ensemble(model_1, model_2) was higher than the val_loss of model_1 and higher than the val_loss of model_2. — user1367204, Mar 10 '17 at 17:04
Have you actually tried this and determined that this issue does in fact occur? What tests did you do to check this point? What were the results? — Hugh Perkins, Jan 01 '18 at 05:18

score 5 · Answer 1 · answered Jul 30 '17 at 15:38

You can encode the categorical variables with a method other than one-hot. Binary or hashing encoders may be appropriate for this case. Hashing in particular is nice because you encode all of the categories into a single fixed-size representation per feature vector, so no single one dominates the others. You can also specify the size of the final representation, so you can hash all categorical variables into 10 features and end up with 20 numeric features (half continuous, half categorical).

Both are implemented in https://github.com/scikit-learn-contrib/categorical-encoding, or are fairly straightforward to implement yourself.
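To make the hashing idea concrete, here is a minimal hand-rolled sketch using only the standard library. It is not the linked library's implementation; the function name `hash_encode`, the column names, and the choice of MD5 as the hash are all assumptions made for illustration. Each `column=value` pair is hashed into one of `n_features` slots, so any number of categorical columns collapses into a fixed-length vector (at the cost of occasional collisions, where two different categories share a slot).

```python
import hashlib

def hash_encode(row, n_features=10):
    """Hash a dict of {column: category} into a fixed-length count vector.

    Hypothetical helper for illustration only.
    """
    vec = [0.0] * n_features
    for column, value in row.items():
        # Include the column name so "CA" in one column differs from "CA" in another.
        token = f"{column}={value}".encode("utf-8")
        digest = hashlib.md5(token).digest()
        # Use the first 8 bytes as an integer and fold it into the available slots.
        index = int.from_bytes(digest[:8], "big") % n_features
        vec[index] += 1.0
    return vec

# Three made-up categorical columns for one sample.
row = {"state": "CA", "plan": "basic", "channel": "web"}
continuous = [0.1] * 10  # stand-in for the 10 prepared continuous features

# 10 continuous + 10 hashed categorical features.
features = continuous + hash_encode(row, n_features=10)
print(len(features))  # 20
```

A smaller `n_features` gives a more compact input but more collisions, so the size is a trade-off worth tuning against validation loss.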

COOLBEANS · Answer 2 · 2018-04-16T22:39:27.230

You could use an embedding to transform your large number of categorical variables into a single vector. This compressed vector will be a distributed representation of the categorical features. The categorical inputs will be transformed into a relatively small vector of N real numbers that in some way represent N latent features describing all the inputs.

Consider the large number of words in the English dictionary. If this number is V, then we could represent each word as a one-hot encoded vector of length V. However, word2vec is able to capture much of the useful information in a dense vector of length 200 to 300.
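The embedding idea above can be sketched with NumPy alone. An embedding is a lookup table with one row per category, and looking up a row gives the same result as multiplying the one-hot vector by that table, just without ever building the wide one-hot input. The sizes here (50 categories, 4 dimensions) and the random weights are assumptions for illustration; in a real network (for example with Keras's `Embedding` layer) the table is learned during training, and this sketch covers a single categorical column only.

```python
import numpy as np

rng = np.random.default_rng(0)

n_categories = 50    # e.g. one column holding U.S. states
embedding_dim = 4    # assumed size of the dense representation

# Stand-in for learned weights: one row of length embedding_dim per category.
embedding_table = rng.normal(size=(n_categories, embedding_dim))

# Integer-coded categories for three samples.
state_ids = np.array([4, 31, 4])

# Embedding lookup: pick the row for each sample's category.
embedded = embedding_table[state_ids]            # shape (3, 4)

# The same result via an explicit one-hot matrix and a matrix product.
one_hot = np.eye(n_categories)[state_ids]        # shape (3, 50)
via_matmul = one_hot @ embedding_table           # shape (3, 4)

# Concatenate with the continuous features before the dense layers.
continuous = rng.normal(size=(3, 10))
network_input = np.concatenate([continuous, embedded], axis=1)  # shape (3, 14)
```

With this layout the categorical column contributes 4 inputs rather than 50, so the continuous features are no longer outnumbered.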
