
Given a 1D sequential categorical input variable, e.g. [rainy, sunny, rainy, cloudy, cloudy], with a small domain {rainy, sunny, cloudy}, which encoding methods (e.g. one-hot, dummy, binary) and which scaling methods (e.g. standardisation, min-max scaling) are appropriate specifically for RNNs such as LSTM and GRU, given their logistic (sigmoid/tanh) activation functions, compared with other NNs which tend to use ReLU?

In Yann A. LeCun et al., "Efficient BackProp", a chapter in Neural Networks: Tricks of the Trade: Second Edition, pages 9–48, LeCun states that data should be preprocessed so that the activations of the neurons have zero mean and unit variance (which is of course not possible for sigmoid activations). This makes backprop efficient and, presumably, ensures that the signals to later layers lie approximately in the region [-1, 1], where the gradient of the logistic functions is greatest.

I believe this is of particular significance for LSTM/GRU RNNs: if one uses one-hot encoding (or some other binary-style encoding), the responses of the sigmoid and tanh units will not be distributed with zero mean and unit variance, contrary to LeCun's directions.
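For concreteness, here is a small sketch (NumPy, using the example sequence above) of what I mean: each one-hot column has mean p and variance p(1 - p), nowhere near zero mean and unit variance.

```python
import numpy as np

sequence = ["rainy", "sunny", "rainy", "cloudy", "cloudy"]
categories = ["rainy", "sunny", "cloudy"]

# One-hot encode the sequence: shape (timesteps, num_categories)
one_hot = np.array([[1.0 if s == c else 0.0 for c in categories] for s in sequence])

print(one_hot.mean(axis=0))  # [0.4 0.2 0.4]    -- each column's mean is p, not 0
print(one_hot.var(axis=0))   # [0.24 0.16 0.24] -- each column's variance is p(1 - p), not 1
```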

Addendum: this has nothing to do with the convolution operation (as per the tagged duplicate) in the context of a categorical variable. This is a question about the appropriate encoding and scaling methods for GRU/LSTM RNNs, which are likely specific to these types of networks because of the logistic-type activation functions they use.

  • Possible dups: https://stats.stackexchange.com/questions/300945/can-convolution-neural-network-be-useful-with-encoded-categorical-features, https://stats.stackexchange.com/questions/104557/how-to-encode-categorical-variables-for-neural-networks, https://stats.stackexchange.com/questions/361077/how-to-input-multiple-categorical-variables-to-neural-network, https://stats.stackexchange.com/questions/139129/how-to-recode-categorical-variable-into-numerical-variable-when-using-svm-or-neu – kjetil b halvorsen Jul 25 '19 at 16:08
  • @kjetilbhalvorsen Thanks for the links, but I don't think they add any new information to my question. – Jinglesting Jul 25 '19 at 17:26
  • Look at the answers there. I think you find answer to your Q among them. If not, can you explain why? – kjetil b halvorsen Jul 25 '19 at 17:28
  • The principal reason is that they just restate methods (for encoding) which I already mention in my question, or have no information (for scaling). I am looking for some specific information regarding LSTM and GRU RNNs. Some justification for the reasoning would also be nice. – Jinglesting Jul 25 '19 at 17:31
  • I think you need to make a stronger case. I cannot see why *logistic type activation functions* should have any influence on choice of encoding. Can you please explain why? – kjetil b halvorsen Jul 25 '19 at 17:52
  • Logistic activation functions have their greatest response from around -1 to +1. If you blindly use one-hot encoding (or some other binary-style encoding), the distribution of the responses will not be normally distributed, and this will likely affect training due to poor back-propagation. See Yann A. LeCun et al - Efficient BackProp Chapter in Neural Networks: Tricks of the Trade: Second Edition, pages 9–48. To clarify, this question is very specific to the RNN architectures I mentioned. I've updated the question to reflect this. – Jinglesting Jul 25 '19 at 18:01
  • Can you please also add the content of this last comment, and maybe change the question title to something like "encoding of categorical variables for neural networks---does it lead to poor backpropagation" or something similar that better reflects the question. – kjetil b halvorsen Jul 25 '19 at 18:43
  • @kjetilbhalvorsen please see the updated question. Feel free to upvote it too – Jinglesting Jul 31 '19 at 22:21
  • Entity embeddings of one hot categories are probably the answer. – Sycorax Jul 31 '19 at 22:35
  • @peterflom please could you remove the duplicate tag. As per Sycorax's novel answer in the comments which is not found in the "already has an answer here" links, this is not a duplicate question. – Jinglesting Aug 01 '19 at 09:06
  • @Sycorax thank you, yes, this looks like a suitable strategy as it simultaneously deals with the encoding and scaling. Feel free to make it an answer and I'll mark it as correct. – Jinglesting Aug 01 '19 at 09:07
  • Your assumption about activations is in general not true, and as far as I can tell they have followed the same trend as everything else in DNNs: sigmoid, ReLU, and now maybe a return to sigmoid with batch normalization and other architectural changes. – Wayne Aug 01 '19 at 13:13
  • @wayne please could you explain in a bit more detail? – Jinglesting Aug 01 '19 at 13:43

1 Answer


The answer is probably entity embeddings for categorical variables. The idea is to employ a strategy similar to word embeddings: map the categories into a lower-dimensional Euclidean space and let the neural network sort out how to place more-similar categories closer together. This is done by feeding the one-hot data to a linear layer with identity activation, trained in the usual way using back-prop. For neural networks at least, this tends to work better than the one-hot encoding on its own.
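As a rough sketch (PyTorch, with made-up sizes), an entity embedding in front of a recurrent cell looks something like the following; nn.Embedding is equivalent to one-hot encoding followed by a linear layer with identity activation, and it is trained by back-prop like any other layer.

```python
import torch
import torch.nn as nn

NUM_CATEGORIES = 3   # e.g. {rainy, sunny, cloudy}
EMBED_DIM = 2        # size of the learned Euclidean embedding (illustrative)
HIDDEN_DIM = 16

class WeatherLSTM(nn.Module):
    def __init__(self):
        super().__init__()
        # Embedding layer: learned jointly with the rest of the network
        self.embed = nn.Embedding(NUM_CATEGORIES, EMBED_DIM)
        self.lstm = nn.LSTM(EMBED_DIM, HIDDEN_DIM, batch_first=True)
        self.out = nn.Linear(HIDDEN_DIM, 1)

    def forward(self, x):              # x: (batch, seq_len) of category indices
        e = self.embed(x)              # (batch, seq_len, EMBED_DIM)
        _, (h, _) = self.lstm(e)       # h: (1, batch, HIDDEN_DIM)
        return self.out(h[-1])         # (batch, 1)

# One sequence: [rainy, sunny, rainy, cloudy, cloudy] -> indices [0, 1, 0, 2, 2]
x = torch.tensor([[0, 1, 0, 2, 2]])
print(WeatherLSTM()(x).shape)          # torch.Size([1, 1])
```

The embedding dimension here is arbitrary; with only three categories there is little to compress, but the same pattern pays off for high-cardinality variables.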

And of course the utility of embeddings in models with recurrent cells has been established for several years in the context of language models.

Cheng Guo, Felix Berkhahn. "Entity Embeddings of Categorical Variables"

We map categorical variables in a function approximation problem into Euclidean spaces, which are the entity embeddings of the categorical variables. The mapping is learned by a neural network during the standard supervised training process. Entity embedding not only reduces memory usage and speeds up neural networks compared with one-hot encoding, but more importantly by mapping similar values close to each other in the embedding space it reveals the intrinsic properties of the categorical variables. We applied it successfully in a recent Kaggle competition and were able to reach the third position with relative simple features. We further demonstrate in this paper that entity embedding helps the neural network to generalize better when the data is sparse and statistics is unknown. Thus it is especially useful for datasets with lots of high cardinality features, where other methods tend to overfit. We also demonstrate that the embeddings obtained from the trained neural network boost the performance of all tested machine learning methods considerably when used as the input features instead. As entity embedding defines a distance measure for categorical variables it can be used for visualizing categorical data and for data clustering.

Alternatively, there's nothing inherently wrong with just feeding one-hot data directly to the model (in some cases, even scaling is unnecessary). It might not be as good as entity embeddings, but it's common practice, a little bit simpler, and in some cases perfectly adequate.


The power of this method has more to do with the usefulness of embeddings and less to do with the importance of scaling or distribution assumptions. You can put one-hot encoded data through recurrent units and the model will be adequate; this is common for character-level language models.
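To make that concrete, a sketch of the plain one-hot route (again PyTorch, illustrative sizes) could look like this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CATEGORIES = 3
HIDDEN_DIM = 16

gru = nn.GRU(NUM_CATEGORIES, HIDDEN_DIM, batch_first=True)

# [rainy, sunny, rainy, cloudy, cloudy] -> indices [0, 1, 0, 2, 2]
indices = torch.tensor([[0, 1, 0, 2, 2]])
one_hot = F.one_hot(indices, NUM_CATEGORIES).float()   # (1, 5, 3), no scaling applied
output, h = gru(one_hot)
print(output.shape)                                    # torch.Size([1, 5, 16])
```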

Sycorax
  • Thanks for the additional detail, this is sufficient for my current problem. However, after a bit of thinking, I'm still not completely clear on how you might define a target when training the embedding for, say, an MLP classifier if you don't have any information about how the categories relate to one another (i.e. they are not sequential, so there is no "context" as there would be with skip-gram or CBOW). Does one simply use the main problem's target variable(s)? – Jinglesting Aug 01 '19 at 14:19
  • @Jinglesting Using entity embeddings just adds embedding layers to the network, and the embedding layers are trained the same way as the rest of the network, using back-prop from whatever your target/output is. So you don't need to know anything about the embedding to train it, for the same reason that you don't need to know about the weights and biases of other hidden layers. That's what back-prop is for. You might be thinking of word2vec using CBOW, which is one kind of embedding, but its training procedure is completely distinct. – Sycorax Aug 01 '19 at 14:21
  • Ok, thank you, I was just getting a bit confused, as it seems to be common practice to use pretrained embeddings for seq2seq problems and, according to what I've read, this often gives a small boost in performance. – Jinglesting Aug 01 '19 at 14:24
  • Yeah, the key point to entity embeddings is that you *don't* have to come up with some specialized method to train them; instead, just use linear layers to do the embedding for you in the ordinary supervised way. This is developed more in the paper. – Sycorax Aug 01 '19 at 14:25