Is it valid to have all zeroes in a One-Hot Encoded categorical feature?

Question

I'm building an MLP classification model and one of my features is the name of certain products. These names can be anything and in theory there could be an infinite number of different names in the model. However there are a pretty small number of names that we see a lot in our data, so I'd like to use the most common names as a categorical feature.

I'd like to use one-hot encoding to transform these names into something usable by the model, but my question is what to do with samples which do not have a common product name? My understanding was that these could be encoded as all zeroes, as they won't fit into any of the one-hot encoded feature's buckets. But is that a valid thing to do for one-hot encoding? Both Spark-ML and Scikit-Learn's one hot encoders don't seem to allow this.

An alternative is to put all the uncommon product names into their own shared bucket (an "everything else" bucket), but I'm unsure whether this will have unwanted effects on the model.

Tim · Accepted Answer · 2019-05-20T09:24:40.007

In this case a neural network won't differ much from regression in terms of what is happening, so let's start with linear regression as a simplified case. If all the categories are "cold" in the one-hot encoding, the feature vector $\mathbf{X}$ for that sample is all zeroes. In that case, linear regression gives

$$\begin{align} y &= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon \\ &= \beta_0 + \beta_1 \cdot 0 + \beta_2 \cdot 0 + \dots + \beta_k \cdot 0 + \varepsilon \\ &= \beta_0 + \varepsilon \end{align}$$

so you basically return a value governed by the intercept (bias node). With a single-layer neural network, the difference is that there are multiple such "regressions", each wrapped in an activation function. As a consequence, instead of returning a single bias, you would be returning multiple biases (each passed through its activation) and then aggregating them at the output layer. With a multi-layer network, the higher layers will be functions of only the intercepts from the first layer.
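As a quick check of the linear case, here is a minimal sketch in plain Python. The coefficient values and the `linear_predict` helper are made up for illustration, and the noise term $\varepsilon$ is left out.

```python
def linear_predict(x, beta, beta0):
    """Linear regression prediction (noise term omitted)."""
    return beta0 + sum(b * xi for b, xi in zip(beta, x))

beta0 = 1.5              # intercept (made-up value)
beta = [0.7, -2.0, 3.1]  # one coefficient per one-hot column (made-up values)

hot = [0, 1, 0]          # sample whose product is the second common name
cold = [0, 0, 0]         # sample with an uncommon product name

pred_hot = linear_predict(hot, beta, beta0)    # intercept plus beta[1]
pred_cold = linear_predict(cold, beta, beta0)  # intercept only
```

Every all-zero row gets the same prediction, `beta0`, no matter what the coefficients are.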

Put differently, your network would return the "default" (biases-only) output for this configuration of features. If you didn't use biases, every pre-activation would be zero, so the output would be all zeroes for activations that map zero to zero (such as ReLU or tanh).
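The same thing can be seen in a toy network. This is only a sketch under assumptions that are not in the question: one hidden layer with ReLU, a linear output unit, and arbitrary made-up weights. The helper names `dense` and `mlp` are hypothetical.

```python
def relu(v):
    return [max(0.0, z) for z in v]

def dense(x, W, b):
    """Fully connected layer; W holds one row of weights per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def mlp(x, W1, b1, W2, b2):
    """One ReLU hidden layer followed by a linear output layer."""
    return dense(relu(dense(x, W1, b1)), W2, b2)

# Made-up weights: 3 one-hot inputs, 2 hidden units, 1 output.
W1 = [[0.5, -1.0, 2.0], [1.5, 0.3, -0.7]]
b1 = [0.2, -0.4]
W2 = [[1.0, -2.0]]
b2 = [0.1]

cold = [0, 0, 0]  # all-zero one-hot row

with_bias = mlp(cold, W1, b1, W2, b2)
# The first-layer weights drop out entirely: only the biases matter.
biases_only = dense(relu(b1), W2, b2)

# Without biases, a ReLU network maps the all-zero row to zero.
no_bias = mlp(cold, W1, [0.0, 0.0], W2, [0.0])
```

Here `with_bias` equals `biases_only`, and `no_bias` is `[0.0]`. With a sigmoid or softmax on the output, the bias-free result would not be zero but whatever that activation returns for a zero input.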

Is it reasonable? The answer depends on whether you are willing to accept that the network returns the same output for every such case. The result will, however, be the same as if you used a common "other" category for the remaining, unknown products.
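To make the two options concrete, here is a small sketch of both encodings in plain Python. The product names and the helpers `fit_vocabulary` and `encode` are hypothetical, invented for this illustration rather than taken from any library.

```python
from collections import Counter

def fit_vocabulary(names, top_k):
    """Keep the top_k most frequent names, most frequent first."""
    return [name for name, _ in Counter(names).most_common(top_k)]

def encode(name, vocabulary, other_bucket=False):
    """One-hot encode a name against the vocabulary.

    Names outside the vocabulary become an all-zero row, or set an
    extra trailing "other" column when other_bucket is True.
    """
    row = [1 if name == v else 0 for v in vocabulary]
    if other_bucket:
        row.append(0 if name in vocabulary else 1)
    return row

# Made-up training names; "widget" and "gadget" are the common ones.
names = ["widget", "gadget", "widget", "gizmo", "widget", "gadget", "doohickey"]
vocab = fit_vocabulary(names, top_k=2)

zeros_for_rare = encode("gizmo", vocab)                     # all zeroes
other_for_rare = encode("gizmo", vocab, other_bucket=True)  # last column hot
known = encode("gadget", vocab, other_bucket=True)
```

In both variants every uncommon name maps to one and the same row, so the model cannot tell uncommon names apart either way. As far as I know, scikit-learn's `OneHotEncoder` has a `handle_unknown="ignore"` option that yields the all-zero row for categories it did not see during `fit`.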
