One-hot encoding for a large number of values

Question

How do we use one-hot encoding if the number of values a categorical variable can take is large?

In my case it is 56 values. With the usual method I would have to add 56 columns (56 binary features) to the training dataset, which will greatly increase the complexity and hence the training time.

So how do we deal with such cases?

What kind of algorithm are you using? Most can handle that stress. — jlimahaverford, Oct 03 '15 at 19:19
Might be useful: https://stats.stackexchange.com/questions/146907/principled-way-of-collapsing-categorical-variables-with-many-levels — kjetil b halvorsen, Dec 20 '19 at 02:50

score 1 · Answer 1 · answered Oct 03 '15 at 21:07

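It should be possible to get away with two columns per feature. For a categorical variable with $n$ levels, take the two coordinates to be the real and imaginary parts of the $n$-th roots of unity: each level is assigned one of the $n$ points $(\cos(2\pi k/n), \sin(2\pi k/n))$, $k = 0, \dots, n-1$, on the unit circle.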

So your encoding will look like this:

+----------+-----------+-----------+
| Category | X         | Y         |
+----------+-----------+-----------+
| 1        | -1.000000 | 0.000000  |
| 2        | -0.993712 | 0.111964  |
| 3        | -0.993712 | -0.111964 |
| 4        | -0.974928 | 0.222521  |
| 5        | -0.974928 | -0.222521 |
| 6        | -0.943883 | 0.330279  |
| 7        | -0.943883 | -0.330279 |
| 8        | -0.900969 | 0.433884  |
| 9        | -0.900969 | -0.433884 |
| 10       | -0.846724 | 0.532032  |
| 11       | -0.846724 | -0.532032 |
| 12       | -0.781831 | 0.623490  |
| 13       | -0.781831 | -0.623490 |
| 14       | -0.707107 | 0.707107  |
| 15       | -0.707107 | -0.707107 |
| 16       | -0.623490 | 0.781831  |
| 17       | -0.623490 | -0.781831 |
| 18       | -0.532032 | 0.846724  |
| 19       | -0.532032 | -0.846724 |
| 20       | -0.433884 | 0.900969  |
| 21       | -0.433884 | -0.900969 |
| 22       | -0.330279 | 0.943883  |
| 23       | -0.330279 | -0.943883 |
| 24       | -0.222521 | 0.974928  |
| 25       | -0.222521 | -0.974928 |
| 26       | -0.111964 | 0.993712  |
| 27       | -0.111964 | -0.993712 |
| 28       | 0.000000  | 1.000000  |
| 29       | 0.000000  | -1.000000 |
| 30       | 0.111964  | 0.993712  |
| 31       | 0.111964  | -0.993712 |
| 32       | 0.222521  | 0.974928  |
| 33       | 0.222521  | -0.974928 |
| 34       | 0.330279  | 0.943883  |
| 35       | 0.330279  | -0.943883 |
| 36       | 0.433884  | 0.900969  |
| 37       | 0.433884  | -0.900969 |
| 38       | 0.532032  | 0.846724  |
| 39       | 0.532032  | -0.846724 |
| 40       | 0.623490  | 0.781831  |
| 41       | 0.623490  | -0.781831 |
| 42       | 0.707107  | 0.707107  |
| 43       | 0.707107  | -0.707107 |
| 44       | 0.781831  | 0.623490  |
| 45       | 0.781831  | -0.623490 |
| 46       | 1.000000  | 0.000000  |
| 47       | 0.993712  | 0.111964  |
| 48       | 0.993712  | -0.111964 |
| 49       | 0.974928  | 0.222521  |
| 50       | 0.974928  | -0.222521 |
| 51       | 0.943883  | 0.330279  |
| 52       | 0.943883  | -0.330279 |
| 53       | 0.900969  | 0.433884  |
| 54       | 0.900969  | -0.433884 |
| 55       | 0.846724  | 0.532032  |
| 56       | 0.846724  | -0.532032 |
+----------+-----------+-----------+
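
A minimal sketch of how such a table could be generated, using only the standard library. The function name `roots_of_unity_encoding` and the assignment of category $k$ to angle $2\pi (k-1)/n$ are assumptions made for illustration; the table above contains the same 56 points but lists them in a different order.

```python
import math


def roots_of_unity_encoding(n_levels):
    """Return n_levels (x, y) points spaced evenly on the unit circle.

    Index k (0-based) maps to (cos(2*pi*k/n), sin(2*pi*k/n)).
    This ordering is an assumption, not the ordering used in the table.
    """
    return [
        (
            math.cos(2 * math.pi * k / n_levels),
            math.sin(2 * math.pi * k / n_levels),
        )
        for k in range(n_levels)
    ]


encoding = roots_of_unity_encoding(56)

# Print categories 1..56 with their two coordinates.
for category, (x, y) in enumerate(encoding, start=1):
    print(f"{category:2d}  {x: .6f}  {y: .6f}")
```

Each category then contributes two numeric columns instead of 56 binary ones.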

How is this better than just using the category column itself? — amoeba, Oct 03 '15 at 23:03
For calculating distances. Just using the categories here will enforce an ordinal relationship which might not always hold — rightskewed, Oct 04 '15 at 03:00
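
To illustrate the point about distances, here is a small sketch comparing the raw integer codes with the two-column encoding. The helper names (`circle_point`, `integer_distance`, `circle_distance`) and the assignment of category $k$ to angle $2\pi (k-1)/56$ are assumptions for this example only.

```python
import math

N_LEVELS = 56


def circle_point(category):
    """Map a 1-based category to a point on the unit circle (assumed ordering)."""
    angle = 2 * math.pi * (category - 1) / N_LEVELS
    return (math.cos(angle), math.sin(angle))


def integer_distance(a, b):
    """Distance when the category label itself is used as a number."""
    return abs(a - b)


def circle_distance(a, b):
    """Euclidean distance between the two-column encodings of a and b."""
    (ax, ay), (bx, by) = circle_point(a), circle_point(b)
    return math.hypot(ax - bx, ay - by)


for a, b in [(1, 2), (1, 29), (1, 56)]:
    print(a, b, integer_distance(a, b), round(circle_distance(a, b), 6))
```

With the integer codes, categories 1 and 56 are 55 apart. On the circle they are neighbours, exactly as close as 1 and 2. Distances on the circle still differ between pairs, reaching 2 for opposite points such as 1 and 29 under this ordering.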
