4

How do we use one-hot encoding when a categorical variable can take a large number of values?

In my case the variable takes 56 values. With the usual method I would have to add 56 columns (56 binary features) to the training dataset, which would greatly increase the complexity and hence the training time.
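
To make the cost concrete, here is a minimal sketch of the usual one-hot method in plain Python. The labels 1..56 and the helper name `one_hot` are assumptions for illustration only, not part of my actual dataset.

```python
def one_hot(level, levels):
    """Return a list with a single 1 at the position of `level` in `levels`."""
    return [1 if level == candidate else 0 for candidate in levels]

# Hypothetical labels: the 56 levels are simply numbered 1..56.
levels = list(range(1, 57))

# One row of the encoded dataset: 56 binary features for a single variable.
row = one_hot(3, levels)
print(len(row), sum(row))
```

Every observation gets one such row, so this single variable contributes 56 columns, exactly one of which is nonzero.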

So how do we deal with such cases?

kjetil b halvorsen
mach

1 Answer

1

It should be possible to get away with two columns per feature. For a categorical variable with $n$ levels, take the $n$ complex $n$-th roots of $1$ and assign one root to each level, using its real and imaginary parts as the two coordinates.

So your encoding will look like this:

+----------+-----------+-----------+
| Category | X         | Y         |
+----------+-----------+-----------+
| 1        | -1.000000 | 0.000000  |
| 2        | -0.993712 | 0.111964  |
| 3        | -0.993712 | -0.111964 |
| 4        | -0.974928 | 0.222521  |
| 5        | -0.974928 | -0.222521 |
| 6        | -0.943883 | 0.330279  |
| 7        | -0.943883 | -0.330279 |
| 8        | -0.900969 | 0.433884  |
| 9        | -0.900969 | -0.433884 |
| 10       | -0.846724 | 0.532032  |
| 11       | -0.846724 | -0.532032 |
| 12       | -0.781831 | 0.623490  |
| 13       | -0.781831 | -0.623490 |
| 14       | -0.707107 | 0.707107  |
| 15       | -0.707107 | -0.707107 |
| 16       | -0.623490 | 0.781831  |
| 17       | -0.623490 | -0.781831 |
| 18       | -0.532032 | 0.846724  |
| 19       | -0.532032 | -0.846724 |
| 20       | -0.433884 | 0.900969  |
| 21       | -0.433884 | -0.900969 |
| 22       | -0.330279 | 0.943883  |
| 23       | -0.330279 | -0.943883 |
| 24       | -0.222521 | 0.974928  |
| 25       | -0.222521 | -0.974928 |
| 26       | -0.111964 | 0.993712  |
| 27       | -0.111964 | -0.993712 |
| 28       | 0.000000  | 1.000000  |
| 29       | 0.000000  | -1.000000 |
| 30       | 0.111964  | 0.993712  |
| 31       | 0.111964  | -0.993712 |
| 32       | 0.222521  | 0.974928  |
| 33       | 0.222521  | -0.974928 |
| 34       | 0.330279  | 0.943883  |
| 35       | 0.330279  | -0.943883 |
| 36       | 0.433884  | 0.900969  |
| 37       | 0.433884  | -0.900969 |
| 38       | 0.532032  | 0.846724  |
| 39       | 0.532032  | -0.846724 |
| 40       | 0.623490  | 0.781831  |
| 41       | 0.623490  | -0.781831 |
| 42       | 0.707107  | 0.707107  |
| 43       | 0.707107  | -0.707107 |
| 44       | 0.781831  | 0.623490  |
| 45       | 0.781831  | -0.623490 |
| 46       | 1.000000  | 0.000000  |
| 47       | 0.993712  | 0.111964  |
| 48       | 0.993712  | -0.111964 |
| 49       | 0.974928  | 0.222521  |
| 50       | 0.974928  | -0.222521 |
| 51       | 0.943883  | 0.330279  |
| 52       | 0.943883  | -0.330279 |
| 53       | 0.900969  | 0.433884  |
| 54       | 0.900969  | -0.433884 |
| 55       | 0.846724  | 0.532032  |
| 56       | 0.846724  | -0.532032 |
+----------+-----------+-----------+
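
A table like this can be generated rather than typed by hand. The following is a minimal sketch, assuming the levels are numbered 1..n and that level k is placed at angle 2π(k-1)/n; the function name `roots_of_unity_encoding` is made up for illustration. The table above assigns the same 56 points to the categories in a different order, which does not matter as long as the assignment is fixed.

```python
import math

def roots_of_unity_encoding(n):
    """Map each level k in 1..n to the (x, y) coordinates of an n-th root of unity.

    Assumption: level k is placed at angle 2*pi*(k-1)/n, so level 1 is (1, 0).
    """
    encoding = {}
    for k in range(1, n + 1):
        angle = 2 * math.pi * (k - 1) / n
        encoding[k] = (math.cos(angle), math.sin(angle))
    return encoding

encoding = roots_of_unity_encoding(56)

# Look up the two feature values for one level.
x, y = encoding[2]
print(f"{x:.6f} {y:.6f}")
```

Each categorical value is then replaced by its `(x, y)` pair, so the variable occupies two columns instead of 56.
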
rightskewed