
Given a dataset X (all features are categorical) and a corresponding label y, I need to find some meaningful representations of these categorical features. Are there any methods for that? I assume they should be based on some existing learning algorithms.

fractile
  • `I need to find some meaningful representations of these categorical features` That sounds to me too general/vague. – ttnphns Jan 10 '17 at 21:11

1 Answer


There are several approaches. Here are some to give you an idea:

Labels

Just enumerate the occurring classes and assign each a unique value, e.g. if you have two classes, cat and dog, assign 0 to cat and 1 to dog.
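
A minimal sketch of this in plain Python (the toy class values are just for illustration):

# Map each unique class to an integer, e.g. cat -> 0, dog -> 1.
values = ['cat', 'dog', 'dog', 'cat']
mapping = {v: i for i, v in enumerate(sorted(set(values)))}
encoded = [mapping[v] for v in values]
# mapping == {'cat': 0, 'dog': 1}, encoded == [0, 1, 1, 0]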

One-Hot-Vector

One approach to this is called the One-Hot-Vector. The idea is to find the unique values in your data and encode each one as a vector containing a 1 at that value's slot and 0 everywhere else. Let's say you have a column containing color data such as

['red', 'green', 'yellow', 'red', 'green']

for your given samples.

The one-hot vectors for the three unique values would be as follows:

red green yellow
----------------
1    0    0
0    1    0
0    0    1

The reason behind this is that simply using numerical values would imply some ordinal relation between the values (e.g. that yellow is somehow "greater" than red). One-Hot-Vectors avoid this.

So if your data was

point color label
----------------
x_1 'green' 'cat'
x_2 'red' 'dog'
x_3 'yellow' 'cat'

it becomes (with color1 = red, color2 = green, color3 = yellow, and the label cat encoded as 0, dog as 1)

point color1 color2 color3 label
--------------------------------
x_1   0      1      0      0
x_2   1      0      0      1
x_3   0      0      1      0

which can then be applied to any learning algorithm that expects numeric data points.
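
A minimal sketch of the encoding in plain Python, mirroring the tables above (in practice a library routine such as sklearn.preprocessing.OneHotEncoder does the same thing):

# One-hot encode the color column; the category order matches the table above.
categories = ['red', 'green', 'yellow']
colors = ['green', 'red', 'yellow']                 # colors of x_1, x_2, x_3
one_hot = [[1 if c == cat else 0 for cat in categories] for c in colors]
# one_hot == [[0, 1, 0], [1, 0, 0], [0, 0, 1]]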

Binary Encoding

You can find the set of unique values and encode each one as a binary number. Using the encoded values you can, e.g., run k-means-style clustering with the Hamming distance as the metric.
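
A minimal sketch, assuming the unique values are simply numbered and their indices written in binary (the hamming helper and toy values are just for illustration):

# Number the unique values and write each index in binary.
values = ['red', 'green', 'yellow', 'blue']
width = max(1, (len(values) - 1).bit_length())      # bits needed for the largest index
codes = {v: format(i, '0%db' % width) for i, v in enumerate(values)}
# codes == {'red': '00', 'green': '01', 'yellow': '10', 'blue': '11'}

def hamming(a, b):
    # Number of bit positions in which two codes differ.
    return sum(x != y for x, y in zip(a, b))

hamming(codes['red'], codes['blue'])                # 2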

Bag of Words

Find the unique values and form one bag per data point. Order is disregarded, but counts are kept.
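
A minimal sketch in plain Python:

from collections import Counter

# One bag per data point: order is dropped, counts are kept.
data_point = ['red', 'green', 'red', 'yellow']
bag = Counter(data_point)
# bag == Counter({'red': 2, 'green': 1, 'yellow': 1})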

Decision Trees

Can handle categorical data themselves.
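
For example, some tree libraries split on categorical columns directly; a minimal sketch assuming LightGBM with pandas category columns (the toy data is made up):

import pandas as pd
import lightgbm as lgb

# Columns with pandas' category dtype are treated as categorical splits by LightGBM.
X = pd.DataFrame({'color': pd.Categorical(['green', 'red', 'yellow', 'red']),
                  'animal': pd.Categorical(['cat', 'dog', 'cat', 'dog'])})
y = [0, 1, 0, 1]
model = lgb.LGBMClassifier(min_child_samples=1).fit(X, y)   # no manual encoding needed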

Association Rules

Some models work with rules, e.g.

if a customer bought beer, he will most likely also buy chips
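
A toy illustration of the support and confidence behind such a rule (the transactions are made up):

# Support and confidence of the rule {beer} -> {chips}.
transactions = [{'beer', 'chips'}, {'beer'}, {'chips', 'soda'}, {'beer', 'chips'}]
with_beer = [t for t in transactions if 'beer' in t]
with_both = [t for t in with_beer if 'chips' in t]
support = len(with_both) / len(transactions)        # 2/4 = 0.5
confidence = len(with_both) / len(with_beer)        # 2/3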

Histograms

Find the unique values per data point and count their occurrences. The resulting histogram can then be used as a feature vector itself.
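
A minimal sketch in plain Python (the data points are made-up lists of categorical values):

# Turn each data point into a fixed-length count vector over a shared vocabulary.
data = [['red', 'red', 'green'], ['yellow'], ['green', 'yellow', 'yellow']]
vocab = sorted({v for point in data for v in point})   # ['green', 'red', 'yellow']
histograms = [[point.count(v) for v in vocab] for point in data]
# histograms == [[1, 2, 0], [0, 0, 1], [1, 0, 2]]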

hh32
  • Thanks for that! Are there any methods that can also use the ground truth? My task is not clustering, but rather metric learning. – fractile Nov 02 '16 at 15:59
  • Yes, sure, e.g. add the one-hot vector to your data points and use the resulting data for metric learning. It's just a feature vector you can use for any learning algorithm. – hh32 Nov 02 '16 at 16:02
  • Ok, got it. One more thing: is there some kind of embedding method that, based on the ground truth and the categorical data, tries to position the data in some other numerical space so that in the end it makes sense (points of the same class are close together)? – fractile Nov 02 '16 at 16:07
  • Yes. What you're describing usually has two stages: 1. find an embedding, which can e.g. be done with an embedding layer in an ANN. The goal is to embed the data into a numeric space, see http://stats.stackexchange.com/questions/182775/what-is-an-embedding-layer-in-a-neural-network; the Keras package (and many others) supports this. The second stage is then metric learning, e.g. Kernel Density Metric Learning. – hh32 Nov 02 '16 at 16:10
  • Why should I use metric learning afterwards? If I have my categorical vectors embedded, I can directly have a notion of distance between them, no? – fractile Nov 02 '16 at 17:56
  • Elsewhere in statistics, "one hot vector" is called a "dummy variable expansion" or just "making dummies". They are "dummies" because they can't do anything but be zero or one. – generic_user Jan 10 '17 at 19:05