Encoding categorical features to numbers for machine learning

Question

Many machine learning algorithms, for example neural networks, expect to deal with numbers. So, when you have categorical data, you need to convert it. By categorical I mean, for example:

Car brands: Audi, BMW, Chevrolet...
User IDs: 1, 25, 26, 28...

Even though user ids are numbers, they are just labels and do not mean anything in terms of continuity, unlike age or a sum of money.

So, the basic approach seems to be to use binary vectors to encode categories:

Audi: 1, 0, 0...
BMW: 0, 1, 0...
Chevrolet: 0, 0, 1...
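
To make the binary-vector idea concrete, here is a minimal sketch in plain Python. It assumes the full set of categories is known up front; the function name `one_hot` and the alphabetical column order are illustrative choices, not part of any particular library.

```python
def one_hot(categories):
    """Map each distinct category to a binary indicator vector."""
    # Sort the distinct levels so the column order is deterministic.
    levels = sorted(set(categories))
    index = {level: i for i, level in enumerate(levels)}
    vectors = {}
    for level in levels:
        vec = [0] * len(levels)   # one column per distinct level
        vec[index[level]] = 1     # switch on the column for this level
        vectors[level] = vec
    return vectors


brands = ["Audi", "BMW", "Chevrolet", "BMW"]
encoding = one_hot(brands)
for brand in brands:
    print(brand, encoding[brand])
```

Each vector has as many entries as there are distinct categories, which is exactly why the approach becomes costly for high-cardinality features.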

It's OK when there are few categories, but beyond that it looks a bit inefficient. For example, when you have 10 000 user ids to encode, it's 10 000 features.

The question is, is there a better way? Maybe one involving probabilities?

Why would you want to include a user ID in a predictive model? As for other categorical variables with cardinality larger than you wish when you use dummy variable coding as you describe, I first run them through a decision tree as the only predictor - in order to collapse the levels. Can also re-bin by grouping "rare" levels etc. — B_Miner, Jan 26 '12 at 16:07
This sounds interesting--like random effects in a statistical model where you are interested in effects particular to a specific individual. I can imagine situations where it would be useful, for example if you see the same individuals again and again and would like to predict what that particular individual will do. Please do share more about your plans if you can. Also, you might look at multilevel modeling, although that's more traditionally used in inferential settings rather than machine learning. — Anne Z., Jan 27 '12 at 02:55
I remember reading about an ML contest where some smart researchers detected that the user ids in the data had been assigned at the time of user account creation. Hence the timestamps, which had been obfuscated, were revealed (positively influencing the prediction of the response). Besides such cases and those mentioned by Anne (recommender systems), I wouldn't include the user ID. – mlwida, Jan 27 '12 at 08:16
Anne - Isn't a random effects model actually NOT interested in the individuals - thus they are considered a sample from a population? — B_Miner, Jan 28 '12 at 01:54
I don't understand, if the learning problem is to predict the binary category wealthy/not wealthy, why wouldn't it make sense to have a feature for the brand of car of a particular user. User IDs could be used if the social network of the individual is known: e.g. to demonstrate that friends of user X are more prone to be wealthy. Is there anything wrong with this line of thought? — Vladtn, Mar 28 '12 at 11:59
"Why would you want to include a user ID in a predictive model?" Repeated measures analysis? — conjectures, May 22 '14 at 19:39
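
The re-binning of "rare" levels suggested in the first comment can be sketched as follows. This is only an illustration under stated assumptions: the threshold `min_count`, the label `"OTHER"`, and the sample data are made up, and the decision-tree variant of collapsing levels is not shown.

```python
from collections import Counter


def collapse_rare_levels(values, min_count=2, other_label="OTHER"):
    """Replace every level seen fewer than min_count times with other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


# Hypothetical sample: two brands occur only once.
brands = ["Audi", "BMW", "Audi", "Chevrolet", "BMW", "Lada"]
collapsed = collapse_rare_levels(brands, min_count=2)
print(collapsed)
```

After collapsing, the dummy coding needs one column per frequent level plus a single column for the pooled rare ones.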

score 6 · Answer 1 · answered Jan 26 '12 at 16:08

You can always treat your user ids as a bag of words: most text classifiers can deal with hundreds of thousands of dimensions when the data is sparse (many zeros that you do not need to store explicitly in memory, for instance if you use a Compressed Sparse Row representation for your data matrix).
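
As a rough illustration of why sparsity helps, the sketch below builds the three arrays of a Compressed Sparse Row matrix by hand for one-hot encoded user ids. It is plain Python written for clarity, not the API of any library; the function name `user_ids_to_csr` is hypothetical, and in practice a sparse-matrix library would do this for you.

```python
def user_ids_to_csr(user_ids):
    """One-hot encode user ids as CSR arrays: (data, indices, indptr, n_cols)."""
    # Assign each distinct id a column, in sorted order.
    columns = {uid: j for j, uid in enumerate(sorted(set(user_ids)))}
    data, indices, indptr = [], [], [0]
    for uid in user_ids:
        data.append(1)                # the single non-zero value of this row
        indices.append(columns[uid])  # the column that value sits in
        indptr.append(len(data))      # where the next row starts in data
    return data, indices, indptr, len(columns)


rows = [25, 1, 28, 25]  # one user id per sample
data, indices, indptr, n_cols = user_ids_to_csr(rows)
print(data, indices, indptr, n_cols)
```

Only one value and one column index are stored per sample, so memory grows with the number of rows rather than with rows times distinct ids.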

However, the question is: does it make sense w.r.t. your specific problem to treat user ids as features? Wouldn't it make more sense to denormalize your relational data and use user features (age, location, char n-grams of the online nickname, transaction history...) instead of their ids?

You could also cluster your raw user vectors and use the ids of the top N closest centers as activated features instead of the user ids.
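
A minimal sketch of that last idea, assuming the cluster centers have already been computed by some clustering step that is not shown here. The helper names, the Euclidean distance, and the toy coordinates are all assumptions made for illustration.

```python
import math


def nearest_center_ids(user_vector, centers, top_n=2):
    """Return the ids of the top_n centers closest to user_vector."""
    distances = [(math.dist(user_vector, c), cid) for cid, c in enumerate(centers)]
    distances.sort()  # closest first
    return [cid for _, cid in distances[:top_n]]


def activate(center_ids, n_centers):
    """Binary feature vector with a 1 for every selected center."""
    features = [0] * n_centers
    for cid in center_ids:
        features[cid] = 1
    return features


# Hypothetical, precomputed cluster centers and one raw user vector.
centers = [(0.0, 0.0), (10.0, 10.0), (1.0, 1.0), (-5.0, 4.0)]
user = (0.9, 1.2)

closest = nearest_center_ids(user, centers, top_n=2)
features = activate(closest, len(centers))
print(closest, features)
```

The feature vector then has one column per cluster instead of one per user, so its width is set by the number of clusters you choose.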

answered Jan 26 '12 at 16:08 by ogrisel

OK, while this is more of a general question, I see most of you concentrated on the issue of user ids, so here's why I would want to use them. Let's look at one of Kaggle's competitions, about Grockit: http://www.kaggle.com/c/WhatDoYouKnow . The goal is to predict whether a user will answer a question correctly. In my opinion the problem is similar to recommender systems: you just get questions instead of movies and correct/incorrect instead of ratings, plus some other data. Timestamps are available :) – Nucular Jan 29 '12 at 13:51
In that case you can assume that the users are independent and build one classifier per user, trained only on that user's own history. – ogrisel Jan 31 '12 at 08:40
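
To illustrate the "one model per user" idea, here is a deliberately trivial sketch: each user's model is just a smoothed fraction of past correct answers, thresholded at 0.5. A real classifier would use question features as well; the function names, the smoothing prior, and the sample history are all hypothetical.

```python
from collections import defaultdict


def fit_per_user(history, prior_correct=1, prior_total=2):
    """Fit one smoothed correct-answer rate per user from (user_id, correct) pairs."""
    counts = defaultdict(lambda: [0, 0])  # user_id -> [n_correct, n_total]
    for user_id, correct in history:
        counts[user_id][0] += int(correct)
        counts[user_id][1] += 1
    # Additive smoothing keeps users with short histories away from 0 and 1.
    return {
        user_id: (n_correct + prior_correct) / (n_total + prior_total)
        for user_id, (n_correct, n_total) in counts.items()
    }


def predict(models, user_id, default=0.5):
    """Predict 'will answer correctly' from that user's own model only."""
    return models.get(user_id, default) >= 0.5


# Made-up answer history: (user id, answered correctly?)
history = [(1, True), (1, True), (1, False), (25, False), (25, False)]
models = fit_per_user(history)
print(models, predict(models, 1), predict(models, 25))
```

Because every user has a separate model, the user id never appears as a feature; it only selects which model to apply.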

score 1 · Answer 2 · answered May 22 '14 at 19:07

Equilateral encoding is probably what you are looking for when trying to encode classes for a neural network. It tends to work better than the "1 of n" encoding referenced in other posts. For reference, may I suggest: http://www.heatonresearch.com/wiki/Equilateral
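
For readers who cannot follow the link, the sketch below shows one common way to construct such a code: n classes are placed on the vertices of a regular simplex in n - 1 dimensions, so every pair of classes is the same distance apart. This is a from-scratch illustration, not the implementation from the linked page; the function name `equilateral` is hypothetical, and no rescaling to a particular activation range is applied.

```python
import math


def equilateral(n):
    """Return n code vectors of length n - 1, all pairwise equidistant."""
    if n < 2:
        raise ValueError("need at least two classes")
    result = [[0.0] * (n - 1) for _ in range(n)]
    # Two classes: the end points of a segment.
    result[0][0] = -1.0
    result[1][0] = 1.0
    for k in range(2, n):
        # Shrink the existing k points so they stay on the unit sphere
        # after being pushed down along the new axis.
        f = math.sqrt(k * k - 1.0) / k
        for i in range(k):
            for j in range(k - 1):
                result[i][j] *= f
        for i in range(k):
            result[i][k - 1] = -1.0 / k
        # The new point sits on the new axis; its other coordinates stay 0.
        result[k][k - 1] = 1.0
    return result


codes = equilateral(4)
dists = [math.dist(codes[i], codes[j]) for i in range(4) for j in range(i + 1, 4)]
print(codes)
print(dists)
```

Compared with "1 of n", the code uses one dimension fewer while keeping all classes equally far apart.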

answered May 22 '14 at 19:07 by S Pike

This appears to be about encoding output values, not categorical encoding of input values, which is what the OP is asking about. – Alex Sep 22 '15 at 10:54
