Many machine learning algorithms, for example neural networks, expect to deal with numbers. So, when you have categorical data, you need to convert it. By categorical I mean, for example:
Car brands: Audi, BMW, Chevrolet...
User IDs: 1, 25, 26, 28...
Even though user IDs are numbers, they are just labels and do not mean anything in terms of continuity, unlike age or a sum of money.
So, the basic approach seems to be to use binary vectors to encode the categories:
Audi: 1, 0, 0...
BMW: 0, 1, 0...
Chevrolet: 0, 0, 1...
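To make the binary-vector idea concrete, here is a minimal sketch in plain Python. The helper name `one_hot_encode` and the sample list `brands` are made up for illustration. Sorting the categories to fix the column order is my own assumption; any stable ordering would do.

```python
def one_hot_encode(values):
    """Map each value to a binary vector with a single 1 (hypothetical helper)."""
    # Assumption: sort the distinct categories so the column order is deterministic.
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}

    vectors = []
    for value in values:
        vector = [0] * len(categories)  # one position per distinct category
        vector[index[value]] = 1        # switch on the position of this category
        vectors.append(vector)
    return categories, vectors


brands = ["Audi", "BMW", "Chevrolet", "BMW"]  # illustrative data
categories, vectors = one_hot_encode(brands)
for brand, vector in zip(brands, vectors):
    print(brand, vector)
# Audi [1, 0, 0]
# BMW [0, 1, 0]
# Chevrolet [0, 0, 1]
# BMW [0, 1, 0]
```

Each vector is as long as the number of distinct categories, which is what the concern about many categories comes down to.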
It's OK when there are only a few categories, but with many it looks inefficient. For example, when you have 10 000 user IDs to encode, that's 10 000 features, and each vector contains a single 1 among 9 999 zeros.
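A rough back-of-the-envelope sketch of that cost, assuming one row per user and a dense (non-sparse) representation. The helper names and the consecutive IDs are hypothetical, chosen only to get 10 000 distinct values.

```python
def one_hot_width(ids):
    """Number of features a binary encoding needs: one per distinct ID."""
    return len(set(ids))


def dense_cells(ids):
    """Total cells in a dense rows-by-features matrix (assumes one row per entry)."""
    return len(ids) * one_hot_width(ids)


user_ids = list(range(1, 10_001))  # hypothetical: 10 000 distinct user IDs

width = one_hot_width(user_ids)    # features per row
cells = dense_cells(user_ids)      # cells stored if nothing is compressed
nonzero = len(user_ids)            # exactly one 1 per row

print(width, cells, nonzero)
# 10000 100000000 10000
```

So under these assumptions only 1 cell in 10 000 carries information.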
The question is, is there a better way? Maybe one involving probabilities?