I have read in many places that decision trees and random forests, if deep enough, can handle categorical variables without one-hot encoding.
1) What is special about these algorithms that they can handle categorical variables without one-hot encoding?
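To make question 1 concrete, here is a minimal pure-Python sketch (with hypothetical data and column names) of the kind of split a tree can, in principle, make directly on a raw categorical value — the split test is just an equality check on the category, with no numeric encoding involved:

```python
# Hypothetical toy dataset: a categorical feature and a binary label.
rows = [
    {"color": "red",   "label": 1},
    {"color": "blue",  "label": 0},
    {"color": "green", "label": 0},
    {"color": "red",   "label": 1},
]

def split_on_category(rows, column, category):
    """Partition rows by whether row[column] equals the given category.

    This mimics a single tree node whose test is 'color == red',
    applied directly to the string values.
    """
    left = [r for r in rows if r[column] == category]
    right = [r for r in rows if r[column] != category]
    return left, right

left, right = split_on_category(rows, "color", "red")
print([r["label"] for r in left])
print([r["label"] for r in right])
```

Whether a given library actually performs such splits natively varies by implementation; this is only a sketch of the idea behind the claim.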
2) Are there specific algorithms for which we use dummy encoding (n-1 columns created for a categorical variable) versus one-hot encoding (n columns created)? Since one of the n columns in one-hot encoding carries information that can be recovered from the other columns, why would we ever prefer one-hot encoding, and why does the concept even exist?
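For reference, the column-count difference between the two encodings can be seen with pandas (using a hypothetical toy column; `pd.get_dummies` with `drop_first=True` produces the n-1 dummy coding):

```python
import pandas as pd

# Hypothetical categorical column with n = 3 distinct values.
colors = pd.Series(["red", "blue", "green", "red"], name="color")

one_hot = pd.get_dummies(colors)                   # n columns
dummy = pd.get_dummies(colors, drop_first=True)    # n - 1 columns

print(list(one_hot.columns))
print(list(dummy.columns))  # the dropped category becomes the baseline
```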
3) Why does it say "if deep enough"?
Any helpful resources, links, videos, or your own explanations are welcome. I want to clear up this doubt once and for all.
I found a similar question, but it doesn't answer everything I want to know: https://datascience.stackexchange.com/questions/5226/strings-as-features-in-decision-tree-random-forest/19829#19829
AN6U5's answer there says that Random Forest does not require one-hot encoding, and I have read many other answers saying the same.