One hot encode nominal categorical variables for random forest?

Question

I looked for this before but I couldn't find it exactly, so let me know if it's a duplicate.

My question is: should categorical variables be one-hot encoded before fitting a random forest, or is it fine to just transform them to nominal? Note that there are some continuous variables in my model as well, if that makes any difference. Also, I'm using Python's sklearn.

I did search for this, but everywhere I looked there was a different answer. I also read this article, which is quite informative, but I wanted to know whether there is any consensus, or whether a trustworthy reference can be found.

Edit: I decided to quickly try the one-hot encoding implementation, and the first thing I noticed is that it takes considerably longer to run.
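To make the two options concrete, here is a minimal sketch in plain Python of what each encoding produces. The data and the names `colors` and `sizes` are made up for illustration; in sklearn the corresponding transformers would be `OrdinalEncoder` and `OneHotEncoder`, but this sketch does not use them.

```python
# Toy data (hypothetical): one nominal column and one continuous column.
colors = ["red", "green", "blue", "green", "red"]
sizes = [1.5, 2.0, 0.7, 3.1, 2.2]

categories = sorted(set(colors))  # ['blue', 'green', 'red']

# Integer coding: a single column, but the codes imply an arbitrary order.
code_of = {cat: i for i, cat in enumerate(categories)}
X_int = [[code_of[c], s] for c, s in zip(colors, sizes)]

# One-hot coding: one 0/1 column per category, no order implied.
X_onehot = [
    [int(c == cat) for cat in categories] + [s]
    for c, s in zip(colors, sizes)
]

# Number of feature columns under each encoding.
print(len(X_int[0]), len(X_onehot[0]))
```

With three categories the integer-coded matrix has 2 columns and the one-hot matrix has 4. The one-hot width grows with the number of categories, which is one plausible reason the one-hot run takes longer.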

This might depend on the software you use. There are some relevant questions here: https://stats.stackexchange.com/questions/410939/label-encoding-vs-dummy-variable-one-hot-encoding-correctness/414729#414729 https://stats.stackexchange.com/questions/359861/how-to-deal-with-nominal-categorical-with-label-encoding/414715#414715 https://stats.stackexchange.com/questions/390671/random-forest-regression-with-sparse-data-in-python/430127#430127 — kjetil b halvorsen, Oct 06 '20 at 18:05
I don't know about python or sklearn, but I think you can find useful info in the links I posted ... — kjetil b halvorsen, Oct 06 '20 at 18:13
I don't really understand the distinction you make between a nominal and a one-hot encoded variable. You need a design matrix at some point. If your statistical package doesn't account for the type of variable you use as an input (for example, you're using a numerically coded variable in a regression model while the variable is a categorical one, ordered or not), then you'll likely get wrong results. As to your original question, Gael Varoquaux has some papers/talks on one-hot encoding. A lot of categories, which means a lot of variables once dummy-coded, can certainly be a source of problems, though. — chl, Oct 07 '20 at 19:35

0 Answers