Treating Categorical Variables as Continuous for Random Forest / Adaboost

Asked Dec 03 '15 at 19:08

Active Aug 03 '19 at 18:59

Viewed 321 times

What's the correct way to deal with categorical variables in packages like sklearn's RF and xgboost?

Are there any downsides to treating the variables as continuous? E.g., encoding class A as 1, class B as 2, class C as 3?
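To make the concern concrete, here is a minimal sketch of which groupings a single tree split of the form `code <= t` can produce under that integer coding. The `codes` mapping and the helper `threshold_partitions` are hypothetical names used only for illustration; they are not part of sklearn or xgboost.

```python
# Hypothetical integer coding from the question: A -> 1, B -> 2, C -> 3.
codes = {"A": 1, "B": 2, "C": 3}


def threshold_partitions(codes):
    """Return the left-hand groups reachable by one split `code <= t`."""
    values = sorted(set(codes.values()))
    partitions = set()
    # Candidate thresholds sit halfway between consecutive distinct codes.
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = frozenset(k for k, v in codes.items() if v <= t)
        partitions.add(left)
    return partitions


reachable = threshold_partitions(codes)
for left in sorted(reachable, key=len):
    right = set(codes) - left
    print(sorted(left), "vs", sorted(right))

# Separating B from {A, C} would need {B} or {A, C} on the left-hand side.
b_isolated = frozenset({"B"}) in reachable or frozenset({"A", "C"}) in reachable
print("B separable in one split:", b_isolated)
```

Under this coding a single split can only give {A} vs {B, C} or {A, B} vs {C}, because the arbitrary order A < B < C is baked into the feature. A deeper tree can still isolate B with two splits, so the effect is extra depth rather than an outright impossibility.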

jxieeducation

To do so would normally be a serious error! – kjetil b halvorsen Dec 03 '15 at 19:46
Makes sense. So would the correct way be to (1) convert the data to a categorical type and (2) make it sparse? – jxieeducation Dec 03 '15 at 20:47
Possible duplicate of [Principled way of collapsing categorical variables with many levels?](https://stats.stackexchange.com/questions/146907/principled-way-of-collapsing-categorical-variables-with-many-levels) – kjetil b halvorsen Mar 16 '19 at 21:59
You could get some ideas from: https://stats.stackexchange.com/questions/390671/random-forest-regression-with-sparse-data-in-python/430127#430127, https://stats.stackexchange.com/questions/410939/label-encoding-vs-dummy-variable-one-hot-encoding-correctness/414729#414729, https://stats.stackexchange.com/questions/231285/dropping-one-of-the-columns-when-using-one-hot-encoding/329281#329281 – kjetil b halvorsen Jan 25 '20 at 20:22
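
The "make it categorical, then make it sparse" idea from the comments can be sketched in plain Python as a one-hot encoding that stores only the non-zero cells. The function names `one_hot_sparse` and `to_dense` are hypothetical and chosen for illustration; in practice a library encoder such as sklearn's `OneHotEncoder` (which returns a sparse matrix by default) would likely be used instead.

```python
def one_hot_sparse(values):
    """One-hot encode a list of labels, keeping only the non-zero cells."""
    levels = sorted(set(values))  # column order: sorted unique labels
    col = {lvl: j for j, lvl in enumerate(levels)}
    # Each row has exactly one non-zero entry, so (row, column) pairs suffice.
    nonzero = [(i, col[v]) for i, v in enumerate(values)]
    return levels, nonzero


def to_dense(nonzero, n_rows, n_cols):
    """Expand the (row, column) pairs into a dense 0/1 matrix for inspection."""
    dense = [[0] * n_cols for _ in range(n_rows)]
    for i, j in nonzero:
        dense[i][j] = 1
    return dense


labels = ["A", "C", "B", "A"]
levels, nonzero = one_hot_sparse(labels)
dense = to_dense(nonzero, len(labels), len(levels))
print(levels)
print(nonzero)
print(dense)
```

With one indicator column per level, a single split on the column for B separates B from the other levels directly, at the cost of one extra column per level.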

0 Answers