
What's the correct way to deal with categorical variables in packages like sklearn's RF and xgboost?

Are there any downsides to treating the variables as continuous, e.g. encoding class A as 1, class B as 2, class C as 3?
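To make the concern concrete, here is a minimal sketch of what the 1/2/3 coding implies for a tree-style split. It assumes a split of the form `x <= t` on the integer code, compared with a split on a single 0/1 indicator column. The helpers `threshold_groups` and `indicator_groups` are hypothetical names used only for illustration and are not part of sklearn or xgboost.

```python
# Hypothetical integer coding from the question: A -> 1, B -> 2, C -> 3.
codes = {"A": 1, "B": 2, "C": 3}


def threshold_groups(codes):
    """Left-hand groups that a single numeric split `x <= t` can produce."""
    ordered = sorted(codes, key=codes.get)  # levels sorted by their integer code
    return [set(ordered[:i]) for i in range(1, len(ordered))]


def indicator_groups(codes):
    """Groups that a single split on one one-hot (0/1) column can pick out."""
    return [{level} for level in codes]


int_splits = threshold_groups(codes)
onehot_splits = indicator_groups(codes)

# With integer codes, one split can only cut the imposed order A < B < C,
# so B can never be separated from both A and C in a single split.
print({"B"} in int_splits)     # False
print({"B"} in onehot_splits)  # True
```

A deeper tree can still isolate B with two splits, so the integer coding is not always fatal for tree models, but the artificial order it introduces is what the question is asking about.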

  • To do so would normally be a serious error! – kjetil b halvorsen Dec 03 '15 at 19:46
  • Makes sense. So would the correct way be: 1. convert the data to a categorical type, 2. make it sparse? – jxieeducation Dec 03 '15 at 20:47
  • Possible duplicate of [Principled way of collapsing categorical variables with many levels?](https://stats.stackexchange.com/questions/146907/principled-way-of-collapsing-categorical-variables-with-many-levels) – kjetil b halvorsen Mar 16 '19 at 21:59
  • You could get some ideas from: https://stats.stackexchange.com/questions/390671/random-forest-regression-with-sparse-data-in-python/430127#430127, https://stats.stackexchange.com/questions/410939/label-encoding-vs-dummy-variable-one-hot-encoding-correctness/414729#414729, https://stats.stackexchange.com/questions/231285/dropping-one-of-the-columns-when-using-one-hot-encoding/329281#329281 – kjetil b halvorsen Jan 25 '20 at 20:22
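
The two steps proposed in the comments (treat the column as categorical, then make it sparse) could be sketched roughly as follows for sklearn's random forest. This is only an illustration on made-up toy data: the `colors` array and the target `y` are assumptions, and one-hot encoding is one common reading of "make it sparse", not necessarily the best choice for every data set.

```python
import numpy as np
from scipy import sparse
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical): a single categorical feature with three levels.
colors = np.array(["red", "green", "blue"] * 10).reshape(-1, 1)
y = (colors.ravel() == "green").astype(int)  # target: is the level "green"?

# One-hot encode the levels; OneHotEncoder returns a SciPy sparse matrix by default.
encoder = OneHotEncoder(handle_unknown="ignore")
X = encoder.fit_transform(colors)

# RandomForestClassifier accepts sparse input directly.
forest = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

# New rows must go through the same fitted encoder before prediction.
pred = forest.predict(encoder.transform([["green"], ["red"]]))
print(sparse.issparse(X), X.shape, pred)
```

With many levels, the one-hot matrix gets wide, which is where the linked question on collapsing categorical variables with many levels becomes relevant.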
