
Currently, I have a dataset with 200,000+ data points and 20 features, of which ~10 are categorical. These categorical columns are countries, states, and localities, and the country column alone has more than 150 distinct values, so one-hot encoding them could substantially increase the computational cost. Is there a feasible way to handle this?

I am using SageMaker's built-in XGBoost algorithm. Does it handle categorical data by default, or do I have to convert it into some other form?
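For illustration, one common alternative to one-hot encoding a high-cardinality column is frequency encoding: each category is replaced by its relative frequency in the column, yielding a single numeric feature that tree models like XGBoost can split on. A minimal stdlib-only sketch (the country values are made up):

```python
from collections import Counter

def frequency_encode(values):
    """Replace each category with its relative frequency in the column."""
    counts = Counter(values)
    n = len(values)
    return [counts[v] / n for v in values]

countries = ["US", "IN", "US", "FR", "US", "IN"]
encoded = frequency_encode(countries)
# "US" appears 3 of 6 times, so it maps to 0.5; "FR" maps to 1/6, etc.
```

Note that for a held-out set you would reuse the counts fitted on the training data rather than recomputing them, and ties (two categories with the same frequency) become indistinguishable to the model.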

asked by sahithya; edited by kjetil b halvorsen
  • You might get some good ideas from [this post](https://stats.stackexchange.com/questions/146907/principled-way-of-collapsing-categorical-variables-with-many-levels). To say more, can you tell us more about the context and goal of your model? Which variable do you want to predict? If you use sparse matrix representations, 150 columns of dummies should not be a problem! – kjetil b halvorsen Dec 22 '18 at 22:11

0 Answers