
Currently, I have a dataset with 200,000+ data points and 20 features, of which ~10 are categorical. These categorical columns are countries, states, and localities, and the country column alone has more than 150 distinct values, so one-hot encoding them could substantially increase the computational cost. Is there a feasible way to handle this?

I am using SageMaker's built-in XGBoost algorithm. Does it handle categorical data by default, or do I have to convert it into some other form?
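For illustration, one common alternative to one-hot encoding a high-cardinality column is frequency encoding: each category is replaced by its relative frequency in the column, yielding a single numeric feature that tree models like XGBoost can split on. A minimal stdlib-only sketch (the country values are made up):

```python
from collections import Counter

def frequency_encode(values):
    """Replace each category with its relative frequency in the column."""
    counts = Counter(values)
    n = len(values)
    return [counts[v] / n for v in values]

countries = ["US", "IN", "US", "FR", "US", "IN"]
encoded = frequency_encode(countries)
# "US" appears 3 of 6 times, so it maps to 0.5; "FR" maps to 1/6, etc.
```

Note that for a held-out set you would reuse the counts fitted on the training data rather than recomputing them, and ties (two categories with the same frequency) become indistinguishable to the model.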

asked by sahithya; edited by kjetil b halvorsen
  • You might get some good ideas from [this post](https://stats.stackexchange.com/questions/146907/principled-way-of-collapsing-categorical-variables-with-many-levels). To say more, can you tell us more about the context and goal of your model? Which variable do you want to predict? If you use sparse matrix representations, 150 columns of dummies should not be a problem! – kjetil b halvorsen Dec 22 '18 at 22:11

0 Answers