Encoding categorical variables with lots of categories

Asked Feb 20 '18 at 08:58

Active Feb 20 '18 at 10:26

Viewed 429 times

I recently had to build a model for flight delay predictions. The data has multiple columns of categorical variables and some continuous variables.

For example: Airline, Arriving airport, Temperature, etc.

My first instinct was to one-hot encode the categorical variables, but that resulted in more than 2000 distinct features. So I tried to reduce the dimensionality by

Including only the airports and airlines that appear most often
Running a random forest on the categorical features to extract the important ones

and then fitting a model such as xgboost. The model's performance is OK, but not good enough to ship.
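
To make the first reduction step concrete, here is a minimal sketch in plain Python. It assumes that categories outside the most frequent ones are lumped into a single "OTHER" level rather than dropped; that choice, the helper names (collapse_rare, one_hot) and the airport codes are all made up for illustration and are not taken from the actual data or pipeline.

```python
from collections import Counter

def collapse_rare(values, top_k, other="OTHER"):
    """Keep the top_k most frequent categories; map everything else to `other`."""
    keep = {cat for cat, _ in Counter(values).most_common(top_k)}
    return [v if v in keep else other for v in values]

def one_hot(values):
    """One-hot encode a list of category labels; returns (column_names, rows)."""
    columns = sorted(set(values))
    index = {cat: i for i, cat in enumerate(columns)}
    rows = []
    for v in values:
        row = [0] * len(columns)
        row[index[v]] = 1
        rows.append(row)
    return columns, rows

# Made-up airport codes, for illustration only.
airports = ["JFK", "LAX", "JFK", "ORD", "SFO", "LAX", "JFK", "BOS"]

full_cols, _ = one_hot(airports)
reduced = collapse_rare(airports, top_k=2)
reduced_cols, reduced_rows = one_hot(reduced)

print(len(full_cols), "columns before collapsing")
print(len(reduced_cols), "columns after collapsing:", reduced_cols)
```

The number of one-hot columns is then bounded by top_k + 1 per variable, at the cost of treating all rare airports or airlines as interchangeable.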

I suspect that the way I reduced the dimensions is destroying the underlying relationships between the features. So I wonder: are there dimension reduction techniques that can handle a mixture of continuous and categorical variables? And in general, what are some rules of thumb for situations like this?

edited Feb 20 '18 at 10:26

kjetil b halvorsen


asked Feb 20 '18 at 08:58

Chester Cheng


See https://stats.stackexchange.com/questions/146907/principled-way-of-collapsing-categorical-variables-with-many-categories and search this site for "many categories" – kjetil b halvorsen Feb 20 '18 at 10:28
