I recently had to build a model for flight delay predictions. The data has multiple columns of categorical variables and some continuous variables.

For example: Airline, Arriving airport, Temperature, etc.

My natural instinct was to one-hot encode the categorical variables, but that resulted in more than 2000 distinct features. So I tried to reduce the dimensionality by

  • Keeping only the airports and airlines that appear most often
  • Running a random forest on the categorical features to extract the important ones

and then fitting a model such as xgboost. The model's performance is OK, but not good enough to ship.
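To make the first reduction step concrete, here is a minimal sketch of what I mean by keeping only the most frequent levels. It is not my actual pipeline: the toy airport codes, the `top_k` value, the `"OTHER"` label and both helper names are made up for illustration, and it uses only the Python standard library.

```python
from collections import Counter


def collapse_rare_levels(values, top_k, other_label="OTHER"):
    """Keep the top_k most frequent levels; map every other level to other_label."""
    counts = Counter(values)
    keep = {level for level, _ in counts.most_common(top_k)}
    return [v if v in keep else other_label for v in values]


def one_hot(values):
    """One-hot encode a list of labels.

    Returns the sorted level names and one 0/1 row per input value.
    """
    levels = sorted(set(values))
    index = {level: i for i, level in enumerate(levels)}
    rows = []
    for v in values:
        row = [0] * len(levels)
        row[index[v]] = 1
        rows.append(row)
    return levels, rows


# Hypothetical toy column standing in for "Arriving airport".
airports = ["JFK", "LAX", "JFK", "ORD", "SFO", "LAX", "JFK", "BOS"]

# Keep the 2 most frequent airports, lump the rest together, then encode.
collapsed = collapse_rare_levels(airports, top_k=2)
levels, encoded = one_hot(collapsed)
```

With this scheme the number of one-hot columns per variable is capped at `top_k + 1`, but every rare airport ends up in the same bucket, which is part of why I worry about losing information.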

I suspect the way I reduced the dimensions is destroying the underlying relationships among the features. So I wonder: are there dimension reduction techniques that can handle a mixture of continuous and categorical variables? And in general, what are some rules of thumb for situations like this?

kjetil b halvorsen
Chester Cheng
    See https://stats.stackexchange.com/questions/146907/principled-way-of-collapsing-categorical-variables-with-many-categories and search this site for "many categories" – kjetil b halvorsen Feb 20 '18 at 10:28