I recently had to build a model to predict flight delays. The data has several categorical columns and some continuous ones.
For example: airline, arrival airport, temperature, etc.
My first instinct was to one-hot encode the categorical variables, but that produced more than 2000 distinct features. So I tried to reduce the dimensionality by
- Keeping only the airports and airlines that appear most often
- Running a random forest on the categorical features to pick out the important ones

and then fitting a model such as xgboost on the reduced feature set. The model's performance is OK, but not good enough to ship.
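
To make the first reduction step concrete, here is a minimal sketch of keeping only the most frequent categories and lumping the rest into a single bucket before one-hot encoding. The airport codes are made up for illustration, and the helper names `keep_top_k` and `one_hot` are hypothetical; this is not my actual pipeline, just the idea behind it.

```python
from collections import Counter


def keep_top_k(values, k, other="OTHER"):
    """Keep the k most frequent categories; map everything else to `other`."""
    top = {cat for cat, _ in Counter(values).most_common(k)}
    return [v if v in top else other for v in values]


def one_hot(values):
    """One-hot encode a list of labels; returns (column names, encoded rows)."""
    columns = sorted(set(values))
    index = {cat: i for i, cat in enumerate(columns)}
    rows = []
    for v in values:
        row = [0] * len(columns)
        row[index[v]] = 1
        rows.append(row)
    return columns, rows


# Hypothetical arrival airports (illustrative data only).
airports = ["JFK", "LAX", "JFK", "ORD", "LAX", "JFK", "BOS", "SEA"]

reduced = keep_top_k(airports, k=2)   # rare airports collapse into "OTHER"
columns, rows = one_hot(reduced)      # 3 columns instead of 5
```

With this kind of reduction, every rare airport ends up in the same `OTHER` column, which is exactly where I worry information is being lost.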
I suspect that the way I reduced the dimensions is destroying the underlying relationships between the features. So I wonder: are there dimensionality reduction techniques that can handle a mixture of continuous and categorical variables? And in general, what are some rules of thumb for situations like this?