
How should I approach regression when most of the independent variables are categorical with many (more than 10) levels and the dependent variable is continuous? Would it make sense to use dummy coding here, or is there a better way to handle this situation? I want to predict CPM (cost per mille) from certain categorical independent variables.

kjetil b halvorsen

1 Answer


I can recommend CatBoost if you want an out-of-the-box solution for categorical features.

Here's a link to a great example of target encoding (I really like this one and have used it a few times on similar problems). It can be very useful, but you must be very careful about train/test splits and overfitting when using it.
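To make the overfitting caveat concrete, here is a minimal sketch of smoothed target encoding with out-of-fold encoding for the training rows. It is not taken from the linked example; the function names, the `smoothing` parameter, and the simple index-based fold assignment are all assumptions made for illustration, and it assumes the rows have already been shuffled.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets, smoothing=10.0):
    """Learn a smoothed mean of the target for each category.

    Rare categories are pulled towards the global mean; `smoothing` is a
    hypothetical knob controlling how strongly.
    """
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    mapping = {
        category: (sums[category] + smoothing * global_mean)
        / (counts[category] + smoothing)
        for category in counts
    }
    return mapping, global_mean


def apply_target_encoding(categories, mapping, global_mean):
    """Encode categories; unseen ones fall back to the global mean."""
    return [mapping.get(category, global_mean) for category in categories]


def out_of_fold_encode(categories, targets, n_folds=5, smoothing=10.0):
    """Encode each training row using statistics from the other folds only,
    so a row's own target never leaks into its encoded feature.

    Assumes 2 <= n_folds <= len(categories) and pre-shuffled rows.
    """
    n = len(categories)
    encoded = [0.0] * n
    for fold in range(n_folds):
        held_out = [i for i in range(n) if i % n_folds == fold]
        rest = [i for i in range(n) if i % n_folds != fold]
        mapping, global_mean = fit_target_encoding(
            [categories[i] for i in rest],
            [targets[i] for i in rest],
            smoothing,
        )
        for i in held_out:
            encoded[i] = mapping.get(categories[i], global_mean)
    return encoded
```

A typical use would be to call `out_of_fold_encode` on the training rows, then fit the mapping once on the full training set with `fit_target_encoding` and use `apply_target_encoding` on the test rows, so the test set never contributes to the statistics.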

Also, there's a library for categorical encoding; I have only skimmed the documentation and haven't had a chance to use it so far.

And a random tutorial I've just found.

Although there are models and libraries that can preprocess categorical features for you, I would recommend writing at least part of the pipeline yourself, to understand the concept and to be sure you're not missing anything (e.g. different categories in train and test, which is a very common issue).
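As an illustration of the train/test category issue, here is a small hand-rolled sketch of dummy coding in which the set of levels is learned from the training data only. The helper names are hypothetical, and encoding unseen test categories as an all-zero row is just one possible choice (grouping them into an explicit "other" level is another).

```python
def fit_levels(train_values):
    """Learn the category levels from the training data only."""
    return sorted(set(train_values))


def one_hot(values, levels):
    """Dummy-code values against a fixed list of levels.

    A value that was not seen during training becomes an all-zero row,
    so train and test always end up with the same columns.
    """
    index = {level: j for j, level in enumerate(levels)}
    rows = []
    for value in values:
        row = [0] * len(levels)
        if value in index:
            row[index[value]] = 1
        rows.append(row)
    return rows


def unseen_levels(values, levels):
    """Report categories that appear in `values` but not in the training levels."""
    return sorted(set(values) - set(levels))
```

Checking `unseen_levels` on the test set before predicting makes the mismatch visible instead of letting it silently change the feature matrix.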

i1bgv