
I built a web scraper that pulled in a bunch of data, and I ended up with more qualitative variables than I expected. Originally I had intended to consider just a few quantitative variables, but, from a statistical standpoint, I understand that erroneously reducing the design matrix out of laziness or personal preference would introduce bias.

What I'm facing is numerous columns in my data matrix that have 20 to 40 unique values. I'm wondering what you all would do in such a situation? Do you create the dummy variables and update the design matrix or is there a more efficient way to do this?

Note, these values are not ordinal. For example, one of the columns is a vehicle's 'Front Suspension Type' and another one is 'Rear Suspension Type.'

Thoughts? Please let me know if you need additional info and I appreciate any feedback in advance.

Edit: Additionally, there doesn't seem to be immense variety. Most levels have around 10 entries, while for rear suspension type the most frequent level (multi-link) has around 700.

Nicklovn
  • See if you can find an answer here: https://stats.stackexchange.com/questions/146907/principled-way-of-collapsing-categorical-variables-with-many-levels – kjetil b halvorsen Sep 20 '19 at 13:54

1 Answer


In situations where a categorical feature has high cardinality (which looks like your case), a popular approach, besides the one-hot encoding you are trying to avoid, is to compute target statistics (target mean encoding). The idea is to replace the category value (e.g. a string) with a conditional expectation, computed as $$\mathbb{E}[y \mid x^i = x^i_k]$$ where $x^i$ is your categorical feature and $x^i_k$ is the $k$th unique value that feature takes. This works particularly well for tree-based methods, e.g. boosting, random forests, etc. However, computing these target means on your whole training set can lead to overfitting, so there are many variations of the scheme (typically the statistics are computed out-of-fold or smoothed toward a prior). See the Python package categorical-encoding and also the gradient boosting library CatBoost, which can perform a very clever and robust version of this scheme.
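A minimal sketch of plain (unregularized) target mean encoding with pandas, assuming a hypothetical categorical column `rear_suspension` and numeric target `price` (names made up for illustration):

```python
import pandas as pd

# Hypothetical training data: a high-cardinality categorical column and a numeric target.
train = pd.DataFrame({
    "rear_suspension": ["multi-link", "multi-link", "torsion beam", "leaf spring"],
    "price": [32000, 29000, 18000, 21000],
})

# Target mean encoding: replace each category with E[y | x = category],
# estimated as the per-category mean of the target on the training set.
category_means = train.groupby("rear_suspension")["price"].mean()
train["rear_suspension_te"] = train["rear_suspension"].map(category_means)

# Categories unseen at prediction time can fall back to the global mean.
global_mean = train["price"].mean()
new_values = pd.Series(["air suspension", "multi-link"])
encoded = new_values.map(category_means).fillna(global_mean)
```

In practice you would compute the per-category means out-of-fold (or let a library such as CatBoost handle it) rather than on the full training set, to limit target leakage.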

All that being said, you should still try some models with one-hot encoding. Note that target mean encoding is commonly used for classification tasks; I have not seen it come up as often in regression.
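For comparison, a one-hot (dummy) encoding sketch with pandas, again with a hypothetical column name:

```python
import pandas as pd

df = pd.DataFrame({
    "rear_suspension": ["multi-link", "torsion beam", "leaf spring", "multi-link"],
})

# One dummy column per level; drop_first=True avoids the dummy-variable trap
# (perfect collinearity with the intercept) in linear models.
dummies = pd.get_dummies(df["rear_suspension"], prefix="rear_susp", drop_first=True)
X = pd.concat([df.drop(columns=["rear_suspension"]), dummies], axis=1)
```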

Pavel