
I am working on a price prediction model. The attributes have lots of categories, and all of these categories are coded as integers. I am assuming that if I build a regression model on this, the model will treat them as numbers rather than categories. If I were to one-hot encode these attributes, the dimensionality would increase drastically. Is there any workaround, or are there best practices used in such scenarios?

The image shows the different attributes with a count of unique categories in each; all of these are integers.

[Image: Columns with unique numeric categories]
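
To make the concern concrete, here is a minimal sketch (in Python, with made-up stock codes) of the difference between leaving the codes as integers and marking them as categories before encoding:

```python
import pandas as pd

# Toy data standing in for the real attributes.
df = pd.DataFrame({"stock_code": [1, 2, 4, 4, 2],
                   "price": [10.0, 12.0, 9.0, 9.5, 11.0]})

# As int64, a regression would treat stock_code as an ordered quantity,
# implying code 4 is "four times" code 1.
print(df["stock_code"].dtype)  # int64

# Marking it categorical makes each code a distinct, unordered level;
# one-hot encoding then yields one column per level.
df["stock_code"] = df["stock_code"].astype("category")
print(pd.get_dummies(df["stock_code"]).shape)  # (5, 3): one column per level
```

With 4k+ unique values per column, that last step is exactly where the dimensionality blows up.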

  • Most software will treat the integers as numbers unless you tell the regression function that your data are categorical. In R, the command is `as.factor`. – Dave Nov 12 '20 at 14:37
  • I'm pretty new to machine learning, but from what I understand, if I leave these integer columns as numbers, the algorithm may misinterpret them. For example, there is a possibility of the algorithm giving a higher weight to stock code 4 than to stock code 1. So my plan was to one-hot encode these columns, but since these columns have 4k+ unique values, the dimensionality would blow up. Looking for some better approach to this. – Karthik K V Nov 12 '20 at 14:57
  • One-hot encoding should be fine; however, you need to use sparse matrices to represent the data to avoid a memory explosion (i.e., only store the nonzero entries), and you need some hierarchies and regularisation – e.g., stock codes could be grouped into a stock category such as Electronics > computer accessory > mouse, or brands, etc. See the sparse-encoding sketch after these comments. – seanv507 Nov 12 '20 at 16:07
  • @KarthikKV I do not follow your concern about the code misinterpreting the categorical columns. Could you please elaborate? – Dave Nov 12 '20 at 16:10
  • CatBoost would be an ML algorithm targeting this use case too; see the sketch after these comments. – seanv507 Nov 12 '20 at 16:12
  • How many categories? Have a look at https://stats.stackexchange.com/questions/146907/principled-way-of-collapsing-categorical-variables-with-many-levels – kjetil b halvorsen Nov 13 '20 at 16:06
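
To illustrate the sparse one-hot suggestion above, here is a minimal sketch (assuming scikit-learn and hypothetical `stock_code`/`price` columns; in scikit-learn versions before 1.2 the argument is `sparse` rather than `sparse_output`):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import Ridge

df = pd.read_csv("prices.csv")  # hypothetical file with stock_code and price

# sparse_output=True yields a scipy CSR matrix: only the nonzero entries
# are stored, so 4k+ dummy columns stay cheap in memory.
enc = OneHotEncoder(handle_unknown="ignore", sparse_output=True)
X = enc.fit_transform(df[["stock_code"]])

# Ridge supplies the regularisation mentioned above and accepts sparse input.
model = Ridge(alpha=1.0).fit(X, df["price"])
```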
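
And a minimal sketch of the CatBoost route (assuming the `catboost` package and the same hypothetical columns); CatBoost handles high-cardinality categorical features natively, so no one-hot step is needed:

```python
import pandas as pd
from catboost import CatBoostRegressor

df = pd.read_csv("prices.csv")  # hypothetical file with stock_code and price
X, y = df[["stock_code"]], df["price"]

# Listing the column in cat_features tells CatBoost to treat the
# integer codes as categories rather than as numbers.
model = CatBoostRegressor(iterations=200, verbose=False)
model.fit(X, y, cat_features=["stock_code"])
```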

2 Answers


My advice to obtain the best results: start with a randomly selected subsample of the population and, working with dummy variables, determine the contrasts of interest. Then apply your pre-selection knowledge, employing said contrasts of interest as variables, to perform a new analysis on the whole database excluding the previously examined subsample. (A sketch of this workflow follows the quote below.)

Why? Here is a reference, to quote:

Planned orthogonal contrasts have the greatest amount of statistical power of any of the multiple comparison methods. That means that planned orthogonal contrasts are more likely to identify true population differences than the alternatives (such as Dunnett and Scheffé). However, they require that you be able to specify your hypotheses in the form of contrasts before the experiment, and that you are able to obtain equal group sizes. If you add even one observation to any of the groups, the correlations among the vectors will no longer be 0.0, you’ll have lost the orthogonality, and you’ll need to resort to (probably) planned nonorthogonal contrasts, which, other things equal, are less powerful.
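
As a minimal sketch of this two-stage workflow in Python (the `stock_code` and `price` column names are hypothetical, and Helmert coding stands in for whatever planned orthogonal contrasts match your actual hypotheses):

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("prices.csv")  # hypothetical file with stock_code and price

# Stage 1: explore contrasts on a randomly selected subsample.
sub = df.sample(frac=0.2, random_state=42)
explore = smf.ols("price ~ C(stock_code, Helmert)", data=sub).fit()
print(explore.summary())

# Stage 2: refit on the remaining data, keeping the contrasts judged
# interesting in stage 1 (here the same coding is simply reused).
rest = df.drop(sub.index)
final = smf.ols("price ~ C(stock_code, Helmert)", data=rest).fit()
print(final.params.head())
```

With thousands of levels this is only practical after the levels have been grouped, as the comments above suggest.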

AJKOER

You definitely want your software to encode this information as categorical, even massively categorical. Once that's done, impact coding can be used to estimate a numeric value for each level of a factor. Impact coding is a pre-processing method applied before running a model. One huge advantage of this approach is that it greatly reduces the computational cost and time involved in inverting the cross-product matrix implicit in models that use categorical information as discrete factors. Here is a link to a detailed explanation of how to implement it: https://win-vector.com/2012/07/23/modeling-trick-impact-coding-of-categorical-variables-with-many-levels/ (a minimal sketch follows below).
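
A minimal sketch of impact coding in pandas (the `stock_code` and `price` columns and the smoothing constant are assumptions; the win-vector post describes refinements such as computing the codes on a held-out calibration split):

```python
import pandas as pd

def impact_code(train: pd.DataFrame, col: str, target: str,
                smoothing: float = 20.0) -> pd.Series:
    """Map each level of `col` to the smoothed deviation of its
    target mean from the grand mean (impact coding)."""
    grand_mean = train[target].mean()
    stats = train.groupby(col)[target].agg(["mean", "count"])
    # Shrink rare levels toward the grand mean to limit overfitting.
    smoothed = (stats["count"] * stats["mean"] + smoothing * grand_mean) / (
        stats["count"] + smoothing
    )
    return smoothed - grand_mean  # impact of each level relative to baseline

df = pd.read_csv("prices.csv")  # hypothetical file with stock_code and price
train = df.sample(frac=0.8, random_state=0)
test = df.drop(train.index)

# Fit the encoding on the training split only, then apply it to both
# splits; unseen levels in test fall back to zero impact.
impacts = impact_code(train, "stock_code", "price")
train["stock_code_impact"] = train["stock_code"].map(impacts)
test["stock_code_impact"] = test["stock_code"].map(impacts).fillna(0.0)
```

The single numeric `stock_code_impact` column then replaces the thousands of dummy columns in the downstream regression.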