Big categorical data

Question

I am trying to predict the price of used vehicles using three different models: regression, ANN, and random forest. I have 6 input variables. One of them is the manufacturer of the vehicle, with 186 different manufacturers; the other 5 are numerical. Any ideas on how I should approach this problem? I was thinking of one-hot encoding the categorical variable (manufacturer) and then applying PCA. Is that correct? Do I need to standardize my data before doing PCA?

Hi, welcome. 0) Why do you think this would be a problem? 1) How many observations do you have? 2) How uniform is the distribution of the manufacturers? 3) You could try a sensible supercategorization, e.g.: USA, Asia, Europe. Or: put all small categories together. 4) See this answer: https://stats.stackexchange.com/a/5777/ . – *Reviewer* — Jim, Jul 25 '18 at 15:43
Hi, I have 685608 observations. I don't have any additional information to categorize the data. I was thinking of one-hot encoding the manufacturer (producing 186 columns), standardizing the other five variables, and then doing PCA. I am not sure if this approach is correct. — Anna, Jul 25 '18 at 15:50

score 0 · Answer 1 · answered Jul 24 '19 at 21:36

You could use regularized regression, although with 685608 observations regularization may be unnecessary. As your goal is to predict prices, you could validate your model with cross-validation. Since you have one factor variable with many levels (manufacturer), you could try the fused lasso; see "Principled way of collapsing categorical variables with many levels?".
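As a minimal sketch of the idea above (regularize the manufacturer effect, then check it with cross-validation), the following pure-Python example shrinks each manufacturer's mean price toward the overall mean and scores the result with k-fold cross-validation. This is ridge regression on the one-hot encoded factor alone, with the intercept fixed at the grand mean; it is not the fused lasso. The function names, the penalty `lam`, and the toy data are assumptions made for illustration.

```python
import math
import random
from collections import defaultdict


def fit_shrunken_means(levels, prices, lam):
    """Ridge estimate of one effect per level, intercept fixed at the grand mean.

    Minimising sum((y - mu - b[level])**2) + lam * sum(b**2) with mu fixed
    gives b[level] = sum(y - mu over that level) / (n_level + lam).
    """
    mu = sum(prices) / len(prices)
    resid_sum = defaultdict(float)
    count = defaultdict(int)
    for level, y in zip(levels, prices):
        resid_sum[level] += y - mu
        count[level] += 1
    effects = {k: resid_sum[k] / (count[k] + lam) for k in count}
    return mu, effects


def predict(model, levels):
    mu, effects = model
    # Levels never seen in training fall back to the grand mean.
    return [mu + effects.get(level, 0.0) for level in levels]


def cv_rmse(levels, prices, lam, k=5, seed=0):
    """k-fold cross-validated root mean squared error."""
    idx = list(range(len(prices)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    sq_err = 0.0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = fit_shrunken_means(
            [levels[i] for i in train], [prices[i] for i in train], lam
        )
        preds = predict(model, [levels[i] for i in fold])
        sq_err += sum((p - prices[i]) ** 2 for p, i in zip(preds, fold))
    return math.sqrt(sq_err / len(prices))


# Toy data (hypothetical): three manufacturers, one of them rare.
rng = random.Random(1)
levels = ["A"] * 200 + ["B"] * 200 + ["C"] * 4
true_mean = {"A": 10000.0, "B": 15000.0, "C": 12000.0}
prices = [true_mean[m] + rng.gauss(0, 2000) for m in levels]

# Compare a few penalty values by their cross-validated error.
for lam in (0.0, 1.0, 10.0):
    print(lam, round(cv_rmse(levels, prices, lam), 1))
```

With real data, the numerical predictors would enter the model as well, and `lam` would be picked as the value with the lowest cross-validated error.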

Once you have validated a regression model, if the predictions it gives are not good enough, you can see whether neural networks or random forests give better results. But note that random forests may not work well when a categorical predictor has so many levels. In that case, you could use the grouping of levels obtained from the fused lasso model when constructing the forest model.
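One simple way to reduce the number of levels before fitting a forest is to pool rare manufacturers into a single level, as suggested in the comments on the question. The sketch below does this by frequency; it is a stand-in for, not an implementation of, the fused-lasso grouping, and `min_count`, the `"Other"` label, and the sample data are assumptions.

```python
from collections import Counter


def lump_rare_levels(values, min_count, other_label="Other"):
    """Replace levels seen fewer than min_count times with other_label."""
    counts = Counter(values)
    keep = {level for level, n in counts.items() if n >= min_count}
    return [v if v in keep else other_label for v in values]


# Hypothetical manufacturer column.
manufacturers = ["Ford"] * 5 + ["Toyota"] * 4 + ["Lada", "Tatra", "Lada"]
grouped = lump_rare_levels(manufacturers, min_count=3)
print(Counter(grouped))
```

The threshold is itself a tuning parameter, so it can be chosen by comparing cross-validated prediction error across a few values.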

PCA does not seem to be a relevant method for this problem.

score -2 · Answer 2 · answered Jul 26 '18 at 07:05

I think there is no need to one-hot encode the categorical variable (I'll come back to this point later). I recommend first ensuring the data is clean, i.e., checking for and treating skewness, missing values, collinearity and multicollinearity, outliers, and near-zero variance. Once you've completed this vital pre-processing stage, you'll have cleaned variables for PCA, but here is the catch: PCA works only for continuous data, and you have a categorical predictor. There are more appropriate techniques for mixed data types, such as Multiple Factor Analysis (MFA) or Factor Analysis of Mixed Data (FAMD), both available in FactoMineR, an R package. Do read these related posts on the same topic: 1, 2 and 3.

Coming back to the one-hot encoding issue: if you're using the FAMD algorithm in the FactoMineR package, it will handle the categorical variable automatically.
