Decision Tree vs Regression for Multiple Categorical Inputs

Question

I have a problem with multiple categorical inputs. These categories do not intuitively map to integers, while preserving their adjacent relationship. Does it make more sense to us a Decision Tree than Linear Regression given this fact? I am getting very high validation error on my LR model with encoded categories, and significantly lower with a Decision Tree.

Im wondering, in practice, if multiple categorical values with the relationship outlined above prevents Linear Regression from being a good model, even after encoding.

Did you try the standard encoding with dummy variables (one per category, without reference category)? — Michael M, Jul 27 '17 at 05:42
Yes, I have represented all categorical values with integers. 'American' and 'Italian' map to 0 and 1, for example, which doesnt make sense when constructing a coefficient for that term in the Linear Regression. — redress, Jul 27 '17 at 05:46
Just replacing labels by integers is not the same as dummy coding. There, you represent each label by an own column. — Michael M, Jul 27 '17 at 05:47
Is it possible to group the categories thematically (without statistics) in few groups and represent each group by a dummy? — Michael M, Jul 27 '17 at 08:12
@MichaelM, no that is not possible. Im wondering if you can simply address the question outlined in the OP. Are Decision Trees better for modeling categorical values? — redress, Jul 27 '17 at 13:48
What does "These categories do not intuitively map to integers, while preserving their adjacent relationship." mean? — Peter Flom, Jan 26 '20 at 17:06

score 0 · Answer 1 · answered Jan 26 '20 at 16:53

You say you represented the category levels by integers. That is not the same as using dummy encoding! and should not be done. This error probably explains why a decision tree gave much better results. Try again, but now using dummy encoding.

You hint in comments about 75+ categories (better to say levels), why is that a problem? If you represent the design matrix using sparse matrices, storage should not be a problem. If too many parameters compared to $n$ is the problem, use some form of regularization. You could mine Principled way of collapsing categorical variables with many levels? for ideas.

Decision Tree vs Regression for Multiple Categorical Inputs

1 Answers1