
I have a problem with multiple categorical inputs. These categories do not intuitively map to integers, while preserving their adjacent relationship. Given this, does it make more sense to use a Decision Tree than Linear Regression? I am getting a very high validation error on my LR model with encoded categories, and a significantly lower one with a Decision Tree.

I'm wondering whether, in practice, multiple categorical variables with the relationship outlined above prevent Linear Regression from being a good model, even after encoding.

kjetil b halvorsen
redress
  • Did you try the standard encoding with dummy variables (one per category, without reference category)? – Michael M Jul 27 '17 at 05:42
  • Yes, I have represented all categorical values with integers. 'American' and 'Italian' map to 0 and 1, for example, which doesn't make sense when constructing a coefficient for that term in the Linear Regression. – redress Jul 27 '17 at 05:46
  • Just replacing labels by integers is not the same as dummy coding. There, you represent each label by its own column. – Michael M Jul 27 '17 at 05:47
  • I have 75+ categories... – redress Jul 27 '17 at 05:52
  • Is it possible to group the categories thematically (without statistics) in few groups and represent each group by a dummy? – Michael M Jul 27 '17 at 08:12
  • The curse of dimensionality... – g3o2 Jul 27 '17 at 09:55
  • @MichaelM, no, that is not possible. I'm wondering if you can simply address the question outlined in the OP. Are Decision Trees better for modeling categorical values? – redress Jul 27 '17 at 13:48
  • What does "These categories do not intuitively map to integers, while preserving their adjacent relationship." mean? – Peter Flom Jan 26 '20 at 17:06

1 Answer


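You say you represented the category levels by integers. That is not the same as dummy encoding, and it should not be done: a single coefficient on an integer label forces the effect to be linear in an arbitrary numbering of the levels. A decision tree can split on such a label repeatedly and isolate individual levels, which probably explains why it gave much better results. Try again, but this time with dummy encoding.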
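To make the difference concrete, here is a minimal sketch on synthetic data. Everything in it is an assumption for illustration (the five levels, their effects, the noise level), not the asker's data, and it compares in-sample error only. It fits ordinary least squares once with the integer label as a single numeric column and once with one 0/1 column per level.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a nominal feature with 5 levels whose true effects
# have nothing to do with the order of the integer labels.
n_levels = 5
true_effect = np.array([3.0, -2.0, 5.0, 0.0, -4.0])
codes = np.arange(500) % n_levels  # integer label of each row
y = true_effect[codes] + rng.normal(scale=0.5, size=codes.size)


def fit_mse(X, y):
    """Fit ordinary least squares and return the in-sample mean squared error."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.mean((y - X @ beta) ** 2))


# (a) Integer coding: an intercept plus one slope on the label itself.
X_int = np.column_stack([np.ones(codes.size), codes])

# (b) Dummy coding: one 0/1 column per level, no intercept, no reference level.
X_dummy = np.eye(n_levels)[codes]

mse_int = fit_mse(X_int, y)
mse_dummy = fit_mse(X_dummy, y)
print(f"integer coding MSE: {mse_int:.2f}")
print(f"dummy coding MSE:   {mse_dummy:.2f}")
```

With dummy columns the model estimates a separate mean for each level, so its error should sit near the noise variance, while the integer-coded fit can only draw one straight line through the level numbers.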

You mention in the comments that you have 75+ categories (better to say levels). Why is that a problem? If you represent the design matrix using sparse matrices, storage should not be a problem. If the problem is too many parameters relative to $n$, use some form of regularization. You could mine the question "Principled way of collapsing categorical variables with many levels?" for ideas.
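As a rough sketch of that suggestion, assuming scikit-learn is available: `OneHotEncoder` returns a sparse matrix by default, and `Ridge` accepts sparse input. The data are synthetic and hypothetical (75 made-up levels, an arbitrary `alpha=1.0` that would normally be tuned by cross-validation), so treat this as an illustration rather than a recipe.

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import Ridge
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)

# Hypothetical data: one nominal feature with 75 levels, every level present.
n_rows, n_levels = 3000, 75
level_ids = rng.permutation(np.arange(n_rows) % n_levels)
labels = np.array([f"level_{i}" for i in level_ids]).reshape(-1, 1)
true_effect = rng.normal(scale=2.0, size=n_levels)
y = true_effect[level_ids] + rng.normal(scale=0.5, size=n_rows)

# One 0/1 column per level, stored sparsely: one nonzero entry per row.
encoder = OneHotEncoder(handle_unknown="ignore")
X = encoder.fit_transform(labels)
print(sparse.issparse(X), X.shape, X.nnz)

# Ridge (L2) regularization shrinks the 75 level coefficients toward zero.
# alpha=1.0 is an arbitrary choice here, not a recommendation.
model = Ridge(alpha=1.0)
model.fit(X[:2000], y[:2000])
r2 = model.score(X[2000:], y[2000:])  # R^2 on the held-out rows
print(f"held-out R^2: {r2:.2f}")
```

The encoded matrix stores one nonzero per row however many levels there are, so memory grows with the number of rows rather than with rows times levels.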

kjetil b halvorsen