Dummy-variable creation increases the number of columns compared to label encoding. How are these treated in our sample data? When used in a linear regression model, how do they affect the dimensionality of the data with regard to the 'curse of dimensionality'? I was told that increasing the number of features without increasing the sample size may reduce model accuracy.
1 Answer
What matters is how your modeling interprets the label encoding.
This answer and the sklearn documentation indicate that "label encoding" simply converts the text levels of a categorical variable to numeric values between 0 and one less than the number of categories. So if the categories have a natural order that the label encoding captures correctly, and each step up that order has the same influence on the outcome (possibly on some transformed scale), then you can treat the label-encoded data as a single numeric variable. Under that assumption, the variable represents only one "feature" in your model.
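As a minimal sketch of this with made-up data: sklearn's `LabelEncoder` assigns the integers 0 to n-1 in sorted order of the labels, so the label names below are chosen so that alphabetical order happens to match the natural order. The encoded result is a single numeric column.

```python
# Sketch with assumed data: label-encoding an ordered categorical variable.
# LabelEncoder assigns integers 0..n-1 in sorted order of the labels.
from sklearn.preprocessing import LabelEncoder

sizes = ["a_small", "b_medium", "c_large", "b_medium", "a_small"]
enc = LabelEncoder()
codes = enc.fit_transform(sizes)  # one numeric column: 0, 1, 2, 1, 0
print(list(enc.classes_))
print(codes.tolist())
```

Note that if alphabetical order did not match the natural order, you would need `OrdinalEncoder` with an explicit `categories` argument instead.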
Often, however, a categorical variable has no such ordering, or the influence per step is not constant. If there is a numeric ordering but the relationship needs further manipulation to make your model work (e.g., spline modeling), the variable will represent a correspondingly larger number of features. If there is no ordering among the levels of the categorical variable, then label encoding doesn't really make sense, and you will have to count each level of the variable after the first as another "feature," as in dummy encoding.
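The feature-counting point above can be sketched with assumed data: an unordered variable with k levels becomes k-1 dummy columns (dropping the first level as the reference), so each level after the first adds one feature to the regression.

```python
# Sketch with assumed data: dummy-encoding an unordered categorical variable.
# A variable with k levels yields k-1 dummy columns when the first level
# is dropped as the reference category.
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
dummies = pd.get_dummies(df["color"], drop_first=True)
print(list(dummies.columns))  # 3 levels -> 2 dummy columns
print(dummies.shape[1])
```

This is why an unordered categorical variable with many levels inflates dimensionality much faster than a genuinely ordinal one treated as a single numeric column.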