
Sometimes I encode categorical features as binary indicators: one feature per possible category value, set to 1 when the original value matches that category (i.e. the one-of-K scheme).

Now these indicator values are linearly dependent, since their sum across each sample is obviously 1.

Does this linear dependence matter for linear SVM, kernel SVM, logistic regression, etc.? Where does it matter enough that I need to remove one of the features? Does it cause problems for ordinary linear regression? For which methods does it make no difference?
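For concreteness, here is a minimal sketch of the encoding in question, assuming a recent scikit-learn (the `sparse_output` keyword) and a made-up feature:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column with K = 3 levels.
colors = np.array([["red"], ["green"], ["blue"], ["green"]])

encoder = OneHotEncoder(sparse_output=False)  # one-of-K scheme
X = encoder.fit_transform(colors)
print(X)

# Every row sums to 1, so the K columns are linearly dependent.
print(X.sum(axis=1))  # [1. 1. 1. 1.]
```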

Gerenuk

2 Answers


Based on my understanding, collinearity affects the estimation of the weights: it leads to multiple equivalent solutions. So if your goal is to inspect the feature weights and assess their significance, you should remove one dummy and use only K-1 values; the dropped category becomes the baseline, and its effect is absorbed by the intercept. Alternatively, you can use all K values and fit without an intercept. A sketch of both points follows below.
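Here is a sketch of the K-1 encoding, assuming pandas and a tiny made-up dataset; `drop_first=True` drops one dummy so that category becomes the baseline absorbed by the intercept:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"],
                   "y": [1.0, 2.0, 3.0, 2.5]})

# K-1 dummies: "blue" is dropped and becomes the baseline,
# absorbed by the intercept column.
X = pd.get_dummies(df["color"], drop_first=True).astype(float)
X.insert(0, "intercept", 1.0)

# Ordinary least squares; the design has full column rank here,
# so the solution is unique.
beta, *_ = np.linalg.lstsq(X.to_numpy(), df["y"].to_numpy(), rcond=None)
print(dict(zip(X.columns, beta)))
```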

But if your goal is a prediction model with high predictive performance (i.e., what you really care about is the output), it does not matter which encoding scheme you use. If you use all K values, you can either disable the intercept term or add a regularization term to eliminate the impact of the collinearity.
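As a sketch of the regularization route, using the full K columns with an L2 penalty (ridge), which picks a unique solution despite the collinearity; the data are the same hypothetical example as above:

```python
import pandas as pd
from sklearn.linear_model import Ridge

df = pd.DataFrame({"color": ["red", "green", "blue", "green"],
                   "y": [1.0, 2.0, 3.0, 2.5]})

# Full K dummies: the columns sum to 1, so together with the
# intercept the design is rank-deficient.
X = pd.get_dummies(df["color"]).astype(float)

# The L2 penalty on the coefficients makes the optimum unique;
# predictions are unaffected by which equivalent parameterization
# the unpenalized problem would allow.
model = Ridge(alpha=1.0).fit(X, df["y"])
print(model.predict(X))
```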

Munichong

The linear dependence won't matter for any method that excludes the intercept column from the design matrix. E.g., in both SAS and R you can request that the intercept not be included in a linear regression. If there is linear dependence, what happens then depends on the implementation; I believe one of the SAS regression procedures (probably PROC REG) automatically drops one or more predictors to restore linear independence.
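In Python the same idea looks like the following sketch: all K dummies are kept but the intercept is disabled, so no dependence with a constant column arises (the data are again made up):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"color": ["red", "green", "blue", "green"],
                   "y": [1.0, 2.0, 3.0, 2.5]})
X = pd.get_dummies(df["color"]).astype(float)  # all K columns

# fit_intercept=False: without a constant column the K dummies
# are linearly independent, and each weight is simply the mean
# of y within that category.
model = LinearRegression(fit_intercept=False).fit(X, df["y"])
print(dict(zip(X.columns, model.coef_)))
# blue -> 3.0, green -> 2.25, red -> 1.0 (per-category means)
```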

Nik Tuzov