I am trying to do a regression analysis in which one of the predictors is a categorical variable with three categories, say A, B and C. These categories CANNOT be put in a specific order such as A > B > C. They are purely nominal, e.g. race = hispanic, race = asian or race = white.

I found two different ways (both from the same source) to code this kind of data: system #1, Regression with Categorical Predictors, and system #2, Additional Coding Systems ...

System 1 suggests creating three variables, e.g. race1, race2 and race3, each coded 1 or 0 depending on which category the observation falls into. For example, if the observation has race "hispanic", race1 is coded 1 and race2 and race3 are coded 0. Likewise, for "asian", race2 is coded 1 and race1 and race3 are coded 0, and so on. In the actual model only two of the variables (e.g. race2 and race3) are used, and the omitted one (race1) serves as the reference.

System 2 does it a little differently. You create k-1 variables, where k is the number of levels, so here two variables, race1 and race2. Instead of 1 or 0, each variable is coded (k-1)/k if the observation DOES fall into the respective category and -1/k if it DOES NOT. In my example, with hispanic as the reference: a hispanic observation gets race1 = -1/3 and race2 = -1/3, while an asian observation gets race1 = 2/3 and race2 = -1/3. Here, too, only two variables are actually used in the model.
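To make the two schemes concrete, here is a minimal sketch of how each level would be coded with k = 3. The function names `dummy_code` and `simple_code` are made up for illustration, and "hispanic" is assumed to be the reference level, as in the example above.

```python
# Levels of the categorical predictor; the first one is treated as the reference.
levels = ["hispanic", "asian", "white"]
k = len(levels)

def dummy_code(value):
    """System 1: a 0/1 indicator for each non-reference level."""
    return [1.0 if value == lvl else 0.0 for lvl in levels[1:]]

def simple_code(value):
    """System 2: (k-1)/k if the observation is in the level, otherwise -1/k."""
    return [(k - 1) / k if value == lvl else -1 / k for lvl in levels[1:]]

for lvl in levels:
    print(lvl, dummy_code(lvl), simple_code(lvl))
```

Note that each column of system 2 is just the corresponding column of system 1 shifted by the constant 1/k.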

I just ran two regression analyses using the two coding schemes and got exactly the same results, except for the constant, which is slightly lower for system #2. So what is the difference between the two schemes? Why should I prefer one over the other (apart from system 1 being much more straightforward)?

EDIT: The two links I provided currently do not work (at least for me). That's why I tried my best to explain the two systems :)

kjetil b halvorsen
Tom K
    There is some useful information in [the Wikipedia article](https://en.wikipedia.org/wiki/Categorical_variable) on categorical variables. – mdewey Dec 12 '16 at 16:23
    Please take time to read this [answer](http://stats.stackexchange.com/a/221868/3277), which explains what the contrast coding types are, how they differ and how to code them. – ttnphns Dec 12 '16 at 16:33
    I addressed this in another post: https://stats.stackexchange.com/a/439818/247274. – Dave Mar 01 '20 at 23:27

1 Answer

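The two "systems" differ only in what the constant/intercept represents. In the first system (dummy coding), the dropped category is the baseline, so the intercept is the expected outcome for that reference category. In the second system, each coded column is the corresponding dummy column shifted by the constant 1/k, so the slopes are unchanged and still compare each category with the reference, but the intercept becomes the unweighted mean of the category means. Since your constant is lower in the second system, I would expect that the reference category's mean lies above that average. The fitted values are identical, so it is really a matter of preference which model you choose. However, I generally prefer to simply drop one of the categories from the "dummy" columns created from categorical variables.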
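A small numerical check can illustrate this. The sketch below uses made-up outcome values for three groups (group 0 is assumed to be the reference) and a tiny hand-written least-squares routine. `solve` and `ols` are illustrative helpers, not functions from any library.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * p for a, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ols(X, y):
    """Ordinary least squares via the normal equations X'X b = X'y."""
    p = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve(XtX, Xty)

# Made-up data for illustration only; group sizes are deliberately unequal.
y = [10.0, 12.0, 11.0, 7.0, 9.0, 4.0, 5.0, 6.0, 5.0]
group = [0, 0, 0, 1, 1, 2, 2, 2, 2]  # 0 is the reference category
k = 3

# System 1: intercept plus 0/1 indicators for groups 1 and 2.
X_dummy = [[1.0, float(g == 1), float(g == 2)] for g in group]
# System 2: the same indicators shifted by -1/k.
X_simple = [[1.0, float(g == 1) - 1 / k, float(g == 2) - 1 / k] for g in group]

b_dummy = ols(X_dummy, y)
b_simple = ols(X_simple, y)

group_means = [
    sum(v for v, g in zip(y, group) if g == j) / group.count(j) for j in range(k)
]

print("dummy :", [round(b, 6) for b in b_dummy])
print("simple:", [round(b, 6) for b in b_simple])
print("reference mean:", group_means[0])
print("mean of group means:", sum(group_means) / k)
```

With these made-up numbers the group means are 11, 8 and 5, so both fits give slopes of -3 and -6 (up to floating-point rounding), while the intercept is 11 under system 1 and 8 under system 2. That is the same pattern as in the question: identical slopes and a lower constant for system 2.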

kjetil b halvorsen