
Let's assume we have a regression model with a continuous variable, age, and two categorical variables: gender and education.

1st categorical variable (gender)

  • woman
  • man

2nd categorical variable (education)

  • no qualification
  • higher intermediate
  • graduate or more

Income = age + woman + higher-intermediate + graduate-or-more

How should I interpret the coefficient for woman? Is it the income difference between a woman and a man who both have no qualification, or is it the income difference between a woman and a man regardless of their education? If it is the former, how do I measure the latter?

Michael R. Chernick
user2018454
  • How many categorical variables are there? You seem to be indicating just two. – Michael R. Chernick Mar 31 '17 at 14:30
  • I simplified in the example keeping only 2, does it change the principle if there are more than 2 ? – user2018454 Mar 31 '17 at 14:41
  • I have found that it is easier to interpret the effects of categorical variables if you don't dummy code them yourself but instead tell the software that they are categorical (leaving the coding to the software). Then, instead of interpreting coefficients, look at, and possibly test, differences among means. The adjusted means may be called least squares means or estimated marginal means, depending on the software. Most software will let you run Tukey HSD tests on these means (in your case, the three levels of categorical variable 2). – David Lane Mar 31 '17 at 14:53
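The adjusted means mentioned in the comment above can be sketched by hand. This is only an illustration under stated assumptions: the coefficients, the noise-free sample, and the helper `design_row` are made up, and the adjusted mean is taken here as the model prediction at the mean age, averaged with equal weight over the two genders. Statistical software may use different weights or defaults.

```python
import itertools
import numpy as np

GENDERS = ["man", "woman"]
EDU_LEVELS = ["no qualification", "higher intermediate", "graduate or more"]

def design_row(age, gender, edu):
    """Dummy coding with 'man' and 'no qualification' as reference levels."""
    return [
        1.0,                                            # intercept
        float(age),
        1.0 if gender == "woman" else 0.0,
        1.0 if edu == "higher intermediate" else 0.0,
        1.0 if edu == "graduate or more" else 0.0,
    ]

# Made-up coefficients and a made-up, noise-free sample (illustration only).
true_beta = np.array([20000.0, 500.0, -10000.0, 8000.0, 15000.0])
people = list(itertools.product(range(25, 61, 5), GENDERS, EDU_LEVELS))
X = np.array([design_row(*p) for p in people])
y = X @ true_beta

# Ordinary least squares fit of the additive model.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

mean_age = X[:, 1].mean()

# Adjusted mean per education level: predict at the mean age for each gender,
# then average those predictions with equal weight per gender.
adjusted_mean = {
    edu: float(np.mean([np.array(design_row(mean_age, g, edu)) @ beta_hat
                        for g in GENDERS]))
    for edu in EDU_LEVELS
}

for edu, value in adjusted_mean.items():
    print(f"{edu}: {value:.0f}")
```

In this additive setup the differences between the adjusted means equal the education coefficients, so comparing the means and comparing the coefficients tell the same story.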

1 Answer


When you have a regression model with one or more categorical variables, one level of each of those variables is taken as the reference level, and the model is fitted relative to these reference levels (for example, the level "man" of your gender variable).

You then interpret it as follows: when gender is "man", the coefficient associated with "woman" has no effect on the response variable (you can think of "woman" as being 0). When gender is "woman", this variable takes the value 1, so the response variable is shifted by the associated coefficient. So, if the "woman" coefficient is positive, the model says that women have higher incomes on average than men of the same age and education, and if it is negative, the other way around.
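Written out, and assuming the additive model from the question with no interaction terms (the symbols $\beta_0,\dots,\beta_4$ and $\varepsilon$ are introduced here only for illustration), the interpretation above amounts to:

```latex
\begin{align*}
\text{Income} &= \beta_0 + \beta_1\,\text{age} + \beta_2\,\text{woman}
  + \beta_3\,\text{higher-intermediate} + \beta_4\,\text{graduate-or-more} + \varepsilon \\[4pt]
E[\text{Income} \mid \text{woman},\ \text{age},\ \text{education}]
  - E[\text{Income} \mid \text{man},\ \text{age},\ \text{education}] &= \beta_2
\end{align*}
```

The age and education terms are identical in both expectations and cancel, so in a model of this form $\beta_2$ is the woman versus man difference at any fixed age and any fixed education level.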

The same happens with your education variable, which has three levels. "no qualification" is the reference level, so the coefficients of "higher-intermediate" and "graduate-or-more" enter the prediction only for people with those qualifications.
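A minimal sketch of this dummy coding, assuming made-up coefficients and a noise-free sample so that least squares recovers them exactly; the helper names `design_row` and `predict` are hypothetical. It fits the additive model from the question and compares predictions for a woman and a man of the same age at each education level.

```python
import itertools
import numpy as np

EDU_LEVELS = ["no qualification", "higher intermediate", "graduate or more"]

def design_row(age, gender, edu):
    """Dummy-code one person; 'man' and 'no qualification' are the reference levels."""
    return [
        1.0,                                            # intercept
        float(age),
        1.0 if gender == "woman" else 0.0,
        1.0 if edu == "higher intermediate" else 0.0,
        1.0 if edu == "graduate or more" else 0.0,
    ]

# Made-up coefficients: intercept, age, woman, higher-intermediate, graduate-or-more.
true_beta = np.array([20000.0, 500.0, -10000.0, 8000.0, 15000.0])

# Every combination of a few ages, both genders, and all education levels.
people = list(itertools.product(range(25, 61, 5), ["man", "woman"], EDU_LEVELS))
X = np.array([design_row(*p) for p in people])
y = X @ true_beta                                       # noise-free incomes

# Ordinary least squares fit.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

def predict(age, gender, edu):
    """Predicted income: only the dummies that equal 1 contribute their coefficient."""
    return float(np.array(design_row(age, gender, edu)) @ beta_hat)

# Woman minus man at age 40, separately for each education level.
gaps = {edu: predict(40, "woman", edu) - predict(40, "man", edu) for edu in EDU_LEVELS}
for edu, gap in gaps.items():
    print(f"{edu}: {gap:.0f}")
```

Because this model has no interaction term, each gap equals the fitted "woman" coefficient (up to floating-point error), whichever education level is held fixed.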

  • Thank you Alex. Assume the coefficient for woman is -10,000. Does it mean that women earn on average 10,000 less than men, or that women earn 10,000 less than men with no qualification, since the reference group here is a man with no qualification? – user2018454 Apr 03 '17 at 09:38
  • Taking into account that the reference level for the education variable is "no qualification", your interpretation should be "women with no qualification earn on average 10,000 less than men with no qualification". If you want to consider another level of the education variable, you'll have to combine both coefficients, and you may want to consider an interaction between those variables. – Àlex Porcel Apr 03 '17 at 10:12
  • Sorry, but I am very confused, as there are many contradictory answers on this topic here http://stackoverflow.com/questions/21677105/how-to-interpret-r-linear-regression-when-there-are-multiple-factor-levels-as-th and here http://stats.stackexchange.com/questions/120030/interpretation-of-betas-when-there-are-multiple-categorical-variables – user2018454 Apr 03 '17 at 12:23
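The interaction raised in the comments above can be sketched as follows. This is an illustration only, with made-up coefficients, noise-free data, and hypothetical names (`design_row`, `gap_by_edu`). Once gender-by-education interaction terms are in the model, the "woman" coefficient is the gap at the reference education level only, and the gap at another level is the "woman" coefficient plus the matching interaction coefficient.

```python
import itertools
import numpy as np

EDU_LEVELS = ["no qualification", "higher intermediate", "graduate or more"]
NAMES = ["intercept", "age", "woman", "higher_intermediate", "graduate_or_more",
         "woman:higher_intermediate", "woman:graduate_or_more"]

def design_row(age, gender, edu):
    """Dummy coding plus gender-by-education interaction columns."""
    woman = 1.0 if gender == "woman" else 0.0
    hi = 1.0 if edu == "higher intermediate" else 0.0
    grad = 1.0 if edu == "graduate or more" else 0.0
    return [1.0, float(age), woman, hi, grad, woman * hi, woman * grad]

# Made-up coefficients in the order of NAMES: the gender gap is -10000 at the
# reference level and shrinks by 4000 and 7000 at the two higher levels.
true_beta = np.array([20000.0, 500.0, -10000.0, 8000.0, 15000.0, 4000.0, 7000.0])

people = list(itertools.product(range(25, 61, 5), ["man", "woman"], EDU_LEVELS))
X = np.array([design_row(*p) for p in people])
y = X @ true_beta                                       # noise-free incomes

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
coef = dict(zip(NAMES, beta_hat))

# Woman minus man at a fixed age, by education level.
gap_by_edu = {
    "no qualification": coef["woman"],
    "higher intermediate": coef["woman"] + coef["woman:higher_intermediate"],
    "graduate or more": coef["woman"] + coef["woman:graduate_or_more"],
}
for edu, gap in gap_by_edu.items():
    print(f"{edu}: {gap:.0f}")
```

Without the interaction columns, the model forces the same gap at every education level; with them, the gap is allowed to differ by level, which is why the reading of the "woman" coefficient depends on which of the two models was fitted.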