difference between dummy variable categories that weren't omitted

Question

Assume we have a one-hot encoded categorical variable with three or more categories, {race1, race2, ..., race-n}. To avoid the dummy variable trap, assume we omitted race1. The coefficients of race2, race3, ..., race-n then let us compare each category to race1.

What I'm trying to figure out is how to compare race2 and race3 within the same model, without rerunning the regression.

Why not use _repeated contrasts_ instead of dummy variables (aka indicator or onehot contrasts)? See brief explanation of some popular contrast types within answer https://stats.stackexchange.com/a/221868/3277 — ttnphns, Apr 28 '18 at 18:42

Isabella Ghement · Accepted Answer · 2018-05-05T17:46:23.237

Assume your outcome variable Y is continuous and the model includes only the race factor, so that:

Y = beta0 + beta1*race2 + beta2*race3 + ... + beta-n-1*race-n + error

Then:

beta0 = mean value of Y when race is equal to race1

beta0 + beta1 = mean value of Y when race is equal to race2

beta0 + beta2 = mean value of Y when race is equal to race3

and so on.

If you are interested in the difference in the mean value of Y between race3 and race2, that will be given by:

(beta0 + beta2) - (beta0 + beta1) = beta2 - beta1

So you can set your contrast as:

c = (0, -1, 1, 0, ..., 0)

where the length of the contrast vector is the same as the number of beta coefficients in the model.
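As a minimal sketch of how the contrast vector is applied, the Python snippet below uses made-up outcome values for three categories (the data and the names `y_by_race`, `beta`, and `estimate` are illustrative assumptions, not part of the question). It relies on the fact that, when the race dummies are the only predictors, the fitted coefficients are the reference-group mean and the differences of the other group means from it.

```python
from statistics import mean

# Made-up outcome values for three categories (illustration only).
y_by_race = {
    "race1": [10.0, 12.0, 11.0],
    "race2": [14.0, 15.0, 16.0],
    "race3": [20.0, 19.0, 21.0],
}

# With race1 as the reference and no other predictors, the OLS estimates
# are the reference mean and each group's difference from that mean.
beta0 = mean(y_by_race["race1"])
beta1 = mean(y_by_race["race2"]) - beta0  # coefficient on race2
beta2 = mean(y_by_race["race3"]) - beta0  # coefficient on race3
beta = [beta0, beta1, beta2]

# Contrast for "race3 minus race2": one entry per coefficient.
c = [0, -1, 1]
estimate = sum(ci * bi for ci, bi in zip(c, beta))

# The contrast reproduces the direct difference in group means.
direct = mean(y_by_race["race3"]) - mean(y_by_race["race2"])
print(estimate, direct)
```

The intercept gets weight 0 because it cancels when the two group means are subtracted.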

Comment:

When a model includes dummy variables used to encode the effect of a categorical variable, what that really means is that the model actually consists of a series of sub-models, one for each category of that variable. To write down each sub-model, first set all the dummy variables to zero (this gives the reference category), then set each dummy variable to 1 in turn while keeping all the others at zero.

For the model:

Y = beta0 + beta1*race2 + beta2*race3 + ... + beta-n-1*race-n + error   (*),

the race variable is categorical with n categories and the dummy variables race2, ..., race-n are used to encode its effect on Y. (The race1 dummy variable was omitted from the model, reflecting the fact that race1 is treated as a reference category.)

Here are the n sub-models that can be derived from model (*).

Sub-model 1 corresponds to race = race1 and is obtained by setting all dummy variables in model (*) to 0. Its equation is given by:

Y = beta0 + error

In this sub-model, beta0 represents the mean value of Y when race = race1.

Sub-model 2 corresponds to race = race2 and is obtained by setting the dummy variable for race2 to 1 in model (*) and all other dummy variables to 0. Its equation is given by:

Y = beta0 + beta1 + error

In this sub-model, beta0 + beta1 represents the mean value of Y when race = race2.

...

Sub-model n corresponds to race = race-n and is obtained by setting the dummy variable for race-n to 1 in model (*) and all other dummy variables to 0. Its equation is given by:

Y = beta0 + beta-n-1 + error

In this sub-model, beta0 + beta-n-1 represents the mean value of Y when race = race-n.

The above sub-models help elucidate the interpretation of the parameters beta0, beta0 + beta1, ..., beta0 + beta-n-1. Now we can construct differences between any of these parameters and interpret them. For example:

(beta0 + beta1) - (beta0) = beta1 represents the difference in the mean value of Y between people for whom race = race2 and those for whom race = race1.
(beta0 + beta2) - (beta0 + beta1) = beta2 - beta1 represents the difference in the mean value of Y between people for whom race = race3 and those for whom race = race2.
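The same reasoning suggests what would happen if race2 were the reference category instead. The short derivation below is only a sketch: the symbols mu_k (mean of Y when race = race-k) and gamma (coefficients of the re-referenced model) are notation introduced here for illustration.

```latex
% Re-referenced model with race2 omitted (gamma is hypothetical notation):
%   Y = \gamma_0 + \gamma_1\,\text{race1} + \gamma_3\,\text{race3} + \dots + \text{error}
\begin{aligned}
\gamma_0 &= \mu_2 = \beta_0 + \beta_1 \\
\gamma_0 + \gamma_3 &= \mu_3 = \beta_0 + \beta_2 \\
\gamma_3 &= \mu_3 - \mu_2 = (\beta_0 + \beta_2) - (\beta_0 + \beta_1) = \beta_2 - \beta_1
\end{aligned}
```

So the coefficient of race3 in a model that uses race2 as the reference would equal beta2 - beta1 from the original model, because both models describe the same set of category means.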

would `beta2 - beta1` be the coefficient of race3 if race2 was the reference in that case? I don't see how that works? — mamdouh alramadan, May 05 '18 at 05:32
I added a comment to my original answer which should help you understand more about how dummy variables work. — Isabella Ghement, May 05 '18 at 17:47

score 1 · Answer 2 · answered May 03 '18 at 21:14

There are several ways to do this, but they all entail constructing linear combinations of the coefficients, such as their difference. This is doable since the original regression gives you the covariances between coefficients in addition to their standard errors, so that you can use the formula for linear combinations of correlated random variables.
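For reference, the formula in question can be written as follows. This is a sketch in generic notation: b is the vector of estimated coefficients, V its estimated covariance matrix, and c a vector of weights (all three symbols are introduced here for illustration).

```latex
\operatorname{Var}(c^\top b) = c^\top V c,
\qquad
\operatorname{Var}(b_j - b_k)
  = \operatorname{Var}(b_j) + \operatorname{Var}(b_k)
  - 2\,\operatorname{Cov}(b_j, b_k)
```

The standard error of the difference is the square root of this variance, which is why the covariance term is needed in addition to the two reported standard errors.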

Here's an example showing this with Stata, where we are interested in comparing the effect on price of repair record of 4 versus 5 (with 1 as the base category).

. sysuse auto
(1978 Automobile Data)

. reg price i.rep78

      Source |       SS           df       MS      Number of obs   =        69
-------------+----------------------------------   F(4, 64)        =      0.24
       Model |  8360542.63         4  2090135.66   Prob > F        =    0.9174
    Residual |   568436416        64     8881819   R-squared       =    0.0145
-------------+----------------------------------   Adj R-squared   =   -0.0471
       Total |   576796959        68  8482308.22   Root MSE        =    2980.2

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       rep78 |
          2  |   1403.125   2356.085     0.60   0.554    -3303.696    6109.946
          3  |   1864.733   2176.458     0.86   0.395    -2483.242    6212.708
          4  |       1507   2221.338     0.68   0.500    -2930.633    5944.633
          5  |     1348.5   2290.927     0.59   0.558    -3228.153    5925.153
             |
       _cons |     4564.5   2107.347     2.17   0.034     354.5913    8774.409
------------------------------------------------------------------------------

. test _b[4.rep78] =  _b[5.rep78]

 ( 1)  4.rep78 - 5.rep78 = 0

       F(  1,    64) =    0.02
            Prob > F =    0.8899

. lincom  _b[4.rep78] - _b[5.rep78]

 ( 1)  4.rep78 - 5.rep78 = 0

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |      158.5   1140.558     0.14   0.890    -2120.026    2437.026
------------------------------------------------------------------------------

. margins a.rep78

Contrasts of adjusted predictions
Model VCE    : OLS

Expression   : Linear prediction, predict()

------------------------------------------------
             |         df           F        P>F
-------------+----------------------------------
       rep78 |
   (1 vs 2)  |          1        0.35     0.5536
   (2 vs 3)  |          1        0.15     0.6984
   (3 vs 4)  |          1        0.16     0.6886
   (4 vs 5)  |          1        0.02     0.8899
      Joint  |          4        0.24     0.9174
             |
 Denominator |         64
------------------------------------------------

--------------------------------------------------------------
             |            Delta-method
             |   Contrast   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
       rep78 |
   (1 vs 2)  |  -1403.125   2356.085     -6109.946    3303.696
   (2 vs 3)  |  -461.6083    1185.87     -2830.656     1907.44
   (3 vs 4)  |   357.7333   888.5353      -1417.32    2132.787
   (4 vs 5)  |      158.5   1140.558     -2120.026    2437.026
--------------------------------------------------------------
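The 4 vs 5 standard error can be reproduced by hand from the regression table. The Python sketch below assumes that rep78 is the only predictor, as in this regression; in that case each dummy coefficient is a group mean minus the reference mean, so the covariance between any two dummy coefficients equals the variance of `_cons`. With additional covariates that shortcut no longer holds and the covariance would have to be read from the estimated covariance matrix instead. The variable names are illustrative.

```python
import math

# Estimates and standard errors copied from the regression table.
b4, b5 = 1507.0, 1348.5
se_b4 = 2221.338    # 4.rep78
se_b5 = 2290.927    # 5.rep78
se_cons = 2107.347  # _cons (mean of the base category, rep78 = 1)

# Assumption: rep78 is the only predictor, so
# Cov(b4, b5) = Var(_cons) = se_cons ** 2.
cov_b4_b5 = se_cons ** 2

# Var(b4 - b5) = Var(b4) + Var(b5) - 2 * Cov(b4, b5)
diff = b4 - b5
var_diff = se_b4 ** 2 + se_b5 ** 2 - 2 * cov_b4_b5
se_diff = math.sqrt(var_diff)
t_stat = diff / se_diff

print(diff, se_diff, t_stat)
```

Up to rounding of the tabulated standard errors, the result should agree with the `lincom` row and the (4 vs 5) contrast.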

You can also look at ratios of coefficients (though doing this by hand gets trickier since this is a non-linear combination of random variables):

. nlcom  ratio_4_5:_b[4.rep78]/_b[5.rep78]

   ratio_4_5:  _b[4.rep78]/_b[5.rep78]

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   ratio_4_5 |   1.117538   .9271601     1.21   0.228    -.6996625    2.934738
------------------------------------------------------------------------------
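`nlcom` bases its standard error on the delta method, so the ratio's standard error can be approximated by hand with a first-order expansion. The Python sketch below makes the same assumption as before: rep78 is the only predictor, so the covariance between the two dummy coefficients equals the squared standard error of `_cons`. The variable names are illustrative.

```python
import math

# Estimates and standard errors copied from the regression table.
b4, b5 = 1507.0, 1348.5
se_b4, se_b5, se_cons = 2221.338, 2290.927, 2107.347

# Assumption: rep78 is the only predictor, so Cov(b4, b5) = Var(_cons).
cov_b4_b5 = se_cons ** 2

ratio = b4 / b5

# Gradient of g(b4, b5) = b4 / b5 with respect to (b4, b5).
g4 = 1.0 / b5
g5 = -b4 / b5 ** 2

# Delta method: Var(g) is approximately grad' * V * grad.
var_ratio = (
    g4 ** 2 * se_b4 ** 2
    + g5 ** 2 * se_b5 ** 2
    + 2 * g4 * g5 * cov_b4_b5
)
se_ratio = math.sqrt(var_ratio)

print(ratio, se_ratio)
```

Both numbers should come out close to the `ratio_4_5` row, up to rounding of the inputs.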
