I am running a multinomial logistic regression. The outcome variable is categorical with seven levels. The predictor is binary.

Very briefly, the experiment is such that I am asking whether a stimulus belonging to level A or B of the predictor makes a person's response more likely to belong to any of the seven levels of the dependent variable.

However, I am interested in the effect of the predictor on the likelihood of choosing each of the seven levels of the DV. This is troublesome because I know that one level of the dependent variable has to be treated as a reference case.

What should I do? Is there a way to still discern the effect of the predictor on the likelihood of choosing the reference category? Is it common practice to run and report the analysis with different levels being treated as the reference?

kjetil b halvorsen
Dave
  • "Effect of the predictor" *relative to what?* Isn't that the basic statistical lesson to be drawn from the mathematical need for a reference case: that without it, the meaning of "effect" is indeterminate? Nevertheless, if you have only a single categorical regressor, why not just use effects coding for it? – whuber Jun 13 '17 at 21:48
  • If I were to use effects coding instead of dummy coding for the predictor, would that give me a result for each level of the dependent variable? – Dave Jun 13 '17 at 22:01
  • That indeed is the point of effects coding: the coefficients are interpreted as individual effects relative to an average. You can also code each level of the predictor with a binary indicator and simply leave out the constant term (which is the sum of all the binary indicators). – whuber Jun 13 '17 at 22:03
  • A single binary predictor isn't much to work with. I would start with a 2-way table as a much more straightforward approach. In this discussion-in-comments I think the role of your variables is getting confused. To be clear, Dave, your dependent/response variable is categorical with 7 unordered possibilities? And your only independent predictor variable is binary? – Gregor Thomas Jun 13 '17 at 22:08
  • To follow on whuber's comment, effect coding would indeed allow you to recover a value for the reference category of the predictor, but don't try to over-interpret the results. For mathematical reasons, one level has to be omitted from the regression (and there is strictly nothing you can do about it!). Effect coding can be seen as a mean-centring procedure for categorical variables, but effect coding and dummy coding are linearly related, meaning that you can easily move from one coding strategy to the other. – Nicolas K Jun 13 '17 at 22:11
  • @Gregor yep that's right: independent variable is binary; dependent variable is categorical with seven unordered levels. – Dave Jun 13 '17 at 22:11
  • Just to be clear: the suggestion is to use effects coding on my binary predictor variable? – Dave Jun 13 '17 at 22:13
  • In the first place, why do you want to estimate this model? Using a binary predictor to predict 7 events sounds like overkill. A simple cross-tabulation would do the job. – Nicolas K Jun 13 '17 at 22:14
  • No, forget about effect coding. It won't solve your "issue" of having to omit 1 level for your dependent variable. There is nothing you can do about it. That being said, it does not mean that the reference level has no value. Actually (by constraint) the value for the ref level is 0. – Nicolas K Jun 13 '17 at 22:15
  • Being the reference doesn't mean it's meaningless. If none stands out, just use the most commonly occurring one as your reference. – Josh Jun 13 '17 at 22:47
  • @Umka I don't mean to say that it is meaningless. But I would like to be able to report on the effect of the independent variable on all seven outcome categories, which I believe I can't do if one is a reference category. – Dave Jun 13 '17 at 22:49
  • Is it possible to use all other categories as the reference? That is, have the outcome be the likelihood of choosing Category 1 vs. Categories 2-7, the likelihood of choosing Category 2 vs. Categories 1 and 3-7, etc.? – Dave Jun 15 '17 at 00:26
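
The cross-tabulation suggested in the comments above could look roughly like the sketch below. The data are simulated and the names `stimulus` and `response` are hypothetical stand-ins, not taken from the original experiment. The point is that the row proportions give an estimated probability for every one of the seven outcome levels under each stimulus, with no reference category needed, while the differences between the two rows necessarily sum to zero.

```r
# Simulated stand-in data (hypothetical): binary stimulus, 7-level response.
set.seed(1)
n <- 700
stimulus <- factor(sample(c("A", "B"), n, replace = TRUE), levels = c("A", "B"))
response <- factor(sample(paste0("R", 1:7), n, replace = TRUE),
                   levels = paste0("R", 1:7))

# 2 x 7 contingency table of counts.
tab <- table(stimulus, response)

# Row proportions: estimated P(response level | stimulus) for all seven levels.
props <- prop.table(tab, margin = 1)
print(round(props, 3))

# Change in the probability of each response level when moving from A to B.
# These seven differences sum to zero, so only six of them are free.
effect <- props["B", ] - props["A", ]
print(round(effect, 3))

# Overall test of association between stimulus and response.
print(chisq.test(tab))
```

With a single binary predictor, a multinomial logit model is saturated, so its fitted probabilities should match these row proportions whichever outcome level is used as the reference.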

1 Answer

This is almost a FAQ, asked many times on this site! The short answer is that for a categorical variable ("factor" in R-speak) with $k$ levels, one cannot estimate a separate "effect" for each of the $k$ levels, since the space generated by the factor (beyond the intercept) has dimension only $k-1$ (that is, $k-1$ degrees of freedom). There are many ways to parametrize this space, but the most usual one is to choose one reference level and measure the effect of each of the other levels as a difference from the reference.

When that is done, we can say that the effect of the reference level itself is zero, so its coefficient is zero, with a standard error of zero, as there is no sampling variability in a constant value.
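
As a minimal illustration of the dimension argument (a sketch using the built-in `mtcars` data, chosen here only for convenience), the design matrix for a three-level factor has an intercept plus two indicator columns under R's default treatment coding. Adding an indicator for every level alongside the intercept does not increase the rank:

```r
# A factor with k = 3 levels (cyl from the built-in mtcars data).
cyl <- factor(mtcars$cyl)
k <- nlevels(cyl)

# Default treatment (dummy) coding: intercept plus k - 1 indicator columns.
# The first level ("4") is the reference and gets no column of its own.
X_treat <- model.matrix(~ cyl)
print(colnames(X_treat))

# Intercept plus one indicator per level: k + 1 columns, but the indicators
# sum to the intercept column, so the rank is still only k.
X_full <- cbind(1, model.matrix(~ cyl - 1))
print(ncol(X_full))
print(qr(X_full)$rank)
```

The rank deficiency is why one coefficient has to be fixed by a constraint, for example by setting the reference level's coefficient to zero.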

Software should help users by including that in the summary output table, as below, where I take the example used in the question "Values of reference categories for main and interaction effects using lm() in R" and edit in three lines for the three reference levels:

Coefficients: (1 not defined because of singularities)
                                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)                        21.500      3.349   6.421 1.22e-06 *** 
as.factor(cyl)4                       0         0      NA      NA     
as.factor(cyl)6                    -1.750      4.101  -0.427   0.6734    
as.factor(cyl)8                    -6.450      3.485  -1.851   0.0766
as.factor(gear)3                      0         0      NA      NA     
as.factor(gear)4                    5.425      3.552   1.527   0.1397    
as.factor(gear)5                    6.700      4.101   1.634   0.1154  
as.factor(cyl)4:as.factor(gear)3      0         0      NA      NA  
as.factor(cyl)6:as.factor(gear)4   -5.425      4.585  -1.183   0.2483    
as.factor(cyl)8:as.factor(gear)4       NA         NA      NA       NA    
as.factor(cyl)6:as.factor(gear)5   -6.750      5.800  -1.164   0.2559    
as.factor(cyl)8:as.factor(gear)5   -6.350      4.833  -1.314   0.2013   

The three reference levels in this example are as.factor(cyl)4, as.factor(gear)3 and, for the interaction, as.factor(cyl)4:as.factor(gear)3. The values in the last two columns are NA (Not Available), since a test statistic or p-value is not defined there: it makes no sense to test a value that is zero by definition!

Many users' lives would be simpler if the output were reported this way!
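
A table of this kind can be put together by hand. The following is only a sketch, for a simpler main-effects model on `mtcars`, and it assumes R's default treatment contrasts, where the first level of each factor is the reference. The names `ref_names`, `ref_rows` and `full_table` are introduced here for illustration.

```r
# Fit a main-effects model on the built-in mtcars data (illustrative choice).
fit <- lm(mpg ~ factor(cyl) + factor(gear), data = mtcars)
coefs <- summary(fit)$coefficients  # Estimate, Std. Error, t value, Pr(>|t|)

# Row name for each factor's reference level: term label + first level.
# Assumption: default treatment contrasts, so the first level is the reference.
ref_names <- unname(vapply(
  names(fit$xlevels),
  function(v) paste0(v, fit$xlevels[[v]][1]),
  character(1)
))

# Reference rows: estimate 0, standard error 0, no test statistic or p-value.
ref_rows <- matrix(
  rep(c(0, 0, NA, NA), each = length(ref_names)),
  nrow = length(ref_names),
  dimnames = list(ref_names, colnames(coefs))
)

# Append the reference rows to the usual coefficient table.
full_table <- rbind(coefs, ref_rows)
print(round(full_table, 3))
```

Here the reference rows are appended at the end rather than interleaved with the other levels, which keeps the sketch short.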

Other posts on this site treat the same topic.

Edit

There is an R package that can (among a lot of other goodies ...) make regression output tables including lines for the reference levels of factors. That is package gtsummary with function tbl_regression. For some examples see https://stackoverflow.com/questions/67225238/is-there-a-way-to-change-in-referent-category-in-the-gtsummary-to-ref-or-a
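
A minimal sketch of such a call, assuming the gtsummary package is installed (the guard below skips the call otherwise). The model is an illustrative one on `mtcars`, and the argument `add_estimate_to_reference_rows` is, to my understanding, the option that puts the estimate in the reference rows instead of leaving them blank; check the package documentation for your installed version.

```r
# Illustrative model on the built-in mtcars data.
fit <- lm(mpg ~ factor(cyl) + factor(gear), data = mtcars)

# Build the regression table only if gtsummary is available.
tbl <- if (requireNamespace("gtsummary", quietly = TRUE)) {
  gtsummary::tbl_regression(fit, add_estimate_to_reference_rows = TRUE)
} else {
  NULL  # package not installed; nothing to show
}
```

Printing `tbl` at the console or in an R Markdown chunk renders the table, which includes one row per factor level, reference levels included.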

kjetil b halvorsen