
In a mixed effects model

$$ y_{ij} = \beta_{00} + \beta_{01} x_{1i} + \beta_{02} x_{2i} + \beta_{03} x_{3i} + u_i + \epsilon_{ij}$$

where $x_1, x_2, x_3$ are dummy variables coding the levels of a discrete (multinomial) variable $\tilde{x}$ that has more than two levels (here four), I want to give the intercept $\beta_{00}$ the interpretation of the (global) population mean, which is $E(y_{ij})$.

Right now the covariates are coded in a way that the intercept is interpreted as the mean of the reference category of $\tilde{x}$.

Is there a way to achieve this?

I have found a good overview of effect coding, but this type of mean coding is not part of it.

Edit: I just remembered how to do this for a variable $\tilde{x}$ that has two categories only. Then we have the model

$$ y_{ij} = \beta_{00} + \beta_{01} x_{1i} + u_i + \epsilon_{ij}$$

where the dummy $x_{1i}$ is defined to be $(1-p)$ if $\tilde{x}=1$ and $(-p)$ if $\tilde{x}=0$, where $p$ is the proportion of observations with $\tilde{x}=1$.
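To make this concrete, here is a quick sketch (my own toy data, using a plain `lm` rather than the mixed model, for simplicity) verifying that this weighted two-level coding makes the OLS intercept equal the sample mean:

```r
# Sketch with made-up data: coding the dummy as (1-p) vs (-p), where p is
# the sample proportion with x = 1, centers the regressor, so the
# intercept equals the sample mean of y.
set.seed(1)
n  <- 500
x  <- rbinom(n, 1, 0.3)
p  <- mean(x)                       # estimated proportion with x = 1
x1 <- ifelse(x == 1, 1 - p, -p)     # mean-centered dummy, i.e. x - p
y  <- 2 + 1.5 * x + rnorm(n)
fit <- lm(y ~ x1)
all.equal(unname(coef(fit)[1]), mean(y))  # TRUE: intercept is the sample mean
```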

Edit 2: Following the reply by Robert Long, deviation coding can be used when the number of observations in each level of $\tilde{x}$ is the same. However, I am looking for a solution for a multinomial $\tilde{x}$, possibly with unequal class probabilities. Here is some code implementing deviation coding with a multinomial $\tilde{x}$, demonstrating that this coding does not estimate the global mean. I suspect some sort of category weighting on the deviation-coded dummies is needed instead (like I did for the two-category case above).

# Code to assess deviation coding for multinomial xt
library(dplyr)  # for %>%
n = 1000
set.seed(13)
xt = rmultinom(n, 1, c(1/3, 1/3, 1/3))  # 3 x n one-hot indicator matrix
xt = as.factor(colSums(xt * 1:3))       # map each column's indicator to its level (1, 2 or 3)
X <- model.matrix(~ xt)
betas <- c(3, 1, 2)
Y <- X %*% betas + rnorm(n)
mean(Y)

lm(Y ~ xt) %>% coef()   # default treatment coding

contrasts(xt) <- contr.sum(3) # specify deviation coding
lm(Y ~ xt) %>% coef()
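For what it's worth, here is a sketch of one such weighting, so-called weighted effect coding, which I am assuming generalizes the two-category trick above: each dummy is $1$ at its own level, $0$ elsewhere, and $-n_k/n_1$ at the reference level, which makes the intercept equal the sample grand mean even with unequal class counts (toy data made up for the sketch):

```r
# Sketch (weighted effect coding, an assumption on my part): dummy k is 1
# at level k, -n_k/n_1 at the reference level 1, and 0 elsewhere. With
# this coding the intercept equals the sample grand mean even when the
# class counts are unequal.
set.seed(13)
n  <- 1000
xw <- factor(sample(1:3, n, replace = TRUE, prob = c(0.5, 0.3, 0.2)))
Yw <- c(3, 4, 5)[xw] + rnorm(n)

tab <- table(xw)
C <- matrix(0, nrow = 3, ncol = 2)
C[2, 1] <- 1
C[3, 2] <- 1
C[1, ] <- -tab[2:3] / tab[1]   # counter-weights on the reference row
contrasts(xw) <- C
fit <- lm(Yw ~ xw)
all.equal(unname(coef(fit)[1]), mean(Yw))  # TRUE: intercept is the grand mean
```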

Edit 3: Originally the question was titled "Which effect coding (categorical encoding) to use if I want the model intercept to have the interpretation of the global mean?" That title wrongly suggested that my objective could be reached by effect coding alone. The answer by Robert Long applies to balanced categories, in which case deviation coding should be used.

tomka
    Please clarify what you mean by the "(global) population mean." Are the four levels of the categorical variable equally represented both in your data sample and in the population? Otherwise, a "global population mean" could be a good deal different from, say, the mean of the means of the 4 groups. – EdM Sep 26 '20 at 15:55
  • @EdM By global mean I mean the expectation of $Y$. The categorical variable $\tilde{x}$ may be assumed to be multinomially distributed with unequal class probabilities. – tomka Sep 26 '20 at 15:58
  • Specifically, the expectation of $Y$ in your data sample, perhaps with the assumption that the group representation in your sample represents that in the population? – EdM Sep 26 '20 at 16:00
  • Please look through this answer with examples https://stats.stackexchange.com/a/221868/3277. You probably want deviation (effect) type of coding of your factor. – ttnphns Sep 26 '20 at 16:02
  • @EdM The estimated intercept of the model should be equal to the sample mean, and the sample mean estimates the population mean. See my edit for an example with two categories. – tomka Sep 26 '20 at 16:07
  • @ttnphns Long answers there, will need some time to look at that. However, see the example in my edit for two categories. Is this the deviation effect coding you mean? – tomka Sep 26 '20 at 16:08

1 Answer


If the data are balanced, then deviation coding should work.

Let's look at a simple example:

library(dplyr)  # for %>%
set.seed(13)
dt <- expand.grid(X1 = LETTERS[1:3], reps = 1:5)
X <- model.matrix(~ X1, dt)
betas <- c(3, 1, 2)
dt$Y <- X %*% betas + rnorm(nrow(dt))
mean(dt$Y)

[1] 4.11413

So we would like the intercept to be 4.11413.

If we fit the model with default coding we get:

lm(Y ~ X1, dt) %>% coef()   # default treatment coding

(Intercept)         X1B         X1C 
  3.3430627   0.2867999   2.0264018 

But now if we use deviation coding we get:

contrasts(dt$X1) <- contr.sum(3) # specify deviation coding
lm(Y ~ X1, dt) %>% coef()

(Intercept)         X11         X12 
  4.1141299  -0.7710672  -0.4842673 
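As a quick check of why this works (a sketch re-deriving the example above; balanced data assumed): with sum-to-zero contrasts the intercept estimates the unweighted average of the group means, which coincides with the grand mean only when the groups have the same size:

```r
# Rebuild the balanced example and check that the intercept under
# contr.sum equals the unweighted mean of the three group means
set.seed(13)
dt <- expand.grid(X1 = LETTERS[1:3], reps = 1:5)
X  <- model.matrix(~ X1, dt)
dt$Y <- as.vector(X %*% c(3, 1, 2)) + rnorm(nrow(dt))
contrasts(dt$X1) <- contr.sum(3)
fit <- lm(Y ~ X1, dt)
group_means <- tapply(dt$Y, dt$X1, mean)
all.equal(unname(coef(fit)[1]), mean(group_means))  # TRUE
all.equal(mean(group_means), mean(dt$Y))            # TRUE, because balanced
```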

If the data are unbalanced then you will need to do some post-hoc adjustment.


Edit: To address what to do when the data are unbalanced.

In this case, it is easier to work with default treatment coding rather than deviation coding:

> set.seed(1)
> dt1 <- expand.grid(X1 = LETTERS[1], reps = 1:5)
> dt2 <- expand.grid(X1 = LETTERS[2], reps = 1:3)
> dt3 <- expand.grid(X1 = LETTERS[3], reps = 1:2)
> dt <- rbind(dt1, dt2, dt3)
> table(dt$X1)

A B C 
5 3 2 

So the groups are unbalanced.

> X <- model.matrix(~ X1, dt)
> betas <- c(2, 3, 1)
> dt$Y <- 4 + X %*% betas + rnorm(nrow(dt), 0, 1)
> mean(dt$Y)

[1] 7.232203

So we would like to recover 7.23 with a post-hoc calculation, which can be achieved fairly easily with

> coef(lm(Y ~ X1, dt))[1] + betas[2] * table(dt$X1)[2]/nrow(dt) + betas[3] * table(dt$X1)[3]/nrow(dt)

(Intercept) 
   7.22927 

Note that the result is not exact due to the combination of group imbalance and random error. As the error approaches zero, the result becomes exact. Even with error, the result is unbiased, as we can see from a Monte Carlo simulation:

n.sim <- 1000
vec.sim <- numeric(n.sim)

for (i in 1:n.sim) {
  
  set.seed(i)

  dt$Y <- 4 + X %*% betas + rnorm(nrow(dt), 0, 1)

  vec.sim[i] <- mean(dt$Y) - (coef(lm(Y ~ X1, dt))[1] + betas[2] * table(dt$X1)[2]/nrow(dt) + betas[3] * table(dt$X1)[3]/nrow(dt))

}

hist(vec.sim)
mean(vec.sim)

[1] -0.003418483

[Histogram of vec.sim: the simulated differences are centered on zero]


Edit: As noted in the comments, we should really use the coefficient estimates from the model, which also makes the calculation exact:

> fit <- lm(Y ~ X1, dt)
> coef(fit)[1] + coef(fit)[2] * table(dt$X1)[2]/nrow(dt) + coef(fit)[3] * table(dt$X1)[3]/nrow(dt)
(Intercept) 
   7.232203 
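The same calculation generalizes to any number of levels; here is a small helper (hypothetical, with made-up data mimicking the unbalanced example) that recovers the grand mean exactly from a treatment-coded fit:

```r
# Hypothetical helper: grand mean from a treatment-coded fit, for any
# number of factor levels. It adds to the intercept each non-reference
# coefficient weighted by that level's sample proportion.
grand_mean_from_fit <- function(fit, x) {
  props <- prop.table(table(x))   # sample proportion of each level
  cf <- coef(fit)
  unname(cf[1] + sum(cf[-1] * props[-1]))
}

set.seed(1)
x <- factor(rep(c("A", "B", "C"), times = c(5, 3, 2)))  # unbalanced groups
y <- c(6, 7, 5)[x] + rnorm(length(x))
fit <- lm(y ~ x)   # default treatment coding
all.equal(grand_mean_from_fit(fit, x), mean(y))  # TRUE: exact, as in the edit
```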
Robert Long
  • also I think referred to as "sum-to-zero" contrasts, or "effect coding" https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faqwhat-is-effect-coding/ ... For what it's worth, your answer only works to retrieve the grand mean if the original data set is balanced ... – Ben Bolker Sep 26 '20 at 23:46
  • @BenBolker Agreed. If unbalanced it will be a weighted average of the means in each group, I believe? – Robert Long Sep 27 '20 at 06:05
  • Yes, that's right. In principle there should be a way to "counter-weight" the coded values (probably by just multiplying or dividing each indicator by the population fraction), but this adjustment is usually done *post hoc* (computing least-squares or expected marginal means) – Ben Bolker Sep 27 '20 at 12:17
  • @BenBolker Sorry, what do you mean by balanced data? That the multinomial classes of x all have the same probability? I think that's exactly the problem I posted. Basically I would need to know how the counter-weighting approach works. – tomka Sep 28 '20 at 08:24
  • Ahh now I see. By balanced you mean exactly the same number of observations for each factor level (which is stronger than only having the same probability for each class). So unfortunately, no, this does not solve my problem. – tomka Sep 28 '20 at 08:35
  • @tomka yes, balanced means the same number of observations in each group. The question is about contrast coding, and this is the way to do it with balanced data. If your data are unbalanced then you will need to do some further adjustment, not related to contrast coding. You could ask a new question about how to do that. – Robert Long Sep 28 '20 at 08:46
  • Somewhat disagree that it is a new question. I think the question is plainly and simply how to achieve the global mean with a multinomial $\tilde{x}$. I think it is part of the answer that not only contrast coding is required but also weighting. Your answer gives an important building block, but the weighting part of the answer is still missing (I do not know how to achieve the right weights in all but the binomial case). – tomka Sep 28 '20 at 09:16
  • The question title is *"Which effect coding (categorical encoding) to use if I want the model intercept to have the interpretation of the global mean?"* and I have answered that - it's possible with balanced data, but not with unbalanced data. If the question had been *"how can I obtain an intercept that is equal to the grand mean"* then it would be different. This site works best when questions are answered "as asked" and any further question is asked in a new post. Please don't change the existing post to alter the question. – Robert Long Sep 28 '20 at 09:22
  • Again I disagree. My question was stated in the title as you quote, but *I erred* in thinking that my objective could be reached by *effect coding alone*. In fact I *thought* it could be reached by effect coding, but your useful answer has demonstrated that it cannot. At no point in my question did I say that the levels of x are balanced, and in the comments I clarified (before you posted your answer) that "The categorical variable x~ may be assumed to be multinomially distributed with unequal class probabilities." – tomka Sep 28 '20 at 09:57
  • NB: CV explicitly asks to work on questions to improve/clarify them, not post new questions. I could even change the title of my question now in order to improve it – tomka Sep 28 '20 at 09:57
  • The improvement process is generally one in response to comments on the original question, not on comments to an answer that you don't like. Anyway I won't comment further. I tried my best to answer the question posed and have offered my advice on how you might be able to solve the problem of unbalanced data, which was not apparent from the question originally asked. I would be happy to try to answer a new question. – Robert Long Sep 28 '20 at 10:15
  • @BenBolker's first comment to your answer was that it only works for balanced data. So clearly he had caught the drift of my question. Even if I asked a new question now, it would be semantically equivalent except for two or three words that I edited in/out now (I added a remark about the earlier version). I think whether or not the first version included the case of unbalanced classes is just hairsplitting. Again, I think your answer is useful, but it omitted the non-balanced case, which I had not excluded from my question. I'd be happy if you answer this question in its clarified form. – tomka Sep 28 '20 at 10:33
  • OK fair enough, I've updated my answer :) – Robert Long Sep 28 '20 at 10:59
  • @RobertLong Thank you. I am not sure about the `betas` you use in the marginal mean formula. Shouldn't these be the estimates from the regression instead? You seem to use the exact true values, which are unknown in practice. – tomka Sep 29 '20 at 08:48
  • Ahh, yes you are quite right, and that also makes the calculation exact! I have updated the answer. – Robert Long Sep 29 '20 at 09:21
  • It seems like the post-hoc estimate of $E(y)$ also has a smaller standard error than the naive estimate. I find this particularly counter-intuitive. If you call your model `fit`, then you can get the SE using emmeans: `emmeans(fit, ~1, weights = 'proportional')` – JTH Sep 29 '20 at 12:02
  • One possible issue is that it is a post-hoc estimate of the global mean. My objective was (see question) giving the model intercept $\beta_{00}$ the interpretation of the global mean, because I also wanted to interpret the random effect of the intercept as one of global mean differences across clusters in the mixed effect model. I believe such an interpretation is not possible using the post-hoc estimate. – tomka Sep 30 '20 at 10:55
  • @JTH By naive estimate you mean the sample mean, right? This is plausible as regression estimation of means using covariates related to the outcome is known to improve estimate precision. This is used in survey statistics for example to increase estimates' precision using exogenous information on the sample and population level. – tomka Sep 30 '20 at 10:58