I am preparing a data analysis for a longitudinal study investigating the effects of two treatments (A & B) over time with the following 4 groups:

  • Group 1: Control (Saline)
  • Group 2: Treatment A
  • Group 3: Treatment B
  • Group 4: Treatment A + Treatment B (same doses as group 2 & 3)

I am considering the following encoding schemes:

Make Saline, Treatment A, and Treatment B separate binary variables, giving a treatment vector [I(Saline) I(Tx A) I(Tx B)], where I(x) is the indicator function that equals 1 if x is given and 0 otherwise. The values would be:

  • Group 1 (Control - Saline): [1 0 0]
  • Group 2 (Treatment A): [0 1 0]
  • Group 3 (Treatment B): [0 0 1]
  • Group 4 (Treatment A + B): [0 1 1]

Make Treatment Group a single categorical variable with 4 non-ordered categories:

  • Group 1 (Control - Saline) = 1
  • Group 2 (Treatment A) = 2
  • Group 3 (Treatment B) = 3
  • Group 4 (Treatment A + B) = 4

Make Treatment Group a single categorical variable with saline as base category & 3 non-ordered categories:

  • Group 1 (Control - Saline) = 0
  • Group 2 (Treatment A) = 1
  • Group 3 (Treatment B) = 2
  • Group 4 (Treatment A + B) = 3

Which encoding method would be the most appropriate?

Robert Long
user294162

1 Answer

They are all the same. I'm not completely clear what you mean in your first example, but we will get to that. Basically, you just have to look at the design matrix that each encoding of the variable produces. We can do this easily in R with a simple example:

> library(magrittr)  # provides the %>% pipe used below
> expand.grid(trt = c("1","2","3","4"), reps = 1:2) %>% model.matrix(~ trt, .)
  (Intercept) trt2 trt3 trt4
1           1    0    0    0
2           1    1    0    0
3           1    0    1    0
4           1    0    0    1
5           1    0    0    0
6           1    1    0    0
7           1    0    1    0
8           1    0    0    1
attr(,"assign")
[1] 0 1 1 1
attr(,"contrasts")
attr(,"contrasts")$trt
[1] "contr.treatment"

> expand.grid(trt = c("0","1","2","3"), reps = 1:2) %>% model.matrix(~ trt, .)
  (Intercept) trt1 trt2 trt3
1           1    0    0    0
2           1    1    0    0
3           1    0    1    0
4           1    0    0    1
5           1    0    0    0
6           1    1    0    0
7           1    0    1    0
8           1    0    0    1
attr(,"assign")
[1] 0 1 1 1
attr(,"contrasts")
attr(,"contrasts")$trt
[1] "contr.treatment"

which, as you can see, are the same. R uses treatment contrasts (contr.treatment) by default; we could instead use, for example, Helmert coding:

expand.grid(trt = c("1","2","3","4"), reps = 1:2) %>% model.matrix(~ trt, ., contrasts.arg = list(trt = "contr.helmert"))

or orthogonal polynomials (contrasts.arg = list(trt = "contr.poly")). Either choice gives a design matrix that differs from the default one, but under any given contrast scheme the two labelings of trt ("1","2","3","4" or "0","1","2","3") produce the same design matrix and therefore the same model output.
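To check this directly rather than by eye, here is a minimal sketch in base R. The response y is simulated noise and the column trt0 is a hypothetical relabelling added only for illustration; neither comes from the study in the question.

```r
# Hypothetical toy data: 4 groups, 2 replicates each; y is simulated noise.
set.seed(1)
d <- expand.grid(trt = c("1", "2", "3", "4"), reps = 1:2)
d$trt0 <- factor(d$trt, labels = c("0", "1", "2", "3"))  # same groups, relabelled
d$y <- rnorm(nrow(d))

# Default (treatment) contrasts under both labelings.
X1 <- model.matrix(~ trt, d)
X0 <- model.matrix(~ trt0, d)
# Entries are identical; only the column names differ.
same_matrix <- all(unname(X1) == unname(X0))

# Helmert contrasts change the design matrix ...
XH <- model.matrix(~ trt, d, contrasts.arg = list(trt = "contr.helmert"))
differs <- !all(unname(XH) == unname(X1))

# ... but not the fitted values, because the columns span the same space.
fit_trt <- lm(y ~ trt, data = d)
fit_hel <- lm(y ~ trt, data = d, contrasts = list(trt = "contr.helmert"))
same_fit <- isTRUE(all.equal(fitted(fit_trt), fitted(fit_hel)))
```

Only the meaning of the individual coefficients changes between contrast schemes; the group means the model estimates do not.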

In your first example, it is not clear what you mean. The way you have written [1 0 0], [0 1 0], [0 0 1] and [0 1 1] makes it seem like you might be thinking of setting your own contrasts, in which case the question "What is a contrast matrix?" and its answers will tell you all about that. On the other hand, if those are just binary numbers then they evaluate to 4, 2, 1 and 3 respectively, and if you coded those as levels ("4", "2", "1", "3") then again you would obtain the same design matrix and thus the same model, as long as saline remains the reference level. You could even code them as ("1 0 0", "0 1 0", "0 0 1", "0 1 1") and yet again it would be the same.
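As a rough illustration of that last point, the sketch below uses the bracketed vectors purely as text labels. It assumes the level order is fixed by hand so that saline ("1 0 0") stays the reference level; the object names are made up for the example.

```r
# The four bracketed vectors from the question, used only as labels.
codes <- c("1 0 0", "0 1 0", "0 0 1", "0 1 1")

# Fix the level order explicitly so that saline ("1 0 0") is the reference.
d_str <- expand.grid(trt = factor(codes, levels = codes), reps = 1:2)
X_str <- model.matrix(~ trt, d_str)

# The plain 1..4 coding for comparison.
d_num <- expand.grid(trt = factor(1:4), reps = 1:2)
X_num <- model.matrix(~ trt, d_num)

# Same entries; only the column names differ.
identical_entries <- all(unname(X_str) == unname(X_num))

# Without explicit levels, factor() sorts the labels, so the reference
# level becomes "0 0 1" (Treatment B) rather than saline.
default_ref <- levels(factor(codes))[1]
```

If the levels are left in sorted order, the fit is a reparameterization of the same model: the fitted values are unchanged, but each coefficient is then a contrast against Treatment B instead of saline.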

Robert Long
Does this answer your question? If so, please consider marking it as the accepted answer, and if not please let us know why. – Robert Long Aug 28 '20 at 19:18