2

I am having difficulty setting up this model matrix. I have been looking through some tutorials and questions online but I can't seem to find the answer to my problem. I am at the point where I have decided I probably have no idea what I am doing. Here is my sample table:

structure(list(condition = structure(c(2L, 2L, 2L, 2L, 3L, 3L, 
3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Ctrl", "Drug1", 
"Drug1_Drug2"), class = "factor"), genotype = structure(c(2L, 
2L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L), .Label = c("KO", 
"WT"), class = "factor"), replicate = c(1, 2, 3, 1, 1, 2, 3, 
1, 2, 3, 1, 2, 3, 1, 2, 3)), .Names = c("condition", "genotype", 
"replicate"), row.names = c("Drug1_WT1", "Drug1_WT2", "Drug1_WT3", 
"Drug1_KO1", "Drug1_Drug2_WT1", "Drug1_Drug2_WT2", "Drug1_Drug2_WT3", 
"Drug1_Drug2_KO1", "Drug1_Drug2_KO2", "Drug1_Drug2_KO3", "Ctrl_WT1", 
"Ctrl_WT2", "Ctrl_WT3", "Ctrl_KO1", "Ctrl_KO2", "Ctrl_KO3"), class = "data.frame")

Which looks like this when formatted:

                  condition genotype replicate
Drug1_WT1             Drug1       WT         1
Drug1_WT2             Drug1       WT         2
Drug1_WT3             Drug1       WT         3
Drug1_KO1             Drug1       KO         1
Drug1_Drug2_WT1 Drug1_Drug2       WT         1
Drug1_Drug2_WT2 Drug1_Drug2       WT         2
Drug1_Drug2_WT3 Drug1_Drug2       WT         3
Drug1_Drug2_KO1 Drug1_Drug2       KO         1
Drug1_Drug2_KO2 Drug1_Drug2       KO         2
Drug1_Drug2_KO3 Drug1_Drug2       KO         3
Ctrl_WT1               Ctrl       WT         1
Ctrl_WT2               Ctrl       WT         2
Ctrl_WT3               Ctrl       WT         3
Ctrl_KO1               Ctrl       KO         1
Ctrl_KO2               Ctrl       KO         2
Ctrl_KO3               Ctrl       KO         3

Everything in the genotype column labeled "WT" is my signal and everything labeled "KO" is my background. So what I essentially want is a way to make the following comparisons where the background (KO) is separated from the signal (WT). I am feeding my model matrix to an R package that will do the rest, I just need to make sure it is set up correctly.

  • Drug1WT v. CtrlWT (with Drug1KO and Ctrl1KO as background)
  • Drug1_Drug2WT v. CtrlWT (with Drug1_Drug2KO and CtrlKO as background)
  • Drug1WT v. Drug1_Drug2WT (with Drug1KO and Drug1_Drug2KO as background)

Additionally, I would also like to compare:

  • Drug1WT v. Drug1KO
  • Drug1_Drug2WT v. Drug1_Drug2KO
  • Ctrl_WT v. Ctrl_KO

I am not sure if these comparisons make sense the way I have my table set up. Here is what I have tried. The sample table is assigned to dat.

with(dat, model.matrix(~ genotype * condition))

This gives me:

   (Intercept) conditionDrug1 conditionDrug1_Drug2 genotypeWT
1            1              1                    0          1
2            1              1                    0          1
3            1              1                    0          1
4            1              1                    0          0
5            1              0                    1          1
6            1              0                    1          1
7            1              0                    1          1
8            1              0                    1          0
9            1              0                    1          0
10           1              0                    1          0
11           1              0                    0          1
12           1              0                    0          1
13           1              0                    0          1
14           1              0                    0          0
15           1              0                    0          0
16           1              0                    0          0
   conditionDrug1:genotypeWT conditionDrug1_Drug2:genotypeWT
1                          1                               0
2                          1                               0
3                          1                               0
4                          0                               0
5                          0                               1
6                          0                               1
7                          0                               1
8                          0                               0
9                          0                               0
10                         0                               0
11                         0                               0
12                         0                               0
13                         0                               0
14                         0                               0
15                         0                               0
16                         0                               0
attr(,"assign")
[1] 0 1 1 2 3 3
attr(,"contrasts")
attr(,"contrasts")$condition
[1] "contr.treatment"

attr(,"contrasts")$genotype
[1] "contr.treatment"

Questions:

  1. Why isn't "KO" included in any of the columns of the design matrix?
  2. How does model.matrix() choose which coefficients are part of the model? There are combinations I am expected that are missing.
  3. Based on what I am trying to achieve, how do I set up contrasts to indicate what I want to compare?

Additionally, any links to tutorials or external sources that could help me out would be appreciated.

syntonicC
  • 195
  • 1
  • 1
  • 11

1 Answers1

2

You have to take note that the design matrix $X$ has to be full rank. That means that its columns cannot be a linear combination of one another. As you do include an intercept (which is correct and should be done in general) your factors (say genotype) cannot be fully encoded. If you had a column genotypeKO and a column genotypeWT where each columns was simply an indicator of the genotype the combination of these two columns would equal your intercept. (The column of 1's in the beginning of your design matrix). That's why you do not have a column KO. model.matrix automatically excludes the first level of the expanded factor you are considering as to prevent linear combinations (read rank-deficient design matrices) to arise. This does not guarantee that the matrix will be full-rank but it makes it more plausible.

Given your design matrix $X$ the easiest way to quickly check its rank in R is to use the QR decomposition like: qr(X)$rank to reveal the rank of it. If the rank equates the number of columns (assuming you have more columns than rows) then the matrix is full-rank and the system of equations defined by $X$ has a unique solution. These online manual pages from UCLA on variables coding are very enlightening on the matter. This CV thread on polynomial contrast encoding (while not directly applicable in your case) should also be a helpful read.

usεr11852
  • 33,608
  • 2
  • 75
  • 117
  • Using QR decomposition, $X$ is rank 6. When I count columns, I assume I include the intercept column? If so, then my matrix is 16 X 6. Although the rank equals the numbers of columns I have more rows than columns. I am not sure where to go next but I may have a better idea after I read the links in your post. – syntonicC Sep 25 '16 at 19:31
  • From what I have been reading I think my `genotype` column is redundant. Is this correct? If I recode my sample table so that the `condition` column contains the information about being either WT or KO, then I get all the columns I want. But then I am not sure how to specify that the KO samples should be an interaction term. – syntonicC Sep 25 '16 at 20:19
  • Having more rows than columns is fine, actually this is exactly what you want to ensure a unique solution. Interactions terms are specified by `:`. A good succinct reference on Wilkinson Notation can be found [here](http://uk.mathworks.com/help/stats/wilkinson-notation.html) (Yes, it is for MATLAB but yes, the notation is common.) – usεr11852 Sep 25 '16 at 23:39
  • This is an excellent answer to your question @syntonicC. I think you should seriously consider "accepting" it. – Antoni Parellada Sep 26 '16 at 22:44
  • @ Antoni Parrellada Well, I'm still kind of stuck at the moment. This answer points out an error with my setup (which was very helpful) and answers my questions 2 and 3 but I am still not sure how I need to restructure my table to get the contrasts I need (I barely understand the idea of contrast coding at the moment). I have been reading the sources posted trying to understand how this all works. I was intending to come back here when I have a better idea to ask further clarification. – syntonicC Sep 26 '16 at 23:02