I am analysing a double-knockout mouse and trying to see whether restoring one or the other of its two missing genes preferentially affects specific positions in a third gene.
The data* look like this (a non-DNA simplification for the non-biologists is given at the end):
Position   DoubleKO     DoubleKO+1   DoubleKO+2
Base:      A     G      A     G      A     G
1          1     2      3     4      5     6
2          7     8      9     10     11    12
...
N (~500)   1     2      3     4      5     6
* Counted on a per-position basis from multiple runs of Sanger sequencing, so this is not NGS and the numbers are low.
So the task is to find the position or positions (positions range from 1 to ~500) at which the G:A ratio differs in a statistically significant way (I am considering only these two bases, not all four in the DNA) for
[(DoubleKO+1) vs (DoubleKO)] and [(DoubleKO+2) vs (DoubleKO)]
Note the following contrast is not of interest:
[(DoubleKO+1) + (DoubleKO+2)] vs (DoubleKO)
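To make the two contrasts concrete, here is a minimal sketch of how each position could be set up as a pair of 2x2 tables (A and G counts, restored line vs DoubleKO). Fisher's exact test is used purely as a placeholder assumption because the counts are low; whether it is the right test is exactly what is being asked. The function names and the `counts` layout are hypothetical, and the numbers are the illustrative ones from the table above.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables at least as unlikely as the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


# Illustrative (A, G) counts per position for DoubleKO, DoubleKO+1, DoubleKO+2.
counts = {
    1: ((1, 2), (3, 4), (5, 6)),
    2: ((7, 8), (9, 10), (11, 12)),
}


def contrast_pvalues(counts):
    """Per-position p-values for the two contrasts of interest only."""
    results = {}
    for pos, (ko, ko1, ko2) in counts.items():
        results[pos] = {
            "KO+1 vs KO": fisher_exact_two_sided(*ko1, *ko),
            "KO+2 vs KO": fisher_exact_two_sided(*ko2, *ko),
        }
    return results
```

Calling `contrast_pvalues(counts)` gives two p-values per position, one for each contrast of interest; the pooled contrast is deliberately not computed.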
Questions:
1. Which statistical test or tests should I use?
2. How should I correct for multiple testing in this experiment?
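For question 2, one commonly used option is the Benjamini-Hochberg false discovery rate procedure applied to all p-values from both contrasts (about 2 x 500 tests). The sketch below only illustrates that option, assuming the per-position tests are treated as a single family; the function name and the example p-values are hypothetical, and whether this is appropriate for this design is part of the question.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values, e.g. all positions from both contrasts concatenated.
example_p = [0.01, 0.04, 0.03, 0.005]
example_q = benjamini_hochberg(example_p)
```

Positions whose adjusted value falls below a chosen FDR threshold (0.05 is a common but arbitrary choice) would then be flagged.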
Please note:
- Yes, NGS is more informative. This isn't technically feasible at the moment.
- I can also get a count for each of the bases (A, T, G, C) plus uncalled (N) at each position, which covers every possible value that "base" can take, so that information is available if it is needed. The main interest is in A and G, though, since they are chemically converted into each other and cannot become C or T in this context.
- I have read the following other questions/links, and can't quite see how they should be applied to my problem:
  - https://www.broadinstitute.org/cancer/cga/mutect
  - https://www.broadinstitute.org/cancer/cga/mutsig
    - optimised for lots of cancer/normal samples
    - I am, essentially, trying to compare mutations in cancer type A vs mutations in cancer type B.
    - I am also working with genetically identical mice, not people, so the noise in my mutation rate can be assumed to be 0 (i.e. I do not expect spontaneous changes that are not a result of my treatments)
  - "Comparing mutation frequency between a case and a pool of controls"
    - here the person had one tumour vs multiple controls, i.e. a different experimental design.
Thanks in advance!
Non-DNA simplification:
Imagine that there are ~500 houses in a neighbourhood, each of which can house a family of mum, dad and two kids. We want to identify the houses that are more conducive to getting kid A or kid B (but not both) to help out with the housework, relative to just mum and dad:
FamilyStructure   JustMumAndDad         Mum+Dad+KidA          Mum+Dad+KidB
House number      HoursMum  HoursDad    HoursMum  HoursDad    HoursMum  HoursDad
1                 1         2           3         4           5         6
2                 3         4           9         5           7         6
...
500               1         2           3         4           5         6