My data are counts of mutations per gene in two groups, cases and controls. I would like to compare the two groups gene by gene. I don't think ANOVA is appropriate here, because count data are usually modeled with a Poisson distribution rather than a normal one. Here is a small subset of my data; as you can see, the two groups (cases and controls) have different sample sizes.
gene Samples_ID value Type
AB 1 28 Cases
AB 100 22 Cases
AB 101 36 Cases
AB 102 57 Cases
AB 105 29 Cases
AB 106 25 Cases
AB 108 23 Cases
AB 4928 18 Ctrl
AB 4929 18 Ctrl
AB 4930 24 Ctrl
AB 4931 20 Ctrl
AB 4932 25 Ctrl
AB 4933 15 Ctrl
AB 4934 25 Ctrl
AB 4935 22 Ctrl
AB 4936 30 Ctrl
AB 4937 15 Ctrl
AB 4938 18 Ctrl
AB 4939 21 Ctrl
FG 1 21 Cases
FG 100 16 Cases
FG 101 21 Cases
FG 102 34 Cases
FG 105 22 Cases
FG 106 23 Cases
FG 108 23 Cases
FG 4928 8 Ctrl
FG 4929 3 Ctrl
FG 4930 7 Ctrl
FG 4931 6 Ctrl
FG 4932 5 Ctrl
FG 4933 15 Ctrl
FG 4934 8 Ctrl
FG 4935 11 Ctrl
FG 4936 1 Ctrl
FG 4937 7 Ctrl
FG 4938 6 Ctrl
FG 4939 8 Ctrl
SYU 1 27 Cases
SYU 100 23 Cases
SYU 101 35 Cases
SYU 102 39 Cases
SYU 105 24 Cases
SYU 106 25 Cases
SYU 108 30 Cases
SYU 4928 5 Ctrl
SYU 4929 6 Ctrl
SYU 4930 6 Ctrl
SYU 4931 16 Ctrl
SYU 4932 5 Ctrl
SYU 4933 11 Ctrl
SYU 4934 12 Ctrl
SYU 4935 11 Ctrl
SYU 4936 15 Ctrl
SYU 4937 8 Ctrl
SYU 4938 13 Ctrl
SYU 4939 11 Ctrl
Here "gene" is the gene name, "Samples_ID" identifies the patient, "value" is the number of mutations for the given gene, and "Type" is the group.
I have tried a generalized linear model as follows:
fit <- glm(value ~ gene + Type, data = file, family = poisson())
but I'm not convinced it's the right way.
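To make concrete what I mean by a gene-wise comparison, here is a minimal sketch of the alternative I have in mind: fit one Poisson model per gene and then correct the p-values for the number of genes tested. As far as I understand, the single model above estimates one overall Type effect shared by all genes, which is not quite what I want. The toy data frame below only copies the AB and FG rows from the table, and the choice of the Benjamini-Hochberg correction is my own assumption; whether a plain Poisson model is adequate for these counts is exactly what I am unsure about.

```r
# Toy data: the AB and FG rows from the table above (7 cases, 12 controls each)
file <- data.frame(
  gene  = rep(c("AB", "FG"), each = 19),
  value = c(28, 22, 36, 57, 29, 25, 23,                        # AB cases
            18, 18, 24, 20, 25, 15, 25, 22, 30, 15, 18, 21,    # AB controls
            21, 16, 21, 34, 22, 23, 23,                        # FG cases
            8, 3, 7, 6, 5, 15, 8, 11, 1, 7, 6, 8),             # FG controls
  Type  = rep(rep(c("Cases", "Ctrl"), times = c(7, 12)), times = 2)
)

# Fit a separate Poisson GLM for each gene and keep the p-value
# of the Type coefficient (second row of the coefficient table)
per_gene_p <- sapply(split(file, file$gene), function(d) {
  fit <- glm(value ~ Type, data = d, family = poisson())
  coef(summary(fit))[2, "Pr(>|z|)"]
})

# Adjust for testing many genes at once (Benjamini-Hochberg, my assumption)
p_adj <- p.adjust(per_gene_p, method = "BH")
print(p_adj)
```

With the full data set this would produce one adjusted p-value per gene, i.e. 2780 of them.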
PS: Please give me a simple explanation. I'm not a statistician, and my background in statistics is close to zero.
Thank you in advance.
Best.
Update
The number of cases is 98
The number of controls is 40
The number of genes is 2780
Mutations are RNA editing events.