I want to compare a parameter between two independent groups of volunteers (see the MWE below). All volunteers had a baseline reading (V1) and a reading at a later visit (V2), with or without an intervention (Interv: Yes/No). The parameters are gene-expression data expressed as fold change (i.e. V2/V1). For brevity, I show only one parameter, Par1.
I came from here, but I didn't find what I wanted; how to report was added to that question but, I think, never addressed, so it may deserve a separate question like this one.
To report the central tendency, I intend to use the median, as it is less affected by outliers, perhaps together with the MAD (median absolute deviation); the MAD is not used very often, but I think it is the equivalent of the SD for the mean.
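As a sketch of how I would compute these (assuming the dfFold data frame from the dput() output further down; note that R's mad() rescales by 1.4826 by default, so constant = 1 is needed for the raw MAD):

```r
# Group-wise median and MAD of the fold changes.
# mad() multiplies by 1.4826 by default (consistency with the SD
# under normality); constant = 1 gives the raw median absolute deviation.
aggregate(Fold ~ Interv, data = dfFold,
          FUN = function(x) c(median = median(x),
                              MAD = mad(x, constant = 1)))
```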
Subjects without intervention were n = 15, whereas those with intervention were n = 19. I am not a statistician, but I know from my field that with gene-expression data it is always good to log-transform before analysis (to make the data closer to a normal distribution). To report the median, however, I used the fold-change readings because they are more intuitive to the reader and make more sense than log values.
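Judging from the values, the Log column in the dataset below is the natural logarithm of the fold change (e.g. log(181.552) ≈ 5.2015), so it can be recreated as:

```r
# log(V2/V1) = log(V2) - log(V1); natural log, judging from the data
dfFold$Log <- log(dfFold$Fold)
```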
To measure the difference between the two groups (inferential statistics), I am more inclined to use the permutation techniques available in R, such as the coin package, than the t-test, thinking they are more robust and less demanding of assumptions (confession: biologists do not respect these aliens; they will meet anybody except assumptions).
After log transformation, I ran a permutation test, a Wilcoxon test (Mann-Whitney U test), and a t-test, just for the sake of comparison and learning.
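For my own intuition I also hand-rolled the label-shuffling idea behind the permutation test; this is only a sketch (coin standardizes its linear statistic internally, but the logic is the same): shuffle the Interv labels, recompute the difference in group means of Log, and count how often the shuffled statistic is at least as extreme as the observed one.

```r
set.seed(1234)
# Observed difference in mean log fold change between the two groups
obs <- diff(tapply(dfFold$Log, dfFold$Interv, mean))
# Recompute the statistic under 9999 random relabelings of Interv
perm <- replicate(9999, {
  diff(tapply(dfFold$Log, sample(dfFold$Interv), mean))
})
mean(abs(perm) >= abs(obs))  # two-sided Monte Carlo p-value
```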
The dataset (to recreate it, assign the dput() output below back to the name, e.g. dfFold <- structure(...)):
> dput(dfFold)
structure(list(Interv = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Yes",
"No"), class = "factor"), variable = structure(c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Par1",
"Par2"), class = "factor"), Fold = c(181.552149430047, 365.287611324048,
1.37615543475921, 29.4749766533613, 7.30460900092265, 0.55729573732714,
0.856344617183982, 12.2151438678328, 8.22023432258123, 7.52091826273063,
11.3960901635291, 141.40637778502, 1.19646833351009, 4234.12741125762,
9410.18608208175, 0.11188317488791, 3.39142462873992, 1.54986235136038,
0.138010480189387, 0.150720706883034, 8.58182709551405, 1.65995914129576,
1.02664060360863, 40.9872498215584, 51.7412085216083, 3.64546611583102,
5.64546161492277, 8.92993632885674, 4.71129389070826, 1497.44850394305,
0.0209326180515642, 2.90945271783875, 5.09753127386217, 9.85421224939179
), Log = c(5.20154293711853, 5.90068501952248, 0.31929369443907,
3.38354165434038, 1.98850551891357, -0.584659233229834, -0.15508239368987,
2.50267648260455, 2.10659871505853, 2.01768823991267, 2.43327032846195,
4.95163785700696, 0.179374162082163, 8.35093254365055, 9.14954800731289,
-2.19030003343294, 1.22125007758044, 0.438166121413757, -1.98042565330713,
-1.89232677812372, 2.14964683902473, 0.506792988388275, 0.0262919218996518,
3.71326103837802, 3.9462545340732, 1.29348423542785, 1.73085196838103,
2.18940926483575, 1.54996258170508, 7.31151794138033, -3.86644666427115,
1.0679649940249, 1.62875635858491, 2.28789900330443)), .Names = c("Interv",
"variable", "Fold", "Log"), class = "data.frame", row.names = c(NA,
-34L))
What it looks like:
> head(dfFold)
Interv variable Fold Log
1 Yes Par1 181.5521494 5.2015429
2 Yes Par1 365.2876113 5.9006850
3 Yes Par1 1.3761554 0.3192937
4 Yes Par1 29.4749767 3.3835417
5 Yes Par1 7.3046090 1.9885055
6 Yes Par1 0.5572957 -0.5846592
To report the central tendency (I included the mean as well):
> median(dfFold[dfFold$Interv=="Yes",]$Fold)
[1] 11.39609
> median(dfFold[dfFold$Interv=="No",]$Fold)
[1] 3.645466
> mean(dfFold[dfFold$Interv=="Yes",]$Fold)
[1] 960.8452
> mean(dfFold[dfFold$Interv=="No",]$Fold)
[1] 86.71587
Now the tests: the permutation test, then the wilcox test from coin, and the t-test from the stats package:
> pvalue(oneway_test(Log~Interv, data=dfFold ,distribution=approximate(B=9999),conf.int=TRUE))
[1] 0.03590359
99 percent confidence interval:
0.03128298 0.04097196
Warning message:
In independence_test.IndependenceProblem(object, check = check, :
additional arguments conf.int will be ignored
> set.seed(1234)
> pvalue(wilcox_test(Log~Interv, data=dfFold,distribution=approximate(B=9999),conf.int=TRUE))
[1] 0.05890589
99 percent confidence interval:
0.05300214 0.06523398
> t.test (Log~Interv, data=dfFold)
Welch Two Sample t-test
data: Log by Interv
t = 2.1519, df = 27.43, p-value = 0.04036
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.09772892 4.04392270
sample estimates:
mean in group Yes mean in group No
3.183037 1.112211
I got a warning message from oneway_test(); can anybody help explain it?
So with the permutation test using an approximate distribution (B = 9999), the difference was significant at the 0.05 level, in agreement with the t-test but not with the wilcox test.
Now the questions:
1. If I were to report this difference based on the permutation test, what should I mention? Do I have to include the 99% confidence interval of oneway_test() (see the output)? How do I change it to 95%?
2. What is the conceptual difference between the statistic of oneway_test() and that of the t-test? Isn't permutation a computer-intensive technique that can be used with many statistical tests? When I use oneway_test(), what kind of test does this function eventually use?
Any help to raise my statistical knowledge would be appreciated. A few lines of comment on the result, as one would write in a report, would be much appreciated as well.