61

I have data from an experiment that I analyzed using t-tests. The dependent variable is interval scaled and the data are either unpaired (i.e., 2 groups) or paired (i.e., within-subjects). E.g. (within subjects):

x1 <- c(99, 99.5, 65, 100, 99, 99.5, 99, 99.5, 99.5, 57, 100, 99.5, 
        99.5, 99, 99, 99.5, 89.5, 99.5, 100, 99.5)
y1 <- c(99, 99.5, 99.5, 0, 50, 100, 99.5, 99.5, 0, 99.5, 99.5, 90, 
        80, 0, 99, 0, 74.5, 0, 100, 49.5)

However, the data are not normal, so a reviewer asked us to use something other than the t-test. As one can easily see, the data are not only non-normally distributed, but the distributions also differ between conditions: [histograms of the two conditions]

Therefore, the usual nonparametric tests, the Mann-Whitney U test (unpaired) and the Wilcoxon signed-rank test (paired), cannot be used, as they require equal distribution shapes across conditions. Hence, I decided that some resampling or permutation test would be best.

Now, I am looking for an R implementation of a permutation-based equivalent of the t-test, or any other advice on what to do with the data.

I know that there are some R packages that can do this for me (e.g., coin, perm, exactRankTests, etc.), but I don't know which one to pick. So, if somebody with some experience using these tests could give me a kick-start, that would be ubercool.

UPDATE: It would be ideal if you could provide an example of how to report the results from this test.

amoeba
Henrik

6 Answers

51

It shouldn't matter that much which package you use, since the test statistic will always be the difference in means (or something equivalent). Small differences can come from the implementation of the Monte-Carlo methods. Trying the three packages with your data, using a one-sided test for two independent groups:

DV <- c(x1, y1)
IV <- factor(rep(c("A", "B"), c(length(x1), length(y1))))
library(coin)                    # for oneway_test(), pvalue()
pvalue(oneway_test(DV ~ IV, alternative="greater", 
                   distribution=approximate(B=9999)))
[1] 0.00330033

library(perm)                    # for permTS()
permTS(DV ~ IV, alternative="greater", method="exact.mc", 
       control=permControl(nmc=10^4-1))$p.value
[1] 0.003

library(exactRankTests)          # for perm.test()
perm.test(DV ~ IV, paired=FALSE, alternative="greater", exact=TRUE)$p.value
[1] 0.003171822
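
(A small addition, not from the original answer: the coin and perm calls above are Monte-Carlo approximations, so their p-values vary slightly between runs; fixing a seed first makes a run reproducible.)

set.seed(123)                    # any fixed seed gives a reproducible MC p-value
pvalue(oneway_test(DV ~ IV, alternative="greater", 
                   distribution=approximate(B=9999)))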

To check the exact p-value with a manual calculation of all permutations, I'll restrict the data to the first 9 values.

x1 <- x1[1:9]
y1 <- y1[1:9]
DV <- c(x1, y1)
IV <- factor(rep(c("A", "B"), c(length(x1), length(y1))))
pvalue(oneway_test(DV ~ IV, alternative="greater", distribution="exact"))
[1] 0.0945907

permTS(DV ~ IV, alternative="greater", exact=TRUE)$p.value
[1] 0.0945907

# perm.test() gives different result due to rounding of input values
perm.test(DV ~ IV, paired=FALSE, alternative="greater", exact=TRUE)$p.value
[1] 0.1029412

# manual exact permutation test
idx  <- seq(along=DV)                 # indices to permute
idxA <- combn(idx, length(x1))        # all possibilities for different groups

# function to calculate difference in group means given index vector for group A
getDiffM <- function(x) { mean(DV[x]) - mean(DV[!(idx %in% x)]) }
resDM    <- apply(idxA, 2, getDiffM)  # difference in means for all permutations
diffM    <- mean(x1) - mean(y1)       # empirical difference in group means

# p-value: proportion of group means at least as extreme as observed one
(pVal <- sum(resDM >= diffM) / length(resDM))
[1] 0.0945907

coin and exactRankTests are both from the same author, but coin seems to be more general and extensive, also in terms of documentation. exactRankTests is no longer actively developed. I'd therefore choose coin (also because of informative functions like support()), unless you dislike dealing with S4 objects.
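
(An added sketch, not from the original answer: support() and dperm() let you inspect the exact permutation distribution directly, here with the reduced 9-value data from above.)

ot   <- oneway_test(DV ~ IV, alternative="greater", distribution="exact")
supp <- support(ot)                   # all attainable values of the test statistic
plot(supp, dperm(ot, supp), type="h", # exact permutation probabilities
     xlab="test statistic", ylab="probability")
# upper-tail mass at the observed statistic; up to floating-point comparison,
# this should match pvalue(ot) = 0.0945907
sum(dperm(ot, supp)[supp >= statistic(ot)])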

EDIT: for two dependent samples (the paired, within-subjects case), the syntax is

id <- factor(rep(1:length(x1), 2))    # participant factor (presumably with the full x1, y1 from the question restored)
pvalue(oneway_test(DV ~ IV | id, alternative="greater",
                   distribution=approximate(B=9999)))
[1] 0.00810081
caracal
  • Thanks for your great answer! 2 more questions: Does your second example mean that coin in fact does provide all possible permutations and is an exact test? Is there any benefit to not providing an exact test in my case? – Henrik Jan 10 '11 at 13:45
  • 12
    (+1) It is no surprise that the (unpaired) t-test yields essentially the same p-value, 0.000349. Despite what the reviewer said, the t-test *is* applicable to these data. The reason is that the *sampling distributions* of the means are approximately normal, even though the distributions of the data are not. Moreover, as you can see from the results, the t-test actually is more conservative than the permutation test. (This means that a significant result with the t-test indicates the permutation test is likely to be significant, too.) – whuber Jan 10 '11 at 14:09
  • 2
    @Henrik For certain situations (chosen test and numerical complexity), `coin` can indeed calculate the exact permutation distribution (without actually going through all permutations, there are more elegant algorithms than that). Given the choice, the exact distribution seems preferable, but the difference to a Monte-Carlo approximation with a high number of replicates should be small. – caracal Jan 10 '11 at 14:26
  • 1
    @Caracal Thanks for the clarification. One question remains: The data I presented are paired. Hence, I need the equivalent of the paired t-test. Is `oneway_test` the appropriate function? And if so, which one is correct for non-paired data? – Henrik Jan 10 '11 at 14:34
  • @Henrik An exact permutation test needs 19! = 121 645 100 408 832 000 permutations. Assuming a really fast 10^9 applications per second, that would require 3.87 years of computation. (A clever algorithm would exploit all the repetitions among the x's to reduce that to one second, but I think you would have to hand-code that.) – whuber Jan 10 '11 at 14:34
  • @Henrik You're in trouble for paired data with a nonparametric test: the p-value is around 40%. The problem is that there are some (inconsequential?) increases from x to y, a number of tied values, and some really big decreases from x to y. By ignoring the magnitude of the decreases you lose too much information. – whuber Jan 10 '11 at 14:36
  • @whuber: when using `wilcox.test` from stats with paired = T, I get a p-value of .02, which is reasonable. So I cannot replicate the .4 you mention. However, I am still unable to get a paired test to work with coin. (When using `wilcoxsign_test` I get an error: Not an unreplicated complete block design) – Henrik Jan 10 '11 at 14:52
  • @Henrik Wilcox.test assumes the distribution of x-y is *symmetric* about its mean. It is also unclear whether wilcox.test handles ties appropriately. Thus I think your p-value of 0.02 is invalid. – whuber Jan 10 '11 at 16:25
  • @whuber, re: exhaustive permutations -> The exhaustive variant of the permutation test is sometimes termed a "randomization test" (eg, by Mewhort et al, 2009, http://www.springerlink.com/content/pg17u7141845w082/) and advances by Gill (2007, discussed in the Mewhort paper) dramatically reduce the computational requirements. – Mike Lawrence Jan 10 '11 at 17:04
  • @Mike Thank you. I don't have access to the paper, so I can only guess that it might be doing something similar to what I had in mind when I suggested 4 years of calculation could be reduced to one second :-). – whuber Jan 10 '11 at 17:54
  • @whuber: The preprint of the Mewhort paper can be accessed via his website: http://psyc.queensu.ca/~hiplab/LAB_PUBS/BSC903_Mewhort_Kelly_Johns_BRM.pdf – Mike Lawrence Jan 10 '11 at 19:47
  • @caracal. Thanks for the update. But what is the rationale for not using the exact statistic but an approximation? (The results are markedly different...) – Henrik Jan 11 '11 at 09:50
  • @caracal I would be grateful if you could give some hints, as the difference is huge: p = .01 (approx) and p = .15 (exact) – Henrik Jan 11 '11 at 15:31
  • 2
    @Henrik The `coin` author wrote me that `oneway_test()` cannot calculate the exact distribution for the dependent case, you have to go with the MC approximation (only `wilcoxsign_test()` is suitable for the exact test). I didn't know this and would prefer an error in that case, but MC should be accurate enough with a high number of replicates. – caracal Jan 11 '11 at 19:30
  • @Henrik, does MC stand for Monte Carlo? In practice, is setting the parameter `distribution=approximate(B=9999)` (instead of `distribution="exact"`) what you are talking about? – toto_tico Nov 09 '15 at 21:03
  • Is there a way to use oneway_test in coin to get an estimate of the difference in means between the control and treatment groups? – News_is_Selection_Bias Dec 15 '21 at 12:24
32

A few comments are, I believe, in order.

1) I would encourage you to try multiple visual displays of your data, because they can capture things that are lost by (graphs like) histograms, and I also strongly recommend that you plot on side-by-side axes. In this case, I do not believe the histograms do a very good job of communicating the salient features of your data. For example, take a look at side-by-side boxplots:

boxplot(x1, y1, names = c("x1", "y1"))

alt text

Or even side-by-side stripcharts:

stripchart(c(x1,y1) ~ rep(1:2, each = 20), method = "jitter", group.names = c("x1","y1"), xlab = "")

alt text

Look at the centers, spreads, and shapes of these! About three-quarters of the $x1$ data fall well above the median of the $y1$ data. The spread of $x1$ is tiny, while the spread of $y1$ is huge. Both $x1$ and $y1$ are highly left-skewed, but in different ways. For example, $y1$ has five (!) repeated values of zero.
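
(An added sketch, not from the original answer: empirical CDFs drawn on common axes also make the differences in location, spread, and shape visible at a glance.)

plot(ecdf(x1), main = "Empirical CDFs", xlab = "score", xlim = c(0, 100))
plot(ecdf(y1), add = TRUE, col = "red")
legend("topleft", legend = c("x1", "y1"), col = c("black", "red"), lty = 1)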

2) You didn't explain in much detail where your data come from, nor how they were measured, but this information is very important when it comes time to select a statistical procedure. Are your two samples above independent? Are there any reasons to believe that the marginal distributions of the two samples should be the same (except for a difference in location, for example)? What were the considerations prior to the study that led you to look for evidence of a difference between the two groups?

3) The t-test is not appropriate for these data because the marginal distributions are markedly non-normal, with extreme values in both samples. If you like, you could appeal to the CLT (due to your moderately-sized sample) to use a $z$-test (which would be similar to a t-test for large samples), but given the skewness (in both variables) of your data, I would not judge such an appeal very convincing. Sure, you can use it anyway to calculate a $p$-value, but what does that do for you? If the assumptions aren't satisfied, then a $p$-value is just a statistic; it doesn't tell you what you (presumably) want to know: whether there is evidence that the two samples come from different distributions.

4) A permutation test would also be inappropriate for these data. The single and often-overlooked assumption for permutation tests is that the two samples are exchangeable under the null hypothesis. That would mean that they have identical marginal distributions (under the null). But you are in trouble, because the graphs suggest that the distributions differ both in location and scale (and shape, too). So, you can't (validly) test for a difference in location because the scales are different, and you can't (validly) test for a difference in scale because the locations are different. Oops. Again, you can do the test anyway and get a $p$-value, but so what? What have you really accomplished?

5) In my opinion, these data are a perfect (?) example that a well chosen picture is worth 1000 hypothesis tests. We don't need statistics to tell the difference between a pencil and a barn. The appropriate statement in my view for these data would be "These data exhibit marked differences with respect to location, scale, and shape." You could follow up with (robust) descriptive statistics for each of those to quantify the differences, and explain what the differences mean in the context of your original study.

6) Your reviewer is probably (and sadly) going to insist on some sort of $p$-value as a precondition to publication. Sigh! If it were me, given the differences with respect to everything, I would probably use a nonparametric Kolmogorov-Smirnov test to spit out a $p$-value that demonstrates that the distributions are different, and then proceed with descriptive statistics as above. You would need to add some noise to the two samples to get rid of ties. (And of course, this all assumes that your samples are independent, which you didn't state explicitly.)
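
(An added sketch, not from the original answer, of the KS route just described; jitter() supplies the small amount of noise needed to break the ties, and it assumes independent samples.)

set.seed(1)
ks.test(jitter(x1, amount = 0.01), jitter(y1, amount = 0.01))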

This answer is a lot longer than I originally intended it to be. Sorry about that.

  • I'm curious if you'd consider the following an appropriate quasi-visualized approach: bootstrap estimates for the moments of the two groups (means, variances, and higher moments if you so desire) then plot these estimates and their confidence intervals, looking for the degree of overlap between groups on each moment. This lets you talk about potential differences across the variety of distribution characteristics. If the data are paired, compute the difference scores and bootstrap the moments of this single distribution. Thoughts? – Mike Lawrence Jan 10 '11 at 17:21
  • 2
    (+1) Good analysis. You're perfectly right that the results are obvious and one doesn't need to press the point with a p-value. You may be a little extreme in your statement of (3), because the t-test does not require normally distributed data. If you are concerned, there exist adjustments for skewness (e.g., Chen's variant): you could see whether the p-value for the adjusted test changes the answer. If not, you're likely ok. In this particular situation, with these (highly skewed) data, the t-test works fine. – whuber Jan 10 '11 at 17:58
  • (+1) Nice catch! And very good comments. – chl Jan 10 '11 at 18:26
  • We seem to be accepting the notion that the underlying distribution is "similar" to the random instantiation. So couldn't one pose the question: are these both from beta(0.25, 0.25) and then the test would be whether they have the same (non-)centrality parameter. And wouldn't that justify using either a permutation test or Wilcoxon? – DWin Jan 10 '11 at 21:13
  • +1 Thanks for the detailed answer. One point: the data are paired, i.e., each data pair is a single participant's responses on different trial types. We constructed the trials in such a way that we expected differences. Furthermore, I have to present the results (i.e., that there are differences) as briefly as possible. So, I have to see how I can use your analysis. – Henrik Jan 11 '11 at 10:22
  • 5
    There is a lot of good info here, but it is very strongly worded & mixed w/ some things that don't make much sense to me. Eg, re #3, my understanding of the relationship b/t the z-test & the t-test is that they're essentially the same, but z is used when the SD is known a-priori & t is used when the SD is estimated from the data. I don't see how if the CLT guarantees the normality of the sampling dists this licenses the z-test while leaving the t-test invalid. I also don't think $\neq$ SD's invalidates t if the CLT covers you, so long as an adjustment (eg, Welch–Satterthwaite) is used. – gung - Reinstate Monica May 24 '12 at 16:58
6

As this question has popped up again, I can add another answer, inspired by a recent blog post (via R-Bloggers) from Robert Kabacoff, the author of Quick-R and R in Action, using the lmPerm package.

However, this method produces sharply contrasting (and very unstable) results compared to the one produced by the coin package in the answer of @caracal (the p-value of the within-subjects analysis there is 0.008). The analysis uses the data preparation from @caracal's answer as well:

x1 <- c(99, 99.5, 65, 100, 99, 99.5, 99, 99.5, 99.5, 57, 100, 99.5, 
        99.5, 99, 99, 99.5, 89.5, 99.5, 100, 99.5)
y1 <- c(99, 99.5, 99.5, 0, 50, 100, 99.5, 99.5, 0, 99.5, 99.5, 90, 
        80, 0, 99, 0, 74.5, 0, 100, 49.5)

DV <- c(x1, y1)
IV <- factor(rep(c("A", "B"), c(length(x1), length(y1))))
id <- factor(rep(1:length(x1), 2)) 

library(lmPerm)

summary(aovp( DV ~ IV + Error(id)))

produces:

> summary(aovp( DV ~ IV + Error(id)))
[1] "Settings:  unique SS "

Error: id
Component 1 :
          Df R Sum Sq R Mean Sq
Residuals 19    15946       839


Error: Within
Component 1 :
          Df R Sum Sq R Mean Sq Iter Pr(Prob)  
IV         1     7924      7924 1004    0.091 .
Residuals 19    21124      1112                
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

If you run this multiple times, the p-value jumps around between ~.05 and ~.1.

Although this is an answer to the question, let me also pose a question at the end (I can move this to a new question if desired):
Any ideas why this analysis is so unstable and produces p-values that diverge so much from the coin analysis? Did I do something wrong?

Henrik
  • 2
    It may be better to ask this as a separate question, if it really is a question you want answered, rather than another possible solution you want to list w/ the rest. I notice that you specify an error strata, but @caracal does not; that would be my first guess at the difference b/t this output & his. Also, when simulating, values typically jump around; for reproducibility, you specify the seed, eg `set.seed(1)`; for greater precision in the MC estimate, you increase the number of iterations; I'm unsure if either of those are the 'right' answer to your question, but they are probably relevant. – gung - Reinstate Monica May 24 '12 at 17:22
  • 2
    Again, I suggest benchmarking the MC results against the manual calculations using the full permutation (re-randomization) test. See the [code for your example](http://pastebin.com/ZjiQ8A2B) for a comparison of `oneway_anova()` (always close to correct result) and `aovp()` (typically far off from the correct result). I don't know why `aovp()` gives wildly varying results, but at least for this case here they're implausible. @gung the last call to `oneway_test(DV ~ IV | id, ...)` in my original answer specified the error strata for the dependent case (different syntax than used by `aov()`). – caracal May 24 '12 at 19:11
  • @caracal, you're right. I didn't look at the last code block after the edit. I was looking at the top code block--sloppy on my part. – gung - Reinstate Monica May 24 '12 at 19:25
  • I don't really need the answer. It is just another possibility that is worth mentioning here. Unfortunately it is far off from the other results, which is also worth noting. – Henrik May 25 '12 at 09:23
  • The results jump up and down because aovp() does random sampling if the number of observations is greater than maxExact, which is 10 by default. – xmjx Jul 16 '13 at 23:04
  • @xmjx So what do you recommend to obtain reasonable results? – Henrik Jul 17 '13 at 06:56
  • 1
    @Henrik run aovp with maxExact=1000. If it takes too long, set iter=1000000 and Ca=0.001. The calculation terminates when the estimated standard error of p is less than Ca*p. (Lower values give more stable results.) – xmjx Jul 19 '13 at 07:57
    @xmjx, do you know why maxExact is set (by default) to such a low value? I was comparing my results against `ezPerm` (yet another permutation test function, from the package `ez`) and the first thing that caught my attention was the difference in execution time between the two functions. I wonder how many papers have been published (or discarded) with aovp and its default parameters. – toto_tico Nov 09 '15 at 21:18
  • @xmjx, just to avoid confusion, any `maxExact` value bigger than 20 would produce the same result, because maxExact is a threshold on the **maximum number of observations allowed. If data exceeds this, Prob is used**. Based on the calculation times, `Ca` and `Iter` don't really produce any effect, unless the data of this example produce an extremely quick convergence. – toto_tico Nov 09 '15 at 22:49
6

My comments are not about implementation of the permutation test but about the more general issues raised by these data and the discussion of it, in particular the post by G. Jay Kerns.

The two distributions actually look quite similar to me EXCEPT for the group of 0s in Y1, which are much different from the other observations in that sample (next smallest is about 50 on the 0-100 scale) as well as all those in X1. I would first investigate whether there was anything different about those observations.

Second, assuming those 0s do belong in the analysis, saying the permutation test isn't valid because the distributions appear to differ begs the question. If the null were true (distributions are identical), could you (with reasonable probability) get distributions looking as different as these two? Answering that's the whole point of the test, isn't it? Maybe in this case some will consider the answer obvious without running the test, but with these smallish, peculiar distributions, I don't think I would.

Andrew Taylor
  • 1
    It appears that this should be one or more comments, not an answer. If you click the small gray "add comment", you can place your thoughts in the conversation below the question or a particular answer, where they belong. You do make substantive points here, but this is not the appropriate place for them. – gung - Reinstate Monica Aug 26 '12 at 02:32
  • 1
    @gung It takes a little reputation to be able to post a comment ;-). – whuber Aug 26 '12 at 20:51
  • 4
    This is a good point about applicability of the permutation test. As to the obviousness of the difference between the groups, perhaps it's a matter of experience :-). For intuition, because it's apparent the key difference is in the small values, we might inquire about the chances that the seven smallest in a set of 40 values happened to fall in one random subset of 20. Roughly, each has about a 1/2 chance of being in the subset or its complement, so all seven will be in the same group with chances around $2(1/2)^7 \approx .01$. This mental arithmetic provides quick initial guidance. – whuber Aug 26 '12 at 21:00
2

Are these scores proportions? If so, you certainly shouldn't be using a Gaussian parametric test, and while you could go ahead with a non-parametric approach like a permutation test or a bootstrap of the means, I'd suggest that you'll get more statistical power by employing a suitable non-Gaussian parametric approach. Specifically, any time you can compute a proportion measure within a unit of interest (e.g., a participant in an experiment), you can and probably should use a mixed effects model that specifies observations with binomially distributed error. See Dixon 2004.
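
(An added sketch, not Mike's code: one way to set up such a model, assuming each 0-100 score can be treated as "successes out of 100 trials" per participant; that binomial recoding is a simplification of the expansion Mike describes in the comments below.)

library(lme4)                                  # for glmer()
dat <- data.frame(
  id        = factor(rep(seq_along(x1), 2)),   # participant
  condition = factor(rep(c("A", "B"), each = length(x1))),
  successes = round(c(x1, y1))                 # assumption: score ~ successes out of 100
)
dat$failures <- 100 - dat$successes
summary(glmer(cbind(successes, failures) ~ condition + (1 | id),
              data = dat, family = binomial))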

Mike Lawrence
  • The scores are not proportions but estimates by participants on a 0 to 100 scale (the presented data are means of estimates on several items with that scale). – Henrik Jan 10 '11 at 13:57
  • Then non-parametrics would seem the traditional way to go. That said, I've wondered if such scale data might usefully be inferred to derive from a binomial process and thereby analyzed as such. That is, you say that each score is the mean of several items, and let's say each item is a 10 point scale, in which case I'd represent a response of, say, "8" as a series of trials, 8 of which have the value 1 and two of which have the value 0, all labelled with the same label in an "item" variable. With this expanded/binomial-ized data, you could then compute the binomial mixed effects model. – Mike Lawrence Jan 10 '11 at 14:24
  • Following from my previous comment, I should note that in the expanded/binomial-ized data, you could model the "item" variable as either a fixed or random effect. I think I'd lean toward modelling it as a fixed effect because presumably you might be interested in not only accounting for but also assessing the item differences and any possible interaction between item and other predictor variables. – Mike Lawrence Jan 10 '11 at 14:26
1

Just adding another approach: ezPerm from the ez package:

> # preparing the data
> DV <- c(x1, y1)
> IV <- factor(rep(c("A", "B"), c(length(x1), length(y1))))
> id <- factor(rep(1:length(x1), 2))
> df <- data.frame(id=id,DV=DV,IV=IV)
>
> library(ez)
> ezPerm( data = df, dv = DV, wid = id, within = IV, perms = 1000)
|=========================|100%              Completed after 17 s 
  Effect     p p<.05
1     IV 0.016     *

This seems to be consistent with the oneway_test of the coin package:

> library(coin)
> pvalue(oneway_test(DV ~ IV | id,  distribution=approximate(B=999999)))
[1] 0.01608002
99 percent confidence interval:
 0.01575782 0.01640682

However, notice that this is not the same example provided by @caracal. In his example, he includes alternative="greater", hence the difference in p-values: ~0.008 vs. ~0.016.

The aovp() function from the lmPerm package, suggested in one of the answers, produces suspiciously low p-values and runs suspiciously fast even when I try high values for the Iter, Ca and maxIter arguments:

library(lmPerm)
summary(aovp(DV ~ IV + Error(id/IV), data=df,  maxIter = 1000000000))
summary(aovp(DV ~ IV + Error(id/IV), data=df,  Iter = 1000000000))
summary(aovp(DV ~ IV + Error(id/IV), data=df,  Ca = 0.00000000001))

That said, the arguments seem to slightly reduce the variation in p-values, from between ~.03 and ~.1 (I got a bigger range than that reported by @Henrik) down to between 0.03 and 0.07.

toto_tico