On Fisher's exact test: What test would have been appropriate if the lady hadn't known the number of milk-first cups?

Question

In the famous lady tasting tea experiment by RA Fisher, the lady is informed of how many milk-first/tea-first cups there are (4 for each out of 8 cups). This respects the fixed marginal total assumption of Fisher's exact test.

I was imagining doing this test with my friend, but a thought struck me: if the lady can really tell the difference between milk-first and tea-first cups, she should be able to figure out the marginal totals of the milk-first/tea-first cups as well as which ones are which.

So here is the question: What test could have been used if RA Fisher hadn't informed the lady of the total number of milk-first and tea-first cups?

Some would argue that even if the second margin is not fixed by design, it carries little information about the lady's ability to discriminate (i.e. it's approximately ancillary) & should be conditioned on. The exact unconditional test (first proposed by Barnard I think) is more complicated because you have to calculate the maximal p-value over all possible values of a nuisance parameter. — Scortchi - Reinstate Monica, Feb 06 '15 at 15:08
In fact [Barnard's test](http://en.wikipedia.org/wiki/Barnard%27s_test) has a Wikipedia page. — Scortchi - Reinstate Monica, Feb 06 '15 at 15:11
@Scortchi what more is there to say? I wouldn't add anything to it (nor would I manage to say it so clearly and succinctly). Across your two comments I think you have a fine answer there. — Glen_b, Feb 06 '15 at 16:52
@Glen_b: Thanks. But the first point raises the very difficult question of how "nearly ancillary" a statistic has to be before you'd want to condition on it - how *do* you trade-off information loss against having a more relevant sample space? I don't have an answer. — Scortchi - Reinstate Monica, Feb 06 '15 at 17:40
That's a harder question, but one will either end up doing it (conditioning) or not - and you've described what you would do in both cases. Working out exactly when an unspecified person would do which test seems to be a harder task than is required to answer the question ("what test could have been used?") -- the two most obvious answers for what test could be used (the two sides of the debate that's been going for decades - at some level since the 1920s I think) are in your comments already. — Glen_b, Feb 06 '15 at 17:55
I don't quite understand the complicated answer given by @Scortchi. Shouldn't we use a binomial distribution if the lady is not told the number of cups prepared by each method? The null hypothesis would be that she is guessing randomly between milk-first and tea-first, so she has 0.5 probability of correctly guessing the method of preparation for each cup, right? (And the alternative would be that she's doing better than random guessing.) — lostisle, Aug 31 '17 at 21:05
There's some discussion worth looking at (among both paper and discussants) in Yates, F. (1984) "Tests of Significance for 2 × 2 Contingency Tables", *Journal of the Royal Statistical Society. Series A (General)*, Vol. 147, No. 3, pp. 426-463. — Glen_b, Sep 01 '17 at 02:13
@lostisle: Good point! The unconditional test in this case will simplify to testing the null $X\sim \mathrm{Bin}(8,\frac{1}{2})$, where $X$ is the total no. correct judgements: but that doesn't generalize to cases when the no. cups of tea presented with & without milk are unequal; & your derivation isn't quite right. Suppose the lady guesses, randomly, "milk" with $\frac{2}{3}$ probability (& "no milk" with $\frac{1}{3}$ probability). Then she guesses correctly with $\frac{2}{3}$ probability when there is milk in the tea & $\frac{1}{3}$ when there isn't. — Scortchi - Reinstate Monica, Sep 02 '17 at 14:25
Only on average, over 4 cups of tea with milk & 4 without, can we say she's a probability of $\frac{1}{2}$ of guessing correctly; & the distribution of her total of correct guesses is more concentrated about 4 than if it followed a binomial distribution. But if she guesses "milk" with $\frac{1}{2}$ probability, then she'll guess correctly with $\frac{1}{2}$ probability when there's milk in the tea & when there isn't, & the distribution of her total of correct guesses does follow a binomial distribution. So there is a nuisance parameter to consider, ... — Scortchi - Reinstate Monica, Sep 02 '17 at 14:30
... but when the no. cups of tea presented with & without milk are equal, the simple null that maximizes the p-value is that she guesses "milk" with probability $\frac{1}{2}$; if she were presented with 5 cups with milk in & 3 without that wouldn't be so. Furthermore you might doubt the usefulness of the total of correct guesses as a test statistic if she could get 5 right just by saying "milk" for every cup but only 3 right by saying "no milk" for every cup - wouldn't correct judgements that there was no milk carry greater weight? [Now that I've thought about this ... — Scortchi - Reinstate Monica, Sep 02 '17 at 14:39
... I realize I shouldn't have said "*The* unconditional test in this case will simplify [...]". Different test statistics can be used for unconditional tests, including the one you suggest, & they don't invariably put possible tables in the same order.] — Scortchi - Reinstate Monica, Sep 02 '17 at 15:14
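
The point made in the comments above, about a lady who guesses at random but favours "milk", can be checked numerically. The following is only a sketch. It assumes 4 milk-first and 4 tea-first cups and a lady who says "milk" independently with probability `q` for every cup; the helper names `total_correct_pmf` and `pmf_var` are made up for this illustration.

```r
# Hypothetical helper: distribution of the total number of correct guesses
# when the lady says "milk" with probability q, whatever is in the cup.
total_correct_pmf <- function(q, n_milk = 4, n_tea = 4) {
  a <- dbinom(0:n_milk, n_milk, q)      # correct calls among milk-first cups
  b <- dbinom(0:n_tea, n_tea, 1 - q)    # correct calls among tea-first cups
  pmf <- numeric(n_milk + n_tea + 1)
  for (i in 0:n_milk) {
    for (j in 0:n_tea) {
      pmf[i + j + 1] <- pmf[i + j + 1] + a[i + 1] * b[j + 1]
    }
  }
  pmf
}

# Hypothetical helper: variance of a pmf on 0, 1, ..., length(pmf) - 1.
pmf_var <- function(pmf) {
  x <- seq_along(pmf) - 1
  m <- sum(x * pmf)
  sum((x - m)^2 * pmf)
}

pmf_half  <- total_correct_pmf(1/2)   # coincides with Bin(8, 1/2)
pmf_third <- total_correct_pmf(2/3)   # same mean of 4, smaller spread

pmf_var(pmf_half)    # variance when q = 1/2
pmf_var(pmf_third)   # variance when q = 2/3
pmf_third[9]         # chance of 8 correct when q = 2/3
```

Under these assumptions the variance is 8q(1-q), so it is largest at q = 1/2, and the chance of getting all 8 right is also largest there, which is why q = 1/2 is the least favourable null value when the two kinds of cup are equally numerous.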

Scortchi - Reinstate Monica · Accepted Answer · 2016-02-19T16:46:20.490

Some would argue that even if the second margin is not fixed by design, it carries little information about the lady's ability to discriminate (i.e. it's approximately ancillary) & should be conditioned on. The exact unconditional test (first proposed by Barnard) is more complicated because you have to calculate the maximal p-value over all possible values of a nuisance parameter, viz the common Bernoulli probability under the null hypothesis. More recently, maximizing the p-value over a confidence interval for the nuisance parameter has been proposed: see Berger (1996), "More Powerful Tests from Confidence Interval p Values", The American Statistician, 50, 4; exact tests having the correct size can be constructed using this idea.
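
In symbols, the maximization can be written as follows. The notation is an assumption introduced here for illustration and is not taken from the references: $X_1$ and $X_2$ are the numbers of cups called "milk-first" among the $n_1$ milk-first and $n_2$ tea-first cups, $T$ is whichever test statistic is used to order the tables, $t_{\text{obs}}$ is its observed value, and $\pi$ is the common Bernoulli probability under the null.

$$p = \sup_{0 \le \pi \le 1} \; \sum_{(x_1,x_2):\, T(x_1,x_2) \ge t_{\text{obs}}} \binom{n_1}{x_1}\binom{n_2}{x_2}\, \pi^{x_1+x_2}\,(1-\pi)^{\,n_1+n_2-x_1-x_2}$$

The confidence-interval variant replaces the supremum over $[0,1]$ with a supremum over a confidence interval for $\pi$ and then adds the interval's error rate to the result.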

Fisher's Exact Test also arises as a randomization test, in Edgington's sense: a random assignment of the experimental treatments allows the distribution of the test statistic over permutations of these assignments to be used to test the null hypothesis. In this approach the lady's determinations are considered as fixed (& the marginal totals of milk-first and tea-first cups are of course preserved by permutation).
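
A minimal sketch of that randomization argument in R, assuming for illustration that the lady called all 8 cups correctly (the vectors `truth` and `guess` are hypothetical data, not from the original experiment). It enumerates every way the 4 milk-first treatments could have been assigned to the 8 cups, keeps her calls fixed, and compares the result with `fisher.test()`.

```r
# Hypothetical data: actual preparation and the lady's calls (all correct here).
truth <- c(rep("milk", 4), rep("tea", 4))
guess <- truth

# All choose(8, 4) = 70 ways of assigning "milk-first" to 4 of the 8 cups.
assignments <- combn(8, 4)

# Test statistic: number of milk-first cups that she called "milk".
# Her calls stay fixed; only the treatment assignment is permuted.
stat_perm <- apply(assignments, 2, function(s) sum(guess[s] == "milk"))
stat_obs  <- sum(guess[truth == "milk"] == "milk")

# One-sided randomization p-value: share of assignments at least as extreme.
p_perm <- mean(stat_perm >= stat_obs)

# The same p-value from Fisher's exact test on the 2 x 2 table.
p_fisher <- fisher.test(table(truth, guess), alternative = "greater")$p.value

c(randomization = p_perm, fisher = p_fisher)
```

With all cups called correctly, only one of the 70 assignments is as extreme as the observed one, so both routes give 1/70.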

Can [`Barnard::barnardw.test()`](http://finzi.psych.upenn.edu/R/library/Barnard/html/barnardw.test.html) be used here? What difference in computational complexity can be expected in practice? — krlmlr, Aug 21 '15 at 09:28
I'm not familiar with that package, but the help page you link to references exactly the test I was talking about. See also [`Exact`](https://cran.r-project.org/web/packages/Exact/index.html). As to computational complexity, I don't know - it's going to depend on the maximization algorithm used. — Scortchi - Reinstate Monica, Aug 21 '15 at 09:39

score 3 · Answer 2 · answered Jun 26 '15 at 16:55

Today, I read the first chapters of "The Design of Experiments" by RA Fisher, and one of the paragraphs made me realize the fundamental flaw in my question.

That is, even if the lady can really tell the difference between milk-first and tea-first cups, I can never prove she has that ability "by any finite amount of experimentation". For this reason, as an experimenter, I should start with the assumption that she doesn't have the ability (the null hypothesis) and try to disprove that. And the original experimental design (Fisher's exact test) is a sufficient, efficient, and justifiable procedure for doing so.

Here is the excerpt from "The Design of Experiments" by RA Fisher:

It might be argued that if an experiment can disprove the hypothesis that the subject possesses no sensory discrimination between two different sorts of object, it must therefore be able to prove the opposite hypothesis, that she can make some such discrimination. But this last hypothesis, however reasonable or true it may be, is ineligible as a null hypothesis to be tested by experiment, because it is inexact. If it were asserted that the subject would never be wrong in her judgments we should again have an exact hypothesis, and it is easy to see that this hypothesis could be disproved by a single failure, but could never be proved by any finite amount of experimentation.

Sextus Empiricus · Answer 3 · 2018-12-20T19:38:36.077

Barnard's test is used when the nuisance parameter is unknown under the null hypothesis.

However, in the lady tasting tea test you could argue that the nuisance parameter can be set at 0.5 under the null hypothesis (the uninformed lady has a 50% probability of correctly guessing a cup).

Then the number of correct guesses, under the null hypothesis, follows a binomial distribution: 8 cups, each guessed correctly with 50% probability.
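
As a rough sketch of that binomial test in R, using the total number of correct guesses as the test statistic: the counts 8 and 6 below are hypothetical outcomes chosen only for illustration.

```r
# One-sided binomial test of H0: each cup is guessed correctly with probability 0.5.
p_all_correct <- binom.test(8, 8, p = 0.5, alternative = "greater")$p.value
p_six_correct <- binom.test(6, 8, p = 0.5, alternative = "greater")$p.value

# The same tail probabilities computed directly from the binomial distribution.
tail_8 <- pbinom(7, 8, 0.5, lower.tail = FALSE)   # P(X >= 8)
tail_6 <- pbinom(5, 8, 0.5, lower.tail = FALSE)   # P(X >= 6)

c(all_correct = p_all_correct, six_correct = p_six_correct)
```

The total of correct guesses is only one possible test statistic; other statistics can order the tables differently and so give a different p-value for the same outcome.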

On other occasions you may not have this trivial 50% probability for the null hypothesis, and without fixed margins you may not know what that probability should be. In that case you need Barnard's test.

Even if you ran Barnard's test on the lady tasting tea experiment, the nuisance parameter would come out as 50% anyway (when the outcome is all correct guesses): the value of the nuisance parameter with the highest p-value is 0.5, which results in the trivial binomial test (strictly, it is a combination of two binomial tests, one for the four milk-first cups and one for the four tea-first cups).

> library(Barnard)
> barnard.test(4,0,0,4)

Barnard's Unconditional Test

           Treatment I Treatment II
Outcome I            4            0
Outcome II           0            4

Null hypothesis: Treatments have no effect on the outcomes
Score statistic = -2.82843
Nuisance parameter = 0.5 (One sided), 0.5 (Two sided)
P-value = 0.00390625 (One sided), 0.0078125 (Two sided)

> dbinom(8,8,0.5)
[1] 0.00390625

> dbinom(4,4,0.5)^2
[1] 0.00390625

Below is how it would go for a more complicated outcome, where not all guesses are correct (e.g. 2 versus 4). Counting which outcomes are and are not extreme then becomes a bit more difficult.

(Note as well that, for a 4-2 result, Barnard's test uses a nuisance parameter p=0.686, which you could argue is not correct; the p-value for a 50% probability of answering 'tea first' would be 0.08203125. This becomes even smaller when you consider a different region instead of the one based on Wald's statistic, although defining the region is not so easy.)

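out <- rep(0,1000)            # p-value for each value of the nuisance parameter
for (k in 1:1000) {
  p <- k/1000                 # nuisance parameter: common probability under the null
  ps <- matrix(rep(0,25),5)   # probability for outcome i,j
  ts <- matrix(rep(0,25),5)   # distance of outcome i,j (Wald-type statistic with pooled variance)
  for (i in 0:4) {
    for (j in 0:4) {
      ps[i+1,j+1] <- dbinom(i,4,p)*dbinom(j,4,p)
      pt <- (i+j)/8           # pooled proportion
      p1 <- i/4
      p2 <- j/4
      ts[i+1,j+1] <- (p2-p1)/sqrt(pt*(1-pt)*(0.25+0.25))
    }
  }
  cases <- ts < ts[2+1,4+1]   # outcomes less extreme than the observed i=2, j=4
  cases[1,1] <- TRUE          # ts is NaN (0/0) at (0,0) and (4,4): count both as not extreme
  cases[5,5] <- TRUE
  out[k] <- 1-sum(ps[cases])  # probability of outcomes at least as extreme as the observed one
}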

> max(out)
[1] 0.08926748
> barnard.test(4,2,0,2)

Barnard's Unconditional Test

           Treatment I Treatment II
Outcome I            4            2
Outcome II           0            2

Null hypothesis: Treatments have no effect on the outcomes
Score statistic = -1.63299
Nuisance parameter = 0.686 (One sided), 0.314 (Two sided)
P-value = 0.0892675 (One sided), 0.178535 (Two sided)
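
The two p-values quoted for the 4-2 result can be reproduced with a small function of the nuisance parameter. This is a sketch under the same assumptions as the loop above (4 cups per row, the pooled statistic, observed outcome i = 2, j = 4); the name `pval_at` is hypothetical and the function is not part of the `Barnard` package.

```r
# Hypothetical helper: one-sided p-value for the observed outcome (i = 2, j = 4)
# at a given value p of the nuisance parameter.
pval_at <- function(p, n = 4, obs = c(2, 4)) {
  grid <- expand.grid(i = 0:n, j = 0:n)            # all possible outcomes
  pt <- (grid$i + grid$j) / (2 * n)                # pooled proportion
  ts <- (grid$j - grid$i) / n / sqrt(pt * (1 - pt) * (2 / n))
  ts[is.nan(ts)] <- 0                              # (0,0) and (n,n): treated as not extreme
  t_obs <- ts[grid$i == obs[1] & grid$j == obs[2]]
  extreme <- ts >= t_obs                           # at least as extreme as observed
  sum(dbinom(grid$i[extreme], n, p) * dbinom(grid$j[extreme], n, p))
}

p_grid <- (1:1000) / 1000
p_vals <- sapply(p_grid, pval_at)

pval_at(0.5)   # p-value with the nuisance parameter fixed at 0.5
max(p_vals)    # p-value maximized over the grid, as in Barnard's test
```

With these settings the extreme region contains the outcomes (0,4), (0,3), (1,4), (0,2) and (2,4), so at p = 0.5 the p-value is 21/256, the 0.08203125 quoted above.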
