Adjusting the p-value for adaptive sequential analysis (for chi square test)?

Question

I wish to know what statistical literature is relevant for the following problem, and maybe even an idea on how to solve it.

Imagine the following problem:

We have 4 possible treatments for some disease. In order to check which treatment is better, we perform a special trial. In the trial, we start by having no subjects, then, one by one, more subjects are entered into the trial. Each patient is randomly allocated to one of the 4 possible treatments. The end result of a treatment is either "healthy" or "still sick", and let us say we can know this result instantly. This means that at any given point, we can create a two by four contingency table, saying how many of our subjects fell into which treatment/end-result.

At any point we can test the contingency table (for example, using a chi-square test) to see whether any of the 4 treatments differs significantly from the others. If one of them is better than all the rest, we stop the trial and choose it as the "winner". If some treatment is shown to be worse than all the other three, we drop it from the trial and stop giving it to future patients.

However, the problem here is how to adjust the p-value for the fact that the test can be performed at any given point, that the successive tests are correlated, and that the adaptive nature of the trial changes the process itself (for example, when some treatment is found to be "bad" and is dropped).

Wald came up with his sequential probability ratio test (SPRT) to create a stopping rule, the number of subjects that you need to have evidence against the null. See my explanation here: http://stats.stackexchange.com/a/16120/401 This only tests a single hypothesis, though. But, when you propose a chi-squared test, that is only a single hypothesis (all treatments are equally effective). It seems that you could adjust the "primary" p-value in my post for multiple testing and do several tests. I would have to think more about how to incorporate the changing of the set of treatments. — Charlie, Feb 01 '12 at 15:54
I just want to note that there is a variation called "Group Sequential Analysis" dealing with more than one parameter. The book _Clinical Statistics: Introducing Clinical Trials, Survival Analysis, and Longitudinal Data Analysis_ could help according to various sources, but I have never read it personally. — mlwida, Feb 02 '12 at 08:56
I cannot emphasize enough how interesting this question is. Solving it will also answer a lot of questions regarding A/B tests (same task, but the error costs are ridiculously lower). — mlwida, Feb 02 '12 at 08:59
The book *Group Sequential Methods with Applications to Clinical Trials* by Jennison and Turnbull covers many such sequential trial designs. I don’t remember if the four-treatment design is covered (but I guess this is just a logistic regression model with three dummy variables), but it’s a nice book, and very well worth reading if you’re interested in problems like this. (And @steffen, the A/B-test (i.e., simple binomial problem) *is* covered in the book.) — Karl Ove Hufthammer, Apr 17 '15 at 20:16

score 2 · Answer 1 · answered Jul 13 '15 at 20:57

The area of sequential clinical trials has been explored substantially in the literature. Some of the notable researchers are Scott Emerson, Tom Fleming, David DeMets, Stephen Senn, and Stuart Pocock, among others.

It's possible to specify an "alpha-spending rule". The term has its origins in the nature of frequentist (non-Fisherian) testing, where each action that increases the risk of a false positive finding must be paid for with reduced power in order to keep the test at the correct size. However, the majority of such tests require that the "stopping rules" be prespecified based on the information bounds of the study. (As a reminder, more information means greater power when the null is false.)
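
As an illustration of what "spending" alpha means, here is a minimal Python sketch (not part of the original answer) that evaluates two widely used spending functions of the Lan-DeMets type at a few interim looks. The function names, the overall $\alpha = 0.05$, and the four equally spaced information fractions are assumptions chosen for the example. Turning the spent alpha into actual stopping boundaries needs further numerical work that is not shown here.

```python
from math import e, log
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution


def pocock_type_spend(t, alpha=0.05):
    """Cumulative alpha spent at information fraction t in (0, 1], Pocock-type."""
    return alpha * log(1.0 + (e - 1.0) * t)


def obrien_fleming_type_spend(t, alpha=0.05):
    """Cumulative alpha spent at information fraction t in (0, 1], O'Brien-Fleming-type."""
    z = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return 2.0 * (1.0 - _STD_NORMAL.cdf(z / t ** 0.5))


# Hypothetical design: four equally spaced looks (an assumption for illustration).
looks = [0.25, 0.5, 0.75, 1.0]
for t in looks:
    print(t, round(pocock_type_spend(t), 5), round(obrien_fleming_type_spend(t), 5))
```

Both functions spend the full $\alpha$ by the final look ($t = 1$). The O'Brien-Fleming-type function spends very little at early looks, while the Pocock-type function spends it more evenly.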

It sounds like what you are interested in is a continuous monitoring process in which each event time warrants a "look" at the data. To the best of my knowledge, such a test has no power. It can be done with Bayesian analysis, where the posterior is continuously updated as a function of time and Bayes factors, rather than $p$-values, are used to summarize the evidence.
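
To give a flavour of the Bayesian alternative, the Python sketch below (an illustration, not taken from the answer) updates a Beta posterior for the cure probability of a single treatment after every patient. It also reports the Bayes factor of "the cure probability has a Beta(1, 1) prior" against the point null "the cure probability equals 0.5". The prior, the null value, and the outcome sequence are all made-up assumptions. A real four-arm trial would need a model that compares the arms with each other.

```python
from math import exp, lgamma, log


def log_beta(a, b):
    """Natural log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def bayes_factor_10(successes, failures, theta0=0.5, prior_a=1.0, prior_b=1.0):
    """Bayes factor of H1: theta ~ Beta(prior_a, prior_b) against H0: theta = theta0."""
    # Marginal likelihood of the observed sequence under H1 (Beta-Bernoulli model).
    log_m1 = log_beta(prior_a + successes, prior_b + failures) - log_beta(prior_a, prior_b)
    # Likelihood of the same sequence under the point null.
    log_m0 = successes * log(theta0) + failures * log(1.0 - theta0)
    return exp(log_m1 - log_m0)


# Hypothetical outcomes for one treatment arm: 1 = healthy, 0 = still sick.
outcomes = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
s = f = 0
for n, cured in enumerate(outcomes, start=1):
    s += cured
    f += 1 - cured
    posterior_mean = (1.0 + s) / (2.0 + n)  # mean of Beta(1 + s, 1 + f)
    print(n, s, f, round(posterior_mean, 3), round(bayes_factor_10(s, f), 3))
```

Because the posterior after $n$ patients is simply the prior updated with all the data seen so far, it can be inspected after every patient without any "alpha" being spent; how to turn it into a stopping rule is a separate design decision.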

See

[1] www.rctdesign.org/

+1. I posted another answer where I use a simulation to compute the type I error rate of the suggested procedure. This allows one to choose a nominal alpha such that the test has the correct size. I wonder what you think about it. — amoeba, Jul 14 '15 at 20:08

amoeba · Answer 2 · 2015-07-16T23:27:59.350

This sounds like a simulation is in order.

So I simulated your procedure as follows: $N=1000$ people are added to the trial one by one, each randomly assigned to one of the $4$ groups. The outcome of the treatment for each person is chosen at random (i.e. I am simulating the null hypothesis that all treatments have zero effect). After adding each person, I perform a chi-squared test on the $4 \times 2$ contingency table and check if $p\le \alpha$. If it is, then (and only then) I additionally perform chi-squared tests on the reduced $2 \times 2$ contingency tables to test each group against the other three groups pooled together. If one of these four further tests comes out significant (with the same $\alpha$), then I check whether this treatment performs better or worse than the other three pooled together. If worse, I kick this treatment out and continue adding people. If better, I stop the trial. If all $N$ people are added without any winning treatment, the trial is over (note that the results of my analysis will strongly depend on $N$).

Now we can run this many times and find out in what fraction of runs one of the treatments comes out as a winner; these are false positives. If I run it 1000 times for nominal $\alpha=0.05$, I get 282 false positives, i.e. a type I error rate of $0.28$.
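
For readers without Matlab, the procedure can be sketched in Python using only the standard library. This is a loose port made under a few assumptions, not the answer's own code: the function names are invented, the first row of the table is arbitrarily taken as the favourable outcome, a test with an empty row or column is treated as "not significant", and the chi-square tail probability is computed from a closed form that is valid for integer degrees of freedom.

```python
import random
from math import erfc, exp, gamma, sqrt


def chi2_sf(x, df):
    """Upper tail P(X > x) of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df % 2 == 0:
        term, total = 1.0, 1.0  # k = 0 term of sum_k h^k / k!
        for k in range(1, df // 2):
            term *= h / k
            total += term
        return exp(-h) * total
    total = 0.0
    for k in range(1, (df - 1) // 2 + 1):
        total += h ** (k - 0.5) / gamma(k + 0.5)
    return erfc(sqrt(h)) + exp(-h) * total


def chi2_pvalue(table):
    """Pearson chi-square p-value for a contingency table given as a list of rows."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    df = (len(row_tot) - 1) * (len(col_tot) - 1)
    if df < 1 or 0 in row_tot or 0 in col_tot:
        return 1.0  # test undefined; treated as "no evidence" (assumption)
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            stat += (observed - expected) ** 2 / expected
    return chi2_sf(stat, df)


def run_trial(n_subjects, alpha, rng):
    """Simulate one trial under the null; True if a 'winner' is (falsely) declared."""
    counts = [[0] * 4 for _ in range(2)]  # rows: outcome, columns: arms still in the trial
    for _ in range(n_subjects):
        arm = rng.randrange(len(counts[0]))
        outcome = rng.randrange(2)  # both outcomes equally likely for every arm
        counts[outcome][arm] += 1
        if chi2_pvalue(counts) >= alpha:
            continue
        for g in range(len(counts[0])):
            rest = [sum(row) - row[g] for row in counts]
            contrast = [[counts[0][g], rest[0]], [counts[1][g], rest[1]]]
            if chi2_pvalue(contrast) < alpha:
                # Cross-multiplied comparison of the odds of outcome 0, arm g vs the rest.
                if counts[0][g] * rest[1] < rest[0] * counts[1][g]:
                    for row in counts:  # arm g looks worse: drop it and keep recruiting
                        del row[g]
                    break
                return True  # arm g looks better: stop with a false positive
    return False


def false_positive_rate(n_rep, n_subjects, alpha, seed=0):
    """Fraction of simulated null trials that end with a declared winner."""
    rng = random.Random(seed)
    wins = sum(run_trial(n_subjects, alpha, rng) for _ in range(n_rep))
    return wins / n_rep


if __name__ == "__main__":
    # Smaller than the 1000 x 1000 runs described above, to keep the run time short.
    for alpha in (0.001, 0.01, 0.05):
        print(alpha, false_positive_rate(200, 200, alpha))
```

The exact rates depend on the number of subjects, the number of replications, and the random seed, so they should not be expected to reproduce the figures quoted in this answer.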

We can repeat this whole analysis for several values of the nominal $\alpha$ and see what actual error rate we get:

$$\begin{array}{cc}\text{nominal } \alpha & \text{actual error rate} \\ 0.05 & \sim 0.28 \\ 0.01 & \sim 0.06 \\ 0.001 & \sim 0.008\end{array}$$

So if you want the actual error rate to be held at, say, the $0.05$ level, you should choose a nominal $\alpha$ of around $0.008$, although it would of course be better to run a longer simulation to estimate this more precisely.
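
One way to read a nominal $\alpha$ off the table above is to interpolate between the two simulated points that bracket the target error rate. The Python sketch below does this linearly on the log-log scale. That choice of scale is an assumption made for illustration, and the function names are invented. The sketch also computes the Monte Carlo standard error of a simulated rate, which shows why a longer simulation would help.

```python
from math import exp, log, sqrt

# (nominal alpha, simulated error rate) pairs taken from the table above.
simulated = [(0.001, 0.008), (0.01, 0.06), (0.05, 0.28)]


def nominal_alpha_for(target, points):
    """Interpolate linearly on the log-log scale between the bracketing points."""
    for (a_lo, e_lo), (a_hi, e_hi) in zip(points, points[1:]):
        if e_lo <= target <= e_hi:
            frac = (log(target) - log(e_lo)) / (log(e_hi) - log(e_lo))
            return exp(log(a_lo) + frac * (log(a_hi) - log(a_lo)))
    raise ValueError("target error rate is outside the simulated range")


def mc_standard_error(rate, n_rep):
    """Binomial standard error of a rate estimated from n_rep simulated trials."""
    return sqrt(rate * (1.0 - rate) / n_rep)


print(round(nominal_alpha_for(0.05, simulated), 4))
print(round(mc_standard_error(0.06, 1000), 4))
```

With these inputs the interpolated nominal $\alpha$ comes out at roughly $0.008$. The standard error of the simulated $0.06$ is about $0.0075$ with 1000 replications, so the interpolated value is only a rough guide.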

My quick and dirty code in Matlab is below. Please note that this code is brain-dead and not optimized at all; everything runs in loops and is horribly slow. It can probably be accelerated a lot.

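function seqAnalysis()
    %// nominal significance levels to evaluate
    alphas = [0.001 0.01 0.05];
    falsePositives = zeros(size(alphas));
    for a = 1:length(alphas)
        %// number of false positives in 1000 simulated trials of N = 1000 subjects
        falsePositives(a) = trials_run(1000, 1000, alphas(a));
    end
    disp(num2str([alphas; falsePositives]))
end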

function outcome = trials_run(Nrep, N, alpha)
    outcomes = zeros(1,Nrep);
    for rep = 1:Nrep
        if mod(rep,10) == 0
            fprintf('.')            
        end
        outcomes(rep) = trial(N, alpha);
    end
    fprintf('\n')
    outcome = sum(outcomes);
end


function result = trial(N, alpha)
    outcomes = zeros(2,4);

    result = 0;
    winner = [];

    %// adding subjects one by one
    for subject = 1:N
        group = randi(size(outcomes,2));
        outcome = randi(2);    
        outcomes(outcome, group) = outcomes(outcome, group) + 1;

        %// if groups are significantly different
        if chisqtest(outcomes) < alpha
            %// compare each treatment against the rest
            for group = 1:size(outcomes,2)
                contrast = [outcomes(:, group) ...
                            sum(outcomes(:, setdiff(1:size(outcomes,2), group)),2)];
                %// if significantly different
                if chisqtest(contrast) < alpha
                    %// check if better or worse (row 1 is taken as the favourable outcome)
                    if contrast(1,1)/contrast(2,1) < contrast(1,2)/contrast(2,2)
                        %// kick out this group
                        outcomes = outcomes(:, setdiff(1:size(outcomes,2), group));
                    else
                        %// winner!
                        winner = group;
                    end
                    break
                end
            end
        end

        if ~isempty(winner)
            result = 1;    
            break
        end
    end
end

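function p = chisqtest(x)
    %// Pearson chi-square test of independence for the contingency table x
    e = sum(x,2)*sum(x,1)/sum(x(:));    %// expected counts under independence
    X2 = (x-e).^2./e;
    X2 = sum(X2(:));                    %// NaN if any expected count is zero
    df = prod(size(x)-[1 1]);
    p = 1-chi2cdf(X2,df);               %// NaN < alpha is false, so no rejection
end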
