
According to this article, there's an issue with concluding a test as soon as it reaches significance (the "peeking effect"). Instead you need to determine the sample size ahead of time, and only look at the results once the test has gathered the predetermined number of observations.

This article also says that "statistical significance is not a stopping rule", and that just because a variation has a 99% probability of beating the original doesn't mean you can stop early. What I don't fully understand is *why*. If something has a 99% probability of beating the original, doesn't that mean that, given the observed data, I'd make the correct decision 99% of the time if I stopped now?

So I have two questions:

  1. Doesn't a confidence interval/significance level already account for the sample size? That is, even if the test swings wildly early on, doesn't an early 99% confidence mean that, given the few observations so far, the results are already extreme enough to fall outside the 99% of outcomes we'd expect under the null? Isn't that the whole point of significance? (I try to make this concrete with a simulation sketch below the second question.)

  2. Let's say I determine that I need to run each variation 10K times. At the end of this I find there is no conclusive evidence, so I decide to let the test run for another 10K observations per variation. How does this affect power and significance?
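To make question 1 concrete, here is a rough simulation sketch of the procedure I have in mind: check a running two-proportion z-test after every batch and declare a winner as soon as p < 0.01, even though both variations have exactly the same conversion rate. The conversion rate, batch size, and number of peeks are just placeholders I made up.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
p_true = 0.05      # identical conversion rate for A and B, so the null is true
batch = 500        # observations added per variation between peeks
n_peeks = 50       # how many times we look at the running test
alpha = 0.01
n_sims = 2000

false_positives = 0
for _ in range(n_sims):
    conv_a = conv_b = 0
    n = 0
    for _ in range(n_peeks):
        conv_a += rng.binomial(batch, p_true)
        conv_b += rng.binomial(batch, p_true)
        n += batch
        # ordinary pooled two-proportion z-test at this peek
        p_pool = (conv_a + conv_b) / (2 * n)
        se = np.sqrt(2 * p_pool * (1 - p_pool) / n)
        if se == 0:
            continue
        z = (conv_a / n - conv_b / n) / se
        if 2 * stats.norm.sf(abs(z)) < alpha:
            false_positives += 1   # we would have declared a winner and stopped
            break

print("nominal alpha:", alpha)
print("false-positive rate with peeking:", false_positives / n_sims)
```

If the article is right, the realized false-positive rate here should come out well above the nominal 1%, even though every individual test is computed correctly for its sample size.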

L Xandor
    The idea in the article is ok but its recommendation is not a great one. There *are* good ways to perform sequential tests where the sample size is not determined ahead of time. That seems to accord with the intuition behind the question (1). The main problem doesn't really stem from performing a sequence of tests, but rather that when you compute "confidence" after each observation, you have to do it *correctly*. The wrong way is to use a calculation designed for a *single* decision made from a sample of a fixed size. – whuber Nov 17 '15 at 18:25
    @whuber Can you give a citation for a decent intro text or two about good ways to perform sequential tests? – Alexis Nov 17 '15 at 19:11
    @Alexis There is a discussion in Kendall & Stuart's *Advanced Theory of Statistics* (fifth ed.). See Volume 2, chapter 24. Some of the papers are useful, too. I recently used the results from Bill Meeker's work of 30 years ago (related to his PhD thesis) with some success on an A/B testing problem. – whuber Nov 17 '15 at 21:38
  • I understand that if you check daily whether it has reached significance, you're giving the test multiple "chances", which will increase the Type I error rate and mess up significance. But it still seems to me that if your test reaches 99% significance in 24 hours and you act on it, you should be making the right decision 99% of the time? In his example, isn't getting only 1 conversion in 110 observations such an extreme result that we should only expect it 1 in 100 times if H0 (= no difference) were true? So was his example just a freak accident? – L Xandor Nov 18 '15 at 00:25
  • (1) If your significance level is higher than you think, the p-values are similarly wrong. (2) With enough peeking your true p-value might exceed 0.5, even when your calculated p-value is 0.01. (3) Even if it were correctly calculated, a p-value of 0.01 *doesn't* mean "you have a 99% chance of making the right decision". – Glen_b Nov 18 '15 at 01:52
  • You might find some value in the discussion [here](http://stats.stackexchange.com/questions/181710/testing-for-significance-after-every-subject/181714#181714) – Glen_b Nov 18 '15 at 01:56
  • @Glen_b OK I worded it incorrectly. I mean correct 99% of the time with regards to Type I error. That is, if I perform an infinite number of tests, and after 24 hours declare a winner only if the p-value is less than 0.01, then I should have a type I error of 0.01 or less? That is, I will only incorrectly declare a winner in at most 1% of all tests? Is that an accurate statement? – L Xandor Nov 18 '15 at 16:13
  • I really don't understand what you're saying there. – Glen_b Nov 18 '15 at 16:16
  • @Glen_b Run test 24 hours. Declare winner if (and only if) p-value is < 0.01 (and regardless of nominal sample size).Repeat infinitely. Type 1 error will be 1% or less? I.e. I will only incorrectly declare a winner in 1% of tests (or less)? – L Xandor Nov 18 '15 at 21:02
  • It's still unclear, I am sorry. Hypothesis tests don't generally have a "winner"; the null and alternative are treated differently. When you say "declare a winner" what test are you doing? What's the null, and what's the alternative? – Glen_b Nov 18 '15 at 22:03
  • Sorry, I thought the standard null hypothesis for A/B testing was "no difference" (so H0 = no difference in conversion rate). "Declare a winner" means performance is higher or lower on the test version compared to the control and the p-value is < 0.01. – L Xandor Nov 19 '15 at 22:47
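One simple (if conservative) way to peek legitimately, along the lines of whuber's sequential-testing comments above, is to fix the number of looks in advance and test at level α/K at each look, a Bonferroni-style error-spending rule. A minimal sketch, assuming Bernoulli conversions and an ordinary two-proportion z-test; the counts and look schedule below are made up for illustration:

```python
import numpy as np
from scipy import stats

def sequential_ab_test(conv_a, conv_b, n_per_look, alpha=0.01):
    """conv_a, conv_b: cumulative conversion counts at each planned look.
    n_per_look: cumulative sample size per variation at each look."""
    k = len(n_per_look)          # number of planned looks
    threshold = alpha / k        # Bonferroni-adjusted level per look
    for ca, cb, n in zip(conv_a, conv_b, n_per_look):
        p_pool = (ca + cb) / (2 * n)
        se = np.sqrt(2 * p_pool * (1 - p_pool) / n)
        z = (ca / n - cb / n) / se
        p_val = 2 * stats.norm.sf(abs(z))
        if p_val < threshold:
            return f"stop at n = {n} per variation (p = {p_val:.4g})"
    return "no winner after all planned looks"

# hypothetical cumulative counts at four planned looks of 2,500 per variation
print(sequential_ab_test(conv_a=[140, 270, 410, 555],
                         conv_b=[110, 245, 370, 500],
                         n_per_look=[2500, 5000, 7500, 10000]))
```

Because each of the K looks is tested at α/K, the overall chance of ever declaring a false winner stays below α no matter when you stop. Proper group-sequential designs (Pocock or O'Brien–Fleming boundaries, alpha-spending, Wald's SPRT) spend the error budget more efficiently, but the principle is the same: the thresholds are set knowing that you will look more than once.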

1 Answer


This prompts a counter-question: if it is fine to stop early when the results look significant, why not also stop early when they look insignificant?

The answer, in both cases, is that you are falling prey to confirmation bias: you ignore any new evidence once what you already believe appears to have been confirmed. This renders your analysis heavily biased towards your initial viewpoint.

We have to be careful not to rely too much on statistical significance. After all, statistical significance just means "this result would be very strange if the null hypothesis were true".

The American Statistical Association recently published a statement critical of p-values that addresses this issue.
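To connect this to the question's second scenario (10K observations per variation, then another 10K only if the first look was inconclusive), here is a rough simulation under a true null; the conversion rate and the α = 0.05 level are placeholders:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
p_true = 0.05    # same conversion rate in both arms, so the null is true
alpha = 0.05     # placeholder significance level
n_sims = 20000

def p_value(ca, cb, n):
    # ordinary pooled two-proportion z-test
    p_pool = (ca + cb) / (2 * n)
    se = np.sqrt(2 * p_pool * (1 - p_pool) / n)
    z = (ca / n - cb / n) / se
    return 2 * stats.norm.sf(abs(z))

fp_single = fp_extended = 0
for _ in range(n_sims):
    a1, b1 = rng.binomial(10000, p_true), rng.binomial(10000, p_true)
    if p_value(a1, b1, 10000) < alpha:
        fp_single += 1
        fp_extended += 1          # the extended procedure would also stop here
    else:
        a2 = a1 + rng.binomial(10000, p_true)
        b2 = b1 + rng.binomial(10000, p_true)
        if p_value(a2, b2, 20000) < alpha:
            fp_extended += 1      # a second chance to cross the threshold

print("single fixed look:      ", fp_single / n_sims)    # close to alpha
print("extend if inconclusive: ", fp_extended / n_sims)  # larger than alpha
```

The second batch of data is not the problem in itself; what inflates the Type I error is that the decision to keep going was driven by the data you had already seen. If you want the option to extend, it has to be built into the design (and the thresholds) in advance, as in the sequential methods mentioned in the comments.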

Chris P