
On 25 February 2015, the journal Basic and Applied Social Psychology issued an editorial banning $p$-values and confidence intervals from all future papers.

Specifically, they say (formatting and emphasis are mine):

  • [...] prior to publication, authors will have to remove all vestiges of the NHSTP [null hypothesis significance testing procedure] ($p$-values, $t$-values, $F$-values, statements about ‘‘significant’’ differences or lack thereof, and so on).

  • Analogous to how the NHSTP fails to provide the probability of the null hypothesis, which is needed to provide a strong case for rejecting it, confidence intervals do not provide a strong case for concluding that the population parameter of interest is likely to be within the stated interval. Therefore, confidence intervals also are banned from BASP.

  • [...] with respect to Bayesian procedures, we reserve the right to make case-by-case judgments, and thus Bayesian procedures are neither required nor banned from BASP.

  • [...] Are any inferential statistical procedures required? -- No [...] However, BASP will require strong descriptive statistics, including effect sizes.

Let us not discuss problems with and misuse of $p$-values here; there are already plenty of excellent discussions on CV that can be found by browsing the p-value tag. The critique of $p$-values often goes together with advice to report confidence intervals for parameters of interest. For example, in this very well-argued answer @gung suggests reporting effect sizes with confidence intervals around them. But this journal bans confidence intervals as well.

What are the advantages and disadvantages of such an approach to presenting data and experimental results, as opposed to the "traditional" approach with $p$-values, confidence intervals, and the significant/insignificant dichotomy? The reaction to this ban seems to be mostly negative; so what are the disadvantages, then? The American Statistical Association has even posted a brief comment discouraging the ban, saying that "this policy may have its own negative consequences". What could these negative consequences be?

Or as @whuber suggested to put it, should this approach be advocated generally as a paradigm of quantitative research? And if not, why not?

PS. Note that my question is not about the ban itself; it is about the suggested approach. I am not asking about frequentist vs. Bayesian inference either. The Editorial is pretty negative about Bayesian methods too; so it is essentially about using statistics vs. not using statistics at all.


Other discussions: reddit, Gelman.

amoeba
  • There is a one-to-one mapping between p-values and confidence intervals in linear regression models, so I don't see a strong reason why banning p-values but keeping confidence intervals would make much sense. But banning both p-values and confidence intervals leaves a gap in description of results... I wonder if they allow reporting standard errors (that would be another measure of the same one-to-one mapping group). – Richard Hardy Feb 25 '15 at 19:07
  • To add to this question - what possible "strong descriptive statistics" would be available, aside from Bayesian ones? – Lubo Antonov Feb 25 '15 at 19:51
  • @RichardHardy Confidence intervals provide more information without the input of additional work. – Fomite Feb 25 '15 at 21:22
  • **Everything** could be misused so banning stuff on this condition is, well... strange. I am not the fan of p-values but this seems as a pretty naive approach to the problem. One thing is encouraging to use proper stuff, but banning things does not sound like a proper way to deal with the problem... – Tim Feb 25 '15 at 21:23
  • Great idea. Using statistics just hides the unscientific nature of this field. – Aksakal Feb 25 '15 at 21:35
  • It's a great topic. I'm a little bothered by the phrasing, though, because "wise" seems a little too vague and broad to fit our framework and might not admit a unique or definite answer. Could you perhaps modify it in a way that will make it clear what kind(s) of answers are required and indicate how to tell good answers from not so good ones? – whuber Feb 25 '15 at 21:50
  • @Lubo Many hundreds of descriptive statistics are available, depending on the nature of the data, and thousands of graphical representations. I think a discussion of those would be tangential to the question rather than adding to it. – whuber Feb 25 '15 at 21:52
  • @whuber: Thanks for a suggestion. To tell the truth, I was hesitating to post this question, because I was afraid that it might get frowned upon or even closed as "opinion-based". I have now added a paragraph specifying what sort of answer I am mainly looking for. – amoeba Feb 25 '15 at 22:09
  • Reminds me of an English teacher at junior school who banned us from using the word "nice". So it could do some good; but if so that doesn't say much for the quality of research in the field. – Scortchi - Reinstate Monica Feb 25 '15 at 22:46
  • @Scortchi: right, so NHST should by default be a mark against a paper approval (most certainly if it is applied incorrectly), but it shouldn't be banned outright. I mean, Gödel's incompleteness theorem leaves us in a position where *no statistical procedure will ever be valid* unless you accept at least one unprovable axiom somewhere. So banning something completely because it is somewhat flawed is just stupid. – naught101 Feb 26 '15 at 01:03
  • This seems like a complete overreaction to the frustration over the misuse of p values. I would be much happier with a ban on the misuse of p values rather than P values in general. – TrynnaDoStat Feb 26 '15 at 01:17
  • @Scortchi: this was a great remark. By the way, I would encourage you to write a full answer here; this question seems to become quite popular, and I think it would be useful for the community if it had some answers directly discussing possible drawbacks/benefits of such a policy (see my added last paragraph). You are one of the people here who I am sure could present some valuable arguments. – amoeba Feb 26 '15 at 10:28
  • Is this a wise decision? is the pivot here. For whom? In what way? At one level this is trivially easy: if you work in that field, the decision made public will help you decide whether to submit to that journal. While I am as negative as anyone here about abuses of inferential statistics, my wild guess is that this decision will, overall, **seriously weaken that journal's reputation**. Banning is a bad idea, as others have emphasised. For most researchers, the existence of a journal somewhere which is run strangely is of no consequence. – Nick Cox Feb 26 '15 at 11:37
  • @Nick, I don't care much about this journal, I was certainly asking about the policy itself. My question is whether it is better or worse to communicate the results of a study according to this suggested policy, as opposed to the traditional way with p-values and confidence intervals. What style of presentation would you, as a reader, prefer, and why? This is the real question. I hoped it would be obvious, but perhaps I should try to make it more clear. – amoeba Feb 26 '15 at 11:43
  • If the question were closer to Would it be a wise decision to ban P-values (etc.) in publications? I think it becomes much more elusive. As another analogy, we have scope to ask people not to smoke in our house (and in any case almost no-one we know smokes any way and the others know without asking that it would be unwelcome) but whether smoking should be banned is a much more tangled question. – Nick Cox Feb 26 '15 at 11:47
  • @Nick, I see. The question is not (at least not so much!) whether *the ban* is a good idea, it is about whether the suggested policy is a good idea. I edited the question to clarify (see the bottom paragraph). I still have to think how to edit the title so that it stays concise and reasonably catchy, while representing what I am really asking. Any suggestions? – amoeba Feb 26 '15 at 11:53
  • Sorry, no. My personal view is that while this decision raises numerous interesting and important questions, it's hard to see a focused question here suited to the style of this forum. I expect that some statistical blogs will pick this up. – Nick Cox Feb 26 '15 at 12:08
  • In light of the edits, I (reluctantly) have to agree with @NickCox. The question now asks explicitly for *opinions*. "What ... would you ... prefer?" and "would it do good for the scientific field?" are at once vague and speculative. Given the potential exposure this thread will get, we need to make sure it fits our framework and invites clear answers that readers can evaluate objectively. As evidence of that we have begun to get garbage answers by drive-by (zero-rep) readers, so I have also protected the thread and made it CW pending edits to improve the question. – whuber Feb 26 '15 at 17:52
  • The 4th item in your list suggests they're *not* requiring point estimates, which would be inference, but effect sizes reported merely as descriptive statistics. (Nevertheless, a few lines down in the editorial, "we encourage the use of larger sample sizes than is typical in much psychology research, because as the sample size increases, descriptive statistics become increasingly stable and sampling error is less of a problem". I look forward to the 2016 editorial's calling for research into formalizing this notion of stability & accounting quantitatively for the effects of sampling error.) – Scortchi - Reinstate Monica Feb 26 '15 at 18:55
  • The American Statistical Association has just posted an official comment on this at http://community.amstat.org/blogs/ronald-wasserstein/2015/02/26/asa-comment-on-a-journals-ban-on-null-hypothesis-statistical-testing. It ends, "The ASA encourages the editors of this journal ... not [to] discard the proper and appropriate use of statistical inference." – whuber Feb 26 '15 at 19:54
  • In the end this may well be positive. For starters, it gets some abuses out in the open (rather than the current practice for many journals of decrying problems in editorials and then allowing authors, referees and editors to go on sticking with the same ol' - some journals might hold better to their claimed ideals). Secondly, it's going to get people to actually *explain* what good statistical practice is, and why it's important. I'm glad I'm not trying to publish in that journal, but if it makes some researchers think a little harder about what they're doing, well, that may be a good thing. – Glen_b Feb 26 '15 at 22:51
  • Gelmans response: http://andrewgelman.com/2015/02/26/psych-journal-bans-significance-tests-stat-blogger-inundated-with-emails/ – captain_ahab Feb 27 '15 at 00:40
  • @amoeba: Sorry, but there's clearly a context to all this of which I'm almost completely ignorant; any answer I could offer - along the lines of "What's the use of Statistics?" - would be as relevant as the views of a metrologist from the *Bureau international des poids et mesures* on the *Quebra–Quilos* revolt. (I noticed an advert the other day for a book aimed at psychologists: *Understanding the New Statistics: Effect sizes, Confidence Intervals, & Meta-Analysis*. It was published in 2012.) – Scortchi - Reinstate Monica Mar 02 '15 at 22:19
  • @Aksakal, I caution you against generalizing the policies of one journal to all of psychology, lest you fall prey to the same errors in logic made by the editors of BASP. – Patrick S. Forscher May 04 '15 at 21:18
  • http://www.smithsonianmag.com/science-nature/scientists-replicated-100-psychology-studies-and-fewer-half-got-same-results-180956426/ researchers from uva couldn't reproduce most papers published in top psychology Journals recently. What a surprise. Cargo cult science, what do you expect from it – Aksakal Aug 28 '15 at 11:57
  • @Aksakal: yes, I saw this paper when skimming through yesterday's Science issue. In the context of this thread, however, I am not sure what the lesson should be; it seems to me that not using (banning) p-values is unlikely to make published results more trustworthy. – amoeba Aug 28 '15 at 12:35
  • @amoeba, this p-value debacle is a symptom of a larger issue in their field. They need to go back to a drawing board, and stop trying to look more science-y. There's a value in fields which are not quantitative, e.g. philosophy or theology – Aksakal Aug 28 '15 at 14:06
  • How long before this nonsense from JAMA Psychology is refuted? [Among men, low resting heart rate in late adolescence was associated with an increased risk for violent criminality, nonviolent criminality, exposure to assault, and unintentional injury in adulthood](http://archpsyc.jamanetwork.com/article.aspx?articleid=2436277) They have confidence intervals, of course. – Aksakal Sep 09 '15 at 19:08
  • Great analysis of the *consequences* of this ban in this very journal: http://daniellakens.blogspot.pt/2016/02/so-you-banned-p-values-hows-that.html. It would constitute an excellent answer here... – amoeba Dec 13 '16 at 18:28

4 Answers


The first sentence of the current 2015 editorial, to which the OP links, reads:

The Basic and Applied Social Psychology (BASP) 2014 Editorial *emphasized* that the null hypothesis significance testing procedure (NHSTP) is invalid...

(my emphasis)

In other words, for the editors it is an already proven scientific fact that "null hypothesis significance testing" is invalid; the 2014 editorial merely emphasized it, and the current 2015 editorial just implements this fact.

The misuse (even malicious misuse) of the NHSTP is indeed well discussed and documented. And it is not unheard of in human history that things get banned because, when all is said and done, they have been misused more than put to good use (but shouldn't we statistically test that?). Banning can be a "second-best" solution: cut off what on average (inferential statistics) has brought more losses than gains, and so predict (inferential statistics) that it will remain detrimental in the future.

But the zeal revealed in the wording of that first sentence makes this look like exactly that: a zealot's approach rather than a cool-headed decision to cut off the hand that tends to steal rather than give. If one reads the one-year-older editorial mentioned in the above quote (DOI:10.1080/01973533.2014.865505), one will see that this is only part of an overhaul of the journal's policies by a new Editor.

Scrolling down the editorial, they write

...On the contrary, we believe that the p<.05 bar is too easy to pass and sometimes serves as an excuse for lower quality research.

So it appears that their conclusion about their discipline is that null hypotheses are rejected too often, and so alleged findings may acquire spurious statistical significance. This is not the same argument as the "invalid" dictum in the first sentence.

So, to answer the question: it is obvious that for the editors of the journal their decision is not only wise but already overdue. They appear to think that they have cut out the part of statistics that has become harmful while keeping the beneficial parts; they don't seem to believe that anything here needs replacing with something "equivalent".

Epistemologically, this is an instance where scholars of a social science partially retreat from an attempt to make their discipline more objective in its methods and results through quantitative methods, because they have arrived at the conclusion (how?) that, in the end, the attempt created "more bad than good". I would say that this is a very important matter, one that is in principle possible to have happened, and one that would require years of work to demonstrate "beyond reasonable doubt" and really help the discipline. But just one or two editorials and published papers will most probably (inferential statistics) just ignite a civil war.

The final sentence of the 2015 editorial reads:

We hope and anticipate that banning the NHSTP will have the effect of increasing the quality of submitted manuscripts by liberating authors from the stultified structure of NHSTP thinking thereby eliminating an important obstacle to creative thinking. The NHSTP has dominated psychology for decades; we hope that by instituting the first NHSTP ban, we demonstrate that psychology does not need the crutch of the NHSTP, and that other journals follow suit.

Alecos Papadopoulos
  • Alecos, what do you mean by «it is an already proven scientific fact that "null hypothesis significance testing" is invalid»? – An old man in the sea. Feb 25 '15 at 21:33
  • @Anoldmaninthesea. That the way _they_ write the editorial implies that _they_ think that this is so. I added something to clarify. – Alecos Papadopoulos Feb 25 '15 at 21:35
  • Ok. Thanks. For a moment there I got worried... =D – An old man in the sea. Feb 25 '15 at 21:40
  • Yes...we have to be careful when writing tongue-in-cheek or sardonic replies on this site: they might be (completely) misunderstood! – whuber Feb 25 '15 at 21:53
  • Perhaps one should not shy away from naming this new editor: David Trafimow, and the paper his 2014 editorial (under paywall) cites to support the claim that "The null hypothesis significance testing procedure has been shown to be logically invalid" (sic!) is his own 2003 paper [that can be found here](http://homepage.psy.utexas.edu/HomePage/class/psy391p/trafimow.nhst.21003.pdf). I am not sure I have the guts to actually read it. – amoeba Feb 25 '15 at 23:11
  • @amoeba I deleted part of my draft answer that raised doubts regarding this scholar's level of understanding of statistics (after I did some research and read some of his papers), because I wanted to focus on the editorial text itself. Try this, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3957210/, and also in the 2015 editorial, re-read the part on Bayesian Statistics ("The usual problem...") – Alecos Papadopoulos Feb 26 '15 at 00:38
  • Maybe that first quote should have read *"...the null hypothesis significance testing procedure (NHSTP) **as it is commonly used in this and many other journals** is invalid..."*. – naught101 Feb 26 '15 at 00:57
  • @naught101 ...that wouldn't be very diplomatic. Notice that the way the NHSTP is condemned, it spares the psychologists themselves that they have used it in all these decades. If it was written the way you propose, it would look much more like a direct attack on their colleagues as scientists. As it now stands essentially the text implies that psychologists full of good intentions have been unfortunately misled in using the approach, by "someone", which misused his "power of scientific authority" in the matter... Perhaps by evil statisticians driven by scientific imperialism? – Alecos Papadopoulos Feb 26 '15 at 01:09
  • @naught101 Now, the text says "the tool is flawed". Under your proposal, the text would argue "the tool may be ok, but the way it was used was wrong". – Alecos Papadopoulos Feb 26 '15 at 01:11
  • A bad workman blames his tools. – naught101 Feb 26 '15 at 01:39
  • @naught101 I am under the impression that your assertion can be statistically tested as the alternative hypothesis -and then the null would be convincingly rejected. – Alecos Papadopoulos Feb 26 '15 at 01:43
  • What whuber said about sardoncism :P – naught101 Feb 26 '15 at 01:51
  • In trying to figure out if there is merit to what the editor is saying, I looked at the 2003 paper regarding "surprising insights". I am left confused by the first page, as he seems to construct a straw man of NHSTP with stating that if one rejects the null then the alternative is true and accepted; but I was taught that one cannot possibly accept the alternative, merely *assert* it, because it was not ever tested - the null is assumed and tested, and so all you can do is reject it or fail to. This seems fundamentally different...is my understanding deficient here, or is this all a bit 'off'? – BrianH Feb 26 '15 at 03:45
  • @BrianDHall I would suggest to look up more authoritative resources on the issues surrounding NHSTP (this site included), rather than the specific author's works on the issue. The matter is difficult and subtle -already from your comment one should discuss first the semantics around "accept" and "assert"... – Alecos Papadopoulos Feb 26 '15 at 03:51
  • Thank you - after looking some more into Bayesian statistics, I think I get at least a better idea of things, and I can see that the issue is indeed more difficult and subtle a quick shot editorial would imply. Now I think I understand your answer much better, too! – BrianH Feb 26 '15 at 05:03
  • *the attempt created "more good than bad"* - was this intended to be the reverse, more bad than good? Though I may have misread. – Silverfish Feb 26 '15 at 10:27
  • @Silverfish Thanks for reading my answer _that_ carefully! Fixed it. – Alecos Papadopoulos Feb 26 '15 at 11:24
  • @naught101: If you notice that the workman can't handle the chainsaw properly, you might not blame the tool. But you would still take it away from the workman, to prevent further harm ;-) – nikie Feb 26 '15 at 11:33
  • @nikie: And then you would offer them the option to use it again, if they first learn how to use it properly. – naught101 Feb 26 '15 at 11:42
  • Alecos, as you might have seen, I had to reformulate my question because it got closed as opinion-based. The main question is now about the suggested approach as such: namely, **what are the advantages and disadvantages of presenting data with plots and descriptive statistics *only***, as opposed to using p-values, significance tests, standard errors, and confidence intervals? See also the current formulation of my post. I like your answer (+1), but I will not be able to accept it unless you update it specifically addressing these issues. Would you consider such an update? – amoeba Feb 27 '15 at 10:22
  • @amoeba Mine was a bit of a meta-answer from the beginning, and naturally should not be considered for the green mark. There is no point in updating it, but if the other answers that I see start coming leave something to be said, I might contribute a second answer. – Alecos Papadopoulos Feb 27 '15 at 11:52

I feel that banning hypothesis tests is a great idea except for a select few "existence" hypotheses, e.g. testing the null hypothesis that there is no extra-sensory perception, where all one needs to demonstrate, to have evidence that ESP exists, is non-randomness. But I think the journal missed the point that the main driver of poor research in psychology is the use of a threshold on $P$-values. It has been demonstrated in psychology and most other fields that a good deal of gaming goes on to arrive at $P < 0.05$. This includes hypothesis substitution, removal of observations, and subsetting of data. It is thresholds that should be banned first.
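To make the threshold-gaming point concrete, here is a minimal simulation sketch (assuming plain normal data, a one-sample $t$-test, and a few arbitrary "subsetting" and "outlier removal" choices; none of this is taken from the editorial or from the answer above). When the analyst is allowed to report the smallest of several $p$-values, the nominal 5% false-positive rate is left far behind:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, n_sims, alpha = 40, 5_000, 0.05

honest = gamed = 0
for _ in range(n_sims):
    x = rng.normal(size=n)                      # pure noise: the null hypothesis is true
    p_full = stats.ttest_1samp(x, 0).pvalue
    # "gaming": also try two arbitrary subsets and an "outliers removed" version,
    # then keep whichever p-value happens to be smallest
    candidates = [x, x[: n // 2], x[n // 2:], np.sort(x)[4:-4]]
    p_best = min(stats.ttest_1samp(c, 0).pvalue for c in candidates)
    honest += p_full < alpha
    gamed += p_best < alpha

print(f"honest false-positive rate: {honest / n_sims:.3f}")   # close to 0.05
print(f"gamed  false-positive rate: {gamed / n_sims:.3f}")    # well above 0.05
```

The exact numbers depend on which subsets one happens to try, which is precisely the problem: the threshold invites a search.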

The banning of confidence intervals is also overboard, but not for the reasons others have stated. Confidence intervals are useful only if one misinterprets them as Bayesian credible intervals (for suitable non-informative priors). But they are still useful. The fact that their exact frequentist interpretation leads to nothing but confusion implies that we need to "get out of Dodge" and go Bayesian or join the likelihood school. But useful results can be obtained by misinterpreting good old confidence limits.

It is a shame that the editors of the journal misunderstood Bayesian statistics and don't know of the existence of pure likelihood inference. What they are seeking can be easily provided by Bayesian posterior distributions using slightly skeptical priors.
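To make the last two paragraphs concrete, here is a minimal sketch (assuming the textbook normal-mean model with known standard deviation; the data, prior mean, and prior standard deviations are illustrative choices, not Harrell's). With an essentially flat prior the 95% posterior interval coincides numerically with the 95% confidence interval, while a "slightly skeptical" prior centred at zero pulls the interval toward zero:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sigma = 1.0                                    # assume the data standard deviation is known
x = rng.normal(loc=0.4, scale=sigma, size=25)  # simulated data with a modest true effect
n, xbar = len(x), x.mean()
se = sigma / np.sqrt(n)

# frequentist 95% confidence interval for the mean
ci = stats.norm.interval(0.95, loc=xbar, scale=se)

def credible_interval(prior_sd, prior_mean=0.0):
    """Conjugate normal-normal update for the mean; returns the central 95% posterior interval."""
    post_var = 1.0 / (1.0 / prior_sd**2 + n / sigma**2)
    post_mean = post_var * (prior_mean / prior_sd**2 + n * xbar / sigma**2)
    return stats.norm.interval(0.95, loc=post_mean, scale=np.sqrt(post_var))

print("95% confidence interval:                ", np.round(ci, 3))
print("95% posterior, near-flat prior (sd=100):", np.round(credible_interval(100.0), 3))  # ~ the CI
print("95% posterior, skeptical prior (sd=0.2):", np.round(credible_interval(0.2), 3))    # shrunk toward 0
```

The agreement under a flat prior is only numerical; the interpretations differ, which is exactly the point of the answer.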

Frank Harrell
  • +1, thanks. Let me clarify regarding confidence intervals. Confidence intervals are related to standard errors, so the suggestion is probably to stop using those as well. Let's consider the simplest case: some value is measured across a group of $n$ subjects/objects; let's say the mean is 3. As far as I understand this journal suggests to report it simply as 3. But would you not want to see standard error as well, e.g. $3 \pm 0.5$? This of course means that 95% confidence interval is $3 \pm 1$, which also means that $p<0.05$, so it's all related (the mapping is spelled out after this comment thread). I am not sure how you suggest to report it. – amoeba Feb 27 '15 at 12:00
  • I think of standard errors as oversimplified (because they assume symmetric distributions) but useful measures of precision, like mean squared error. You can think of a precision interval based on root mean squared error without envisioning probability coverage. So I don't see where any of this discussion implies de-emphasis of standard errors. And I wasn't suggesting that we stop using CLs. But the difficulty with CLs comes mainly from attempts at probability interpretations. – Frank Harrell Feb 27 '15 at 12:48
  • Hmmm. Interesting. To me it seems like there is such a small step from standard error to CI (a constant factor!), that treating them differently would be weird. But perhaps it is a semantic point; I guess what you mean is that people *think* about standard errors and CIs differently and tend to get more confused about CIs. I wonder what this particular journal policy says about standard errors (the Editorial doesn't mention them explicitly). – amoeba Feb 27 '15 at 22:26
  • In symmetric situations, the standard error is a building block for a confidence interval. But in many cases the correct confidence interval is asymmetric so can't be based on a standard error at all. Some varieties of the bootstrap and back-transforming are two approaches of this type. Profile likelihood confidence intervals especially come to mind here. – Frank Harrell Feb 28 '15 at 12:11
  • @Frank Harrell - As for "pure likelihood inference" I agree that an emphasis toward summarization of the data's likelihood without embellishing it with thresholds appears to be the answer the editors were grasping for. A. W. F. Edwards' book "Likelihood" (1972) speaks directly to the editor's concern: "We may defer consideration of these arguments (e.g. significance testing) until later chapters, and pass immediately to the description of a procedure, based on Fisher's concept of Likelihood, which is open to none of these objections which may be levelled at significance tests." – John Mark Apr 20 '15 at 15:43
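To spell out the arithmetic behind this exchange (a standard normal, large-sample approximation, not anything specific to this journal's policy): for an estimate $\hat\theta$ with standard error $\mathrm{se}$, the 95% confidence interval and the two-sided $p$-value for $H_0\colon \theta = 0$ are two readouts of the same ratio $\hat\theta/\mathrm{se}$,
$$\hat\theta \pm 1.96\,\mathrm{se}, \qquad p = 2\,\Phi\!\left(-|\hat\theta|/\mathrm{se}\right),$$
so the interval excludes $0$ exactly when $p < 0.05$. In the numerical example above, $\hat\theta = 3$ and $\mathrm{se} = 0.5$ give $3 \pm 0.98 \approx 3 \pm 1$ and $p = 2\,\Phi(-6) \approx 2 \times 10^{-9}$, which is why $p$-values, confidence intervals, and standard errors carry essentially the same information in this simple setting.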

I see this approach as an attempt to address the inability of social psychology to replicate many previously published 'significant findings.'

Its disadvantages are:

  1. that it doesn't address many of the factors leading to spurious effects. E.g.,

    • A) People can still peek at their data and stop running their studies when an effect size strikes them as being sufficiently large to be of interest (see the sketch after this list).

    • B) Large effect sizes will still appear to have large power in retrospective assessments of power.

    • C) People will still fish for interesting and big effects (testing a bunch of hypotheses in an experiment and then reporting the one that popped up) or

    • D) pretend that an unexpected weird effect was expected all along.

    Shouldn't efforts be made to address these issues first?

  2. As a field going forward, it will make reviews of past findings pretty awful. There is no way to quantitatively assess the believability of different studies. If every journal implemented this approach, you'd have a bunch of social scientists saying there is evidence for X when it is totally unclear how believable X is, and scientists arguing about how to interpret a published effect or about whether it is important or worth talking about. Isn't this the point of having stats? To provide a consistent way to assess numbers. In my opinion, this new approach would cause a mess if it were widely implemented.

  3. This change does not encourage researchers to submit the results of studies with small effect sizes so it doesn't really address the file-drawer effect (or are they going to publish findings with large n's regardless of effect size?). If we published all results of carefully designed studies, then even though the believability of results of individual studies may be uncertain, meta-analyses and reviews of studies that supplied statistical analysis would do a much better job at identifying the truth.
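As an illustration of point 1.A, here is a minimal sketch assuming a one-sample design, batches of 10 subjects, and a stopping rule of "quit once Cohen's $d$ looks at least medium-sized"; the batch size and threshold are arbitrary choices for the demonstration, not anything from the editorial. Even with pure noise, a sizeable fraction of such studies stop early and report a "medium" effect, and no $p$-value is needed anywhere for the damage to occur:

```python
import numpy as np

rng = np.random.default_rng(2)
batch, max_n, d_stop = 10, 100, 0.5   # look after every 10 subjects; stop once |d| >= 0.5
n_sims = 5_000

stopped_early = 0
reported = []
for _ in range(n_sims):
    x = np.empty(0)
    while x.size < max_n:
        x = np.append(x, rng.normal(size=batch))  # the true effect is exactly zero
        d = x.mean() / x.std(ddof=1)              # one-sample Cohen's d, a purely descriptive effect size
        if abs(d) >= d_stop:                      # "looks big enough": stop and write it up
            stopped_early += 1
            break
    reported.append(abs(d))

print(f"noise-only studies stopped early on a 'medium' effect: {stopped_early / n_sims:.1%}")
print(f"average reported |d| (true effect is 0):               {np.mean(reported):.2f}")
```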

captain_ahab
  • @captain_ahab Regarding point 3, we must mention that the previous editorial (2014) of the Editor _explicitly_ encouraged the submission of "null-effect" studies. – Alecos Papadopoulos Feb 27 '15 at 11:48
  • I can't seem to find a comment in the editorial discussing any criteria for publication except for the need to have larger sample sizes than normal (how they are planning on identifying acceptable n's without inferential statistics is unclear to me). To me there is no emphasis in this editorial that they don't care what the effect size is. It seems to me that they will still be looking for interesting effects and interesting stories, which I think is the bigger problem in social science work (i.e., the post-hoc search for interesting effects and stories). – captain_ahab Feb 27 '15 at 19:34
  • What seems like a better solution is that all scientists must log the hypothesis, basic rationale, power, and analytic approach of a study in a PUBLIC place BEFORE running the study, and then be limited to publishing that study in the prescribed manner. If an unexpected interesting effect is found, they should publicly log it and then run a new study that examines that effect. This approach, while controlling for false positives, would also enable scientists to demonstrate their productivity without publishing new effects. – captain_ahab Feb 27 '15 at 19:39

I came across a wonderful quote that almost argues for the same point, but not quite -- since it is an opening paragraph in a textbook that is mostly about frequentist statistics and hypothesis testing.

It is widely held by non-statisticians, like the author, that if you do good experiments statistics are not necessary. They are quite right. [...] The snag, of course, is that doing good experiments is difficult. Most people need all the help they can get to prevent them making fools of themselves by claiming that their favourite theory is substantiated by observations that do nothing of the sort. And the main function of that section of statistics that deals with tests of significance is to prevent people making fools of themselves. From this point of view, the function of significance tests is to prevent people publishing experiments, not to encourage them. Ideally, indeed, significance tests should never appear in print, having been used, if at all, in the preliminary stages to detect inadequate experiments, so that the final experiments are so clear that no justification is needed.

-- David Colquhoun, Lectures on biostatistics, 1971

amoeba
  • Your post is really a comment, rather than an answer, so I am refraining from upvoting it, but I do wish to thank you for sharing the quotation. There are so many misunderstandings evident in this passage that it would take extensive effort (not to say space) to point out and debunk them all. In one word, though, the counter to these assertions is "efficiency." If everybody had unlimited time and budget we could at least aspire to perform "good experiments." But when resources are limited, it would be foolhardy (as well as costly) to conduct only "final, ... clear" experiments. – whuber Apr 08 '15 at 17:11
  • Thanks for your comment, @whuber; I agree with what you are saying. Still, I must add that I do find it appealing to say that ideally experimental data should be so convincing as to render formal hypothesis tests redundant. This is not an unattainable ideal! In my field (where p-values are used a lot), I find that the best papers *are* convincing without them: e.g. because they present a sequence of several experiments supporting each other, which taken together, *obviously* cannot be a statistical fluke. Re comment: it was too long for a comment, and I figured it's okay as a CW answer. – amoeba Apr 08 '15 at 17:26
  • Yes, I understand why it had to be posted as an answer, and therefore did not vote to move it into a comment (which would cut off the last part of the quote). I agree that the ideal is not unattainable *in particular cases*. I also agree it's a nice ideal to bear in mind. But as a guide to how to design experiments (which is, overall, a discipline of allocating resources), it could be a terrible mistake. (This is certainly debatable.) The suggestion that a "good" experiment would never require statistical methods is, however, one that does not stand up even to cursory examination. – whuber Apr 08 '15 at 18:52
  • Perhaps one way of reading that is as saying the initial significance test that suggested a substance stimulates a certain physiological response is no longer relevant by the time you're publishing your investigations into the effects of different kinds of inhibitors on the dose-response curve. – Scortchi - Reinstate Monica Apr 27 '16 at 09:16