45

I read a lot of evolutionary/ecological academic papers, sometimes with the specific aim of seeing how statistics are used 'in the real world' outside of the textbook. I normally take the statistics in papers as gospel and use the papers to help with my statistical learning. After all, if a paper has taken years to write and has gone through rigorous peer review, then surely the statistics are going to be rock solid? But in the past few days I've questioned that assumption, and wondered how often the statistical analysis published in academic papers is suspect. In particular, it might be expected that researchers in fields such as ecology and evolution have spent less time learning statistics and more time learning their own fields.

How often do people find suspect statistics in academic papers?

luciano
  • 8
    Of possible interest: [Nieuwenhuis et al. (2011), "Erroneous analyses of interactions in neuroscience: a problem of significance", *Nature Neuroscience*, **14**, 9.](http://sandernieuwenhuis.nl/pdfs/NieuwenhuisEtAl_NN_Perspective.pdf) – Scortchi - Reinstate Monica Apr 02 '14 at 08:37
  • 18
    Reviewers are often people who don't know much more about statistics than those writing the paper, so it can often be easy to publish poor statistics. – Behacad Apr 02 '14 at 10:17
  • 9
    Getting a paper published is the *first* step towards its acceptance by the scientific community, not the last. Most published papers will have significant flaws in some area, the use of statistics is no exception. – Dikran Marsupial Apr 03 '14 at 13:12
  • 3
    Your assumption that papers "take years to write" is way off the mark. Collecting data might take a long time but analyzing the data and writing up is typically weeks rather than years. – David Richerby Apr 03 '14 at 14:07
  • 2
    It is nowadays well known that the statistics in many psychology and medicine papers are questionable at the least, and quite often plainly wrong or not even that. The poor man's use of p-values and NHST is a prominent example of the problems; see [this note](http://www.nature.com/news/scientific-method-statistical-errors-1.14700). – Quartz Apr 09 '14 at 11:26

5 Answers

39

After all, if a paper has taken years to write and has gone through rigorous peer review, then surely the statistics are going to be rock solid?

My experience of reading papers that attempt to apply statistics across a wide variety of areas (political science, economics, psychology, medicine, biology, finance, actuarial science, accounting, optics, astronomy, and many, many others) is that the quality of the statistical analysis may be anywhere on the spectrum from excellent and well done to egregious nonsense. I have seen good analysis in every one of the areas I have mentioned, and pretty poorly done analysis in almost all of them.

Some journals are generally pretty good, and some can be more like playing darts with a blindfold - you might get most of them not too terribly far off the target, but there's going to be a few in the wall, the floor and the ceiling. And maybe the cat.

I don't plan on naming any culprits, but I will say I have seen academic careers built on faulty use of statistics (i.e. where the same mistakes and misunderstandings were repeated in paper after paper, over more than a decade).

So my advice is: let the reader beware; don't trust that the editors and peer reviewers know what they're doing. Over time you may get a good sense of which authors can generally be relied on not to do anything too shocking, and which ones should be treated especially warily. You may get a sense that some journals typically have very high standards for their stats.

But even a typically good author can make a mistake, or referees and editors can fail to pick up errors they might normally find; a typically good journal can publish a howler.

[Sometimes, you'll even see really bad papers win prizes or awards... which doesn't say much for the quality of the people judging the prize, either.]

I wouldn't like to guess at the fraction of "bad" stats I might have seen (in various guises, and at every stage from defining the question, design of the study, data collection, data management, ... right through to analysis and conclusions), but it's not nearly small enough for me to feel comfortable.

I could point to examples, but I don't think this is the right forum to do that. (It would be nice if there was a good forum for that, actually, but then again, it would likely become highly "politicized" quite quickly, and soon fail to serve its purpose.)

I've spent some time trawling through PLOS ONE ... and again, I'm not going to point at specific papers. Some things I noticed: a large proportion of papers have stats in them, probably more than half having hypothesis tests. The main dangers seem to be lots of tests, either with a high $\alpha$ like 0.05 on each (which is not automatically a problem as long as we understand that quite a few really tiny effects might show up as significant by chance), or an incredibly low individual significance level, which will tend to give low power. I also saw a number of cases where about half a dozen different tests were apparently applied to resolve exactly the same question, which strikes me as a generally bad idea. Overall the standard was pretty good across a few dozen papers, but in the past I have seen an absolutely terrible paper there.
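
As a rough sense of the scale of those first two dangers, here's a minimal simulation sketch (Python, with made-up sample sizes and effect sizes, not numbers taken from any of the papers I browsed):

```python
# A rough simulation (hypothetical numbers, not from any particular paper) of two dangers:
# many tests at alpha = 0.05 make spurious "significant" results very likely, while a very
# small per-test alpha leaves little power to detect a modest real effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, m, reps = 30, 50, 1000   # per-group sample size, tests per "paper", simulated papers

# 1) All null hypotheses true: how often is at least one of the m tests "significant"?
any_false_positive = np.mean([
    any(stats.ttest_ind(rng.normal(size=n), rng.normal(size=n)).pvalue < 0.05
        for _ in range(m))
    for _ in range(reps)
])

# 2) One modest real effect (d = 0.5): power at alpha = 0.05 vs. a Bonferroni-style alpha.
pvals = np.array([stats.ttest_ind(rng.normal(0.5, 1, n), rng.normal(0, 1, n)).pvalue
                  for _ in range(reps)])

print(f"P(at least one false positive among {m} tests): {any_false_positive:.2f}")  # ~0.92
print(f"Power at alpha = 0.05:     {np.mean(pvals < 0.05):.2f}")                    # ~0.5
print(f"Power at alpha = 0.05/{m}:  {np.mean(pvals < 0.05 / m):.2f}")               # far lower
```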

[Perhaps I could indulge in just one example, indirectly. This question asks about a paper doing something quite dubious. It's far from the worst thing I've seen.]

On the other hand, I also see (even more frequently) cases where people are forced to jump through all kinds of unnecessary hoops to get their analysis accepted; perfectly reasonable things to do are not accepted because there's a "right" way to do things according to a reviewer or an editor or a supervisor, or just in the unspoken culture of a particular area.

Glen_b
  • 2
    "*Caveat lector*", given the increasing number of open-access journals? – Scortchi - Reinstate Monica Apr 02 '14 at 09:41
  • 1
    @scortchi I decided to avoid the issue altogether by simply writing in English. It's an improvement. – Glen_b Apr 02 '14 at 10:08
  • 10
    Without naming specific culprits, I think http://faculty.vassar.edu/abbaird/about/publications/pdfs/bennett_salmon.pdf deserves a mention. To prove a point about misuse of statistics in their field, they used a widely used statistical protocol to analyse the results of an fMRI scan of a dead salmon. They found "statistically significant" brain activity. http://www.statisticsdonewrong.com also makes interesting reading. – James_pic Apr 02 '14 at 10:51
  • 1
    @James_pic, had to join to +1 that comment for the statisticsdonewrong link; the discussion of the base rate fallacy is particularly interesting. – Dan Bryant Apr 02 '14 at 13:34
  • @Scortchi do you have evidence that there is a causal relationship between open access and poor statistical practice, or are you just illustrating how one could put forth a conclusion without the proper statistical support? – DQdlM Apr 02 '14 at 13:54
  • 1
    @KennyPeanuts: Neither - just pointing out that nowadays many *lectores* aren't even indirectly *emptores*. – Scortchi - Reinstate Monica Apr 02 '14 at 14:19
  • 1
    @KennyPeanuts The intent of Scortchi's comment was more obvious when my post contained the phrase *caveat emptor*, and was simply pointing out that "reader" (*lector*) was more accurate than "buyer" (*emptor*). The comment was correct, and led me to consider whether I should not simply write the phrase in English to begin with. – Glen_b Apr 02 '14 at 20:28
  • 1
    @Scortchi - more open access journals seem to have dedicated statistical reviewers than regular journals, at least in my field. (Disclaimer: I'm a statistical reviewer for some open access journals; in theory I'm a statistical reviewer for some regular journals too, but they never ask me to do anything.) I think that, for example, PLOS One gets a statistical reviewer on every paper (that uses statistics). If they don't like the look of it, they'll ask the authors for the data and ask someone to reanalyze it - I've done that a couple of times too. – Jeremy Miles Apr 04 '14 at 00:40
  • 1
    @JeremyMiles: See Glen's comment above. "A lot of journals are open access nowadays; so don't have *buyers*, only *readers*" is all I was (pedantically) saying. I've really no opinion on how that might have affected anything in general - my only personal experience bearing on the matter is that I haven't noticed the slightest change in IMS journals since they went open access. – Scortchi - Reinstate Monica Apr 04 '14 at 10:34
  • 1
    I found this initiative a great leap forward in open discussion: [OpenReview](http://openreview.net/). – Quartz Apr 09 '14 at 11:31
16

I respect @Glen_b's stance on the right way to answer here (and certainly don't intend to detract from it), but I can't quite resist pointing to a particularly entertaining example that's close to my home. At the risk of politicizing things and doing this question's purpose a disservice, I recommend Wagenmakers, Wetzels, Borsboom, and Van der Maas (2011). I cited this in a related post on the Cognitive Sciences beta SE (How does cognitive science explain distant intentionality and brain function in recipients?), which considers another example of "a dart hitting the cat". Wagenmakers and colleagues' article comments directly on a real "howler" though: it was published in JPSP (one of the biggest journals in psychology) a few years ago. They also argue more generally in favor of Bayesian analysis and that:

In order to convince a skeptical audience of a controversial claim, one needs to conduct strictly confirmatory studies and analyze the results with statistical tests that are conservative rather than liberal.

I probably don't need to tell you that this didn't exactly come across as preaching to the choir. FWIW, there is a rebuttal as well (as there always seems to be between Bayesians and frequentists; Bem, Utts, & Johnson, 2011), but I get the feeling that it didn't exactly checkmate the debate.

Psychology as a scientific community has been on a bit of a replication kick recently, partly due to this and other high-profile methodological shortcomings. Other comments here point to cases similar to what were once known as voodoo correlations in social neuroscience (how's that for politically incorrect, BTW? The paper has since been retitled; Vul, Harris, Winkielman, & Pashler, 2009). That too attracted its rebuttal, which you can check out for more debate of highly debatable practices.

For even more edutainment at the (more depersonalized) expense of (pseudo)statisticians behaving badly, see our currently 8th-most-upvoted question here on CV with another (admittedly) politically incorrect title, "What are common statistical sins?" Its OP @MikeLawrence attributes his inspiration to his parallel study of psychology and statistics. It's one of my personal favorites, and its answers are very useful for avoiding the innumerable pitfalls out there yourself.


On the personal side, I've been spending much of my last five months here largely because it's amazingly difficult to get rock-solid statistics on certain data-analytic questions. Frankly, peer review is often not very rigorous at all, especially in terms of statistical scrutiny of research in younger sciences with complex questions and plenty of epistemic complications. Hence I've felt the need to take personal responsibility for polishing the methods in my own work.

While presenting my dissertation research, I got a sense of how important personal responsibility for statistical scrutiny is. Two exceptional psychologists at my alma mater interjected that I was committing one of the most basic sins in my interpretations of correlations. I'd thought myself above it, and had lectured undergrads about it several times already, but I still went there, and got called out on it (early on, thank heavens). I went there because research I was reviewing and replicating went there! Thus I ended up adding several sections to my dissertation that called out those other researchers for assuming causality from quasi-experimental longitudinal studies (sometimes even from cross-sectional correlations) and ignoring alternative explanations prematurely.

My dissertation was accepted without revisions by my committee, which included another exceptional psychometrician and the soon-to-be-president of SPSP (which publishes JPSP), but to be frank once more, I'm not bragging in saying this. I've since managed to poke several holes in my own methods despite passing the external review process with perfectly good reviewers. I've now fallen into the deep end of stats in trying to plug them with methods more appropriate for predictive modeling of Likert ratings like SEM, IRT, and nonparametric analysis (see Regression testing after dimension reduction). I'm opting voluntarily to spend years on a paper that I could probably just publish as-is instead... I think I even have a simulation study left to do before I can proceed conscientiously.

Yet I emphasize that this is optional – maybe even overzealous and a costly luxury amidst the publish-or-perish culture that often emphasizes quantity over quality in early-career work records. Misapplication of parametric models for continuous data to assumption-violating distributions of ordinal data is all too common in my field, as is the misinterpretation and misrepresentation of statistical significance (see Accommodating entrenched views of p-values). I could totally get away with it (in the short term)...and it's not even all that hard to do better than that. I suppose I have several recent years of amazing advances in R programs to thank for that though! Here's hoping the times are changing.
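
To make the ordinal-data point concrete, here is a toy sketch (the Likert-style data are invented purely for illustration, not taken from my dissertation) comparing a t-test, which treats 1–5 ratings as continuous and roughly normal, with a rank-based Mann–Whitney test on the same skewed samples:

```python
# A toy illustration (invented Likert-style data): a t-test that treats 1-5 ratings as
# continuous/normal vs. a rank-based test on the same skewed ordinal samples.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
levels = np.arange(1, 6)  # 1-5 Likert ratings

# Two hypothetical groups with skewed response distributions that differ mostly in the tail.
group_a = rng.choice(levels, size=100, p=[0.50, 0.25, 0.15, 0.07, 0.03])
group_b = rng.choice(levels, size=100, p=[0.35, 0.30, 0.20, 0.10, 0.05])

t_res = stats.ttest_ind(group_a, group_b)                              # interval-scale assumption
u_res = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")  # ordinal, rank-based

print(f"t-test:        t = {t_res.statistic:.2f}, p = {t_res.pvalue:.3f}")
print(f"Mann-Whitney:  U = {u_res.statistic:.0f}, p = {u_res.pvalue:.3f}")
```

Nothing deep is going on here; the point is just how cheaply the parametric assumption can be checked against an ordinal-friendly alternative.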


References
· Bem, D. J., Utts, J., & Johnson, W. O. (2011). Must psychologists change the way they analyze their data? Journal of Personality and Social Psychology, 101(4), 716–719. Retrieved from http://deanradin.com/evidence/Bem2011.pdf.
· Vul, E., Harris, C., Winkielman, P., & Pashler, H. (2009). Puzzlingly high correlations in fMRI studies of emotion, personality, and social cognition. Perspectives on Psychological Science, 4(3), 274–290. Retrieved from http://www.edvul.com/pdf/VulHarrisWinkielmanPashler-PPS-2009.pdf.
· Wagenmakers, E. J., Wetzels, R., Borsboom, D., & Van der Maas, H. (2011). Why psychologists must change the way they analyze their data: The case of psi. Journal of Personality and Social Psychology, 100, 426–432. Retrieved from http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/Bem6.pdf.

Nick Stauner
  • 1
    If you enjoyed "Feeling the Future", then you might like [Witztum et al. (1994), "Equidistant Letter Sequences in the Book of Genesis", *Statist. Sci.*, **9**, 3](http://projecteuclid.org/euclid.ss/1177010393). It attracted the inevitable scoffers & nay-sayers: [McKay et al. (1999), "Solving the Bible Code Puzzle", *Statist. Sci.*, **14**, 2](http://projecteuclid.org/euclid.ss/1009212243). – Scortchi - Reinstate Monica Apr 02 '14 at 15:56
  • @Scortchi: unlike the case with "Feeling the Future", the Equidistant Letter paper is co-authored (and largely inspired) by a famous mathematician [Eliyahu Rips](http://en.wikipedia.org/wiki/Eliyahu_Rips). Not that it automatically makes it correct, but it does make the case more interesting. – amoeba Apr 02 '14 at 16:17
  • @amoeba: Why should one think a famous mathematician's finding secret codes in the Bible a more interesting case than a famous psychologist's finding evidence for ESP? – Scortchi - Reinstate Monica Apr 02 '14 at 16:46
  • @Scortchi: I had no idea how famous Daryl Bem is (simply because it is not my field). Anyway, "famous" is a wrong term, what I really meant is "great" :) If he is indeed recognized in the psychological community as being a prominent researcher (as I know Rips is among mathematicians), then I withdraw my claim. – amoeba Apr 02 '14 at 16:53
  • 1
    @Scortchi: thanks for the reference, and amoeba: thanks for the context. I don't see the claim in Witztum et al. that McKay et al. scoff at in their abstract, but they sure point out a lot of other serious flaws. Good stuff. "Whereas real data may confound the expectations of scientists even when their hypotheses are correct, those whose experiments are systematically biased towards their expectations are less often disappointed (Rosenthal, 1976)." That's one of the guys who called me out on causal inference based on quasi-experiments... a truly great psychologist. Bem has some cred too though. – Nick Stauner Apr 02 '14 at 23:14
  • 2
    +1 Excellent post. "*how important personal responsibility for statistical scrutiny is*" -- I must applaud. Ultimately, this is where responsibility must lie, as onerous as that may be for someone already trying to get work done in an area of research to which they wish to apply statistics. – Glen_b Apr 03 '14 at 04:09
  • 1
    @NickStauner: McKay et al. say in their abstract that Witztum et al. claim "the Hebrew text of the Book of Genesis encodes events which did not occur until millennia after the text was written". Slight hyperbole perhaps, as it's just over two millennia at most between the writing of the Torah & the birth-date of the last rabbi from their list, but a fair enough summary. (I suppose you could also see the Witztum et al. paper as evidence for recent authorship of the Book of Genesis, though as far as I know no-one has done.) – Scortchi - Reinstate Monica Apr 03 '14 at 11:31
  • 1
    Yeah, I guess I couldn't understand Witztum et al. well enough to recognize that they were making that claim. For once I suppose I could be thankful for the authors' obtuse writing... It comes across as a little more interesting at face value because the most prominent claim is that the pattern is not due to chance, not what the pattern is supposedly due to in their opinion. It could've invited more interesting interpretations like yours had it not overreached as McKay et al. say it does... at least until McKay et al. shot them down on methodological grounds, leaving nothing worth interpreting. – Nick Stauner Apr 03 '14 at 11:38
  • 1
    luciano - given the quality answer here, with its breadth, copious links and references, it would be well worth accepting if you're of a mind to accept an answer to your question. – Glen_b Apr 27 '14 at 04:01
  • Pshaw. Mine is largely specific to psychology and personal experience. Yours is the one with the general perspective and all those upvotes! I'd be comfortable citing it if I were publishing a commentary / lit review on the OP's topic, but you'd be better qualified to do that and could surely bring many more references to it than me if you did (hint hint, if you haven't already). – Nick Stauner Apr 27 '14 at 04:07
5

I recall at university being asked, by a few final-year social science students on different occasions (one of them got a 1st), how to work out an average for a project that had a handful of data points. (So they were not having a problem with using software, just with the concept of how to do the maths with a calculator.)

They just gave me blank looks when I asked them what type of average they wanted.

Yet they all felt a need to put some stats in their reports, as it was the done thing – I expect they had all read 101 papers that had stats in them, without thinking about what the stats meant, if anything.

It is clear that the researcher who taught them over the 3 years did not care enough about the correctness of stats to distil any understanding into the students.

(I was a computer science student at the time. I am posting this as an answer because it is a bit long for a comment.)

Ian Ringrose
  • Students are a whole other barrel of monkeys, IMO. I wouldn't blame the teacher immediately for their lack of understanding without further evidence...but if it's as clear as you say that the teacher is to blame, I wouldn't be surprised either. – Nick Stauner Apr 03 '14 at 20:01
  • @NickStauner, I blame the teacher for not caring enough about stats; if they cared, there would be at least one question on each exam paper that needed some understanding of stats, at the level of “How to Lie with Statistics”. I don’t care if social science students know how to do the calculations, but they should know how not to be misled. – Ian Ringrose Apr 03 '14 at 21:35
  • Agreed that they *should* know, but there's no guaranteeing they'll get that question right! – Nick Stauner Apr 03 '14 at 21:38
  • @NickStauner, Yes, but you only get what you measure, so you will not get students who understand anything about stats unless you put it in the exams. – Ian Ringrose Apr 03 '14 at 22:04
  • Again, I tend to give teachers less credit for student outcomes. Plenty of students (okay, maybe not "plenty", but some) will care enough to learn for its own sake, and some will come to class already knowing much of the material. Forgive me if I interpret your comment too absolutely though; I would agree that it is often a necessary evil to force motivation to learn onto students, and that testing is a better way to learn than rote, repetitive studying/lecturing. – Nick Stauner Apr 03 '14 at 22:54
0

As a woefully incomplete list, I find statistics most frequently correct in 1) physics papers followed by 2) statistical papers and most miserable in 3) medical papers. The reasons for this are straightforward and have to do with the completeness of the requirements imposed upon the prototypical model in each field.

In physics papers, the equations and applied statistics have to respect balanced units, and physics has the most frequent occurrence of causal relationships and of testing against physical standards.

In statistics papers, units and causality are sometimes ignored, the assumptions are sometimes heuristic, and physical testing is too often ignored, but equality (or inequality), i.e., logic, is generally preserved along an inductive path, although that path cannot correct for unphysical assumptions.

In medical papers, units are typically ignored, and the equations and assumptions are typically heuristic, often untested, and frequently spurious.

Naturally, a field like statistical mechanics is more likely to have testable assumptions than, let us say, economics, and that does not reflect on the talents of the prospective authors in those fields. It is more related to how much of what is being done is actually testable, and how much testing has been done historically in each field.

Carl
-7

Any paper that disproves the nil null hypothesis is using worthless statistics (the vast majority of what I have seen). This process can provide no information not already provided by the effect size. Further, it tells us nothing about whether a significant result is actually due to the cause theorized by the researcher. That requires thoughtful investigation of the data for evidence of confounds. Most often, if present, the strongest of this evidence is even thrown away as "outliers".
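
For concreteness, here is a minimal sketch (simulated data, not from any real study) of the kind of test I mean: a comparison against the nil null hypothesis of exactly zero difference, reported next to the effect size estimate that the test statistic is built from anyway:

```python
# A sketch of the "nil null" test under discussion (simulated data, not a real study):
# test H0: the population difference is exactly zero, next to the effect size estimate.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
treatment = rng.normal(loc=0.3, scale=1.0, size=40)  # hypothetical small treatment effect
control = rng.normal(loc=0.0, scale=1.0, size=40)

res = stats.ttest_ind(treatment, control)

# Cohen's d: the standardized mean difference (the "effect size" referred to above).
pooled_sd = np.sqrt((treatment.var(ddof=1) + control.var(ddof=1)) / 2)
cohens_d = (treatment.mean() - control.mean()) / pooled_sd

print(f"Estimated effect size (Cohen's d): {cohens_d:.2f}")
print(f"p-value against the nil null (difference exactly 0): {res.pvalue:.3f}")
```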

I am not so familiar with evolution/ecology, but in the case of psych and medical research I would call the level of statistical understanding "severely confused" and "an obstacle to scientific progress". People are supposed to be disproving something predicted by their theory, not the opposite of it (zero difference/effect).

There have been thousands of papers written on this topic. Look up NHST hybrid controversy.

Edit: And I do mean that the nil null hypothesis significance test has a maximum of zero scientific value. This person hits the nail on the head:

http://www.johnmyleswhite.com/notebook/2012/05/18/criticism-4-of-nhst-no-mechanism-for-producing-substantive-cumulative-knowledge/

Also: Paul Meehl. 1967. Theory Testing in Psychology and Physics: A Methodological Paradox

Edit 3:

If someone has arguments in favor of the usefulness of strawman NHST that do not require thinking "reject the hypothesis that the rate of warming is the same, but DO NOT take this to imply that the rate of warming is not the same" is a rational statement, I would welcome your comments.

Edit 4:

What did Fisher mean by the following quote? Does it suggest that he thought "If model/theory A is incompatible with the data, we can say A is false, but nothing about whether not A is true"?

"it is certain that the interest of statistical tests for scientific workers depends entirely from their use in rejecting hypotheses which are thereby judged to be incompatible with the observations."

...

It would, therefore, add greatly to the clarity with which the tests of significance are regarded if it were generally understood that tests of significance, when used accurately, are capable of rejecting or invalidating hypotheses, in so far as these are contradicted by the data; but that they are never capable of establishing them as certainly true

Karl Pearson and R. A. Fisher on Statistical Tests: A 1935 Exchange from Nature

Is it that he assumed people would only try to invalidate plausible hypotheses rather than strawmen? Or am I wrong?

Livid
  • Unfortunately this is true. If you disagree please comment explaining why. – Livid Apr 02 '14 at 17:02
  • 1
    Disagree. http://simplystatistics.org/2014/02/14/on-the-scalability-of-statistical-procedures-why-the-p-value-bashers-just-dont-get-it/ – Glen Apr 02 '14 at 20:13
  • @Glen Nothing there addresses my complaint or those in the links I provided. Please make a distinction between using NHST to disprove a prediction made by theory (This is OK) and the strawman nil null hypothesis. – Livid Apr 02 '14 at 21:11
  • 7
    "This process can provide no information not already provided by the effect size." this is incorrect, the p-value provides some information about how unusual this effect size would be under the null hypothesis, thus it provides an element of calibration of effect size. Don't misunderstand me, I think Bayes factors are more useful, but it is hyperbole to say that the p-value is a worthless statistic. – Dikran Marsupial Apr 03 '14 at 13:09
  • @DikranMarsupial This is not useful information: "how unusual this effect size would be under the null hypothesis". Keep in mind specifically what null hypotheses I am talking about. – Livid Apr 03 '14 at 13:42
  • @DikranMarsupial I am open to convincing. Explain how knowing how unusual a given effect size would be if there was exactly no difference between a treatment and control group can lead to a quantitative theory, when simply knowing the (estimate of the) effect size would not. – Livid Apr 03 '14 at 14:14
  • 1
    I disagree; for all its faults, the NHST provides an element of self-skepticism that is useful in science. If you only look at uncalibrated effect size, how small does the effect have to be before it is no longer worth mentioning? I do a fair amount of blogging on climate science and there is plenty of argument based on effect sizes that do not justify the strength of argument being made; removing the low hurdle of the NHST would be a retrograde step. The p-value shows that it is at least large enough to effectively rule out an alternative that you don't want to be true. – Dikran Marsupial Apr 03 '14 at 14:27
  • 1
    "People are supposed to be disproving something predicted by their theory, not the opposite of it (zero difference/effect).". This suggests that perhaps you are thinking of confidence intervals, rather than effect sizes. In that case you could determine whether the observations falsify the hypothesis with a CI, but not with just an effect size. However, what do you then do if the observations lie within the CIs of H0 and H1? – Dikran Marsupial Apr 03 '14 at 14:37
  • @DikranMarsupial In the absence of a theory that can predict something beyond directional effects, I prefer to simply describe the data and look for patterns (ie exploratory). I am using "effect size" as shorthand for these patterns. I find that all patterns I (and others) notice are worth mentioning, while readily acknowledging that they may be due to "noise". I find CIs (and all other summary statistics) in the absence of the data itself to be inadequate towards my goal of working towards quantitative theories. – Livid Apr 03 '14 at 14:51
  • 1
    If you are going to use terminology in a non-standard way, it will not be surprising if you are misunderstood. It seems to me that your criticism of p-values as telling us "nothing about whether a significant result is actually due to the cause theorized by the researcher." is somewhat unfair if the methods you use, by your own admission, cannot establish that the effects are not due to noise (which is what the NHST does, the null hypothesis generally characterises what can be explained by noise, so if it can be rejected you have made that distinction). – Dikran Marsupial Apr 03 '14 at 15:15
  • 3
    "I find that all patterns I (and others) notice are worth mentioning" this is exactly the problem that arises in the discussion of climate on blogs, the human eye is very good at seeing patterns in data that turn out to be just noise, and it does the signal-to-noise ratio in the debate no good at all not to have some hurdle for an idea to get over before posting it on a blog! It is one area of science where the statistics are often very poor. – Dikran Marsupial Apr 03 '14 at 15:27
  • @DikranMarsupial "Noise" is not a property of only the data, it is also determined by the information available to the researcher. Strawman nil NHST cannot "characterize what can be explained by noise" in any objective sense. Just because you reject something is noise does not mean I will. But this is not my point. Even if we know for sure an effect is not due to noise (and even moreso, the direction of the effect), it is still less useful than knowing the direction AND estimated magnitude of the effect (which we must know to calculate the p-value anyway). It is extraneous. – Livid Apr 03 '14 at 16:34
  • 1
    ""Noise" is not a property of only the data, it is also determined by the information available to the researcher" no, this is utterly incorrect. Our interpretation of the observation depends on the information available to the researcher, but the noise is defined by the data generating process, which is independent of our beliefs regarding its true nature. I think I will leave the discussion there. – Dikran Marsupial Apr 03 '14 at 16:40
  • @DikranMarsupial Whether you characterize something as noise or not depends on the significance level. Is the significance level determined by the data or the researcher? – Livid Apr 03 '14 at 16:42
  • 1
    Failing to reject the null hypothesis **does not** mean that you accept the null hypothesis (or that you reject H1 as hypothesis tests are not symmetric). You might interpret a failure to reject the null hypothesis as meaning that the variation can be plausibly interpreted as an artefact of the noise process, but that is not at all the same as characterising the variation as being noise. This is a common misconception about the NHST, but it is still a misconception. – Dikran Marsupial Apr 03 '14 at 16:51
  • @DikranMarsupial I am aware of that misconception. I am talking about the case where you "reject" the strawman nil null. You may require only p<0.05 to reject the "noise" hypothesis, while I require p<0.01. It is clear that the characterization of noise is determined by the researcher/audience. None of this is the point, however; the point is that even under ideal circumstances where no one could disagree (food keeps animals alive longer), the mere direction is less helpful than also knowing the magnitude (how much longer). – Livid Apr 03 '14 at 16:58
  • 1
    The Bayes framework is no better, the interpretation of the Bayes factor is equally subjective. The choice of critical value (as Fisher rightly pointed out) depends on the nature of the problem, and for good reason. In a way, it partially captures the information in P(H1) and P(H0). This is not specific to the NHST, although an unthinking use of 0.05 because it is a "tradition" (i.e. the null ritual) is not a good thing. – Dikran Marsupial Apr 03 '14 at 17:03
  • @DikranMarsupial The replacement I argue for is exploratory data analysis until someone comes up with theories that make predictions. I also argue, I think in agreement with those in the links provided, any process that compares results to a strawman contributes a maximum of nothing to science. But perhaps I have not thought of some edge cases. – Livid Apr 03 '14 at 17:12
  • One last try, if the observations don't even rule out a straw man H0, what does that say about our H1? The fact that the NHST is sometimes a low hurdle means that it is still of value in demonstrating the paucity of hypotheses promulgated (e.g. the existence of a pause in global warming since some cherry picked start date), even though the appropriate null hypothesis (no change in the rate of warming) cannot be rejected. If we were all happy with exploratory data analysis, what would there be to prevent cases of over-confidence in H1 such as that? Exploratory analysis is not sufficient. – Dikran Marsupial Apr 03 '14 at 17:23
  • 1
    @DikranMarsupial "the appropriate null hypothesis (no change in the rate of warming)" I would ask what theory you have which predicts that the rate of warming should remain exactly the same, how successful this theory is at explaining historical data, and what future observations it predicts. In reality I suspect that is a useless strawman hypothesis. To avoid a long comment, here is more Meehl: [Psychological Inquiry 1990, Vol. 1, No. 2, 108-141](http://rhowell.ba.ttu.edu/meehl1.pdf) – Livid Apr 03 '14 at 17:37
  • 1
    @Livid, there is no theory that suggests it should stay *exactly* the same, the point is that if the skeptics can't even reject the straw man H0, why should we expect a more realistic H0 to be rejected? Thus even the NHST with a "straw man" H0 demonstrates that the skeptics are overstating the evidence provided by the observations. – Dikran Marsupial Apr 03 '14 at 17:44
  • @DikranMarsupial That last comment makes little sense to me. If the "more realistic" H0 predicts a greater rate of warming than exactly zero I would obviously expect it to be more easily "rejected" by data consisting of near zero rate of change. Anyway, your argument seems to be that strawman nil NHST performs a socially desirable function by setting up arbitrary obstacles for people to overcome before committing affirming the consequent fallacies. Is this accurate? – Livid Apr 03 '14 at 17:56
  • 1
    @livid the test is to see if there has been a **change** in the rate of warming. The whole point of statistical hypothesis testing is that it performs a useful function in science and statistics. The interpretation of the p-value is subjective, but that does not mean that it is not useful. – Dikran Marsupial Apr 03 '14 at 18:03
  • @DikranMarsupial "The whole point of statistical hypothesis testing is that is performs a useful function in science and statistics." I disagree that it is even possible for this to be the case **when the null hypothesis is a strawman**. There is a proof of this in the first link I provided in my post. Please comment on the faults of that argument rather than repeating the assertion I explicitly disagreed with. – Livid Apr 03 '14 at 18:10
  • 1
    I usually don't downvote but imo you are misrepresenting things: "Any paper ... disproves ...nil null hypothesis ... using worthless statistics" You start with the wrong assertion that hypotheses get "proved". "This process can provide no information not already provided by the effect size." This is wrong and Dikran pointed that out. "Further it tells us nothing about whether a significant result is actually due to the cause theorized by the researcher". True, but that applies to any and all statistics. "Most often ... evidence is even thrown away as "outliers"." Any evidence to back that up? – Momo Apr 03 '14 at 23:25
  • "I would call the level of statistical understanding "severely confused" and "an obstacle to scientific progress"". The person you cite is a psychologist... "Your People are supposed to be disproving something predicted by their theory, not the opposite of it (zero difference/effect)." Apart from the prove thing again, I think what you suggest is problematic in philosophy of science grounds. See http://en.wikipedia.org/wiki/Empiricism – Momo Apr 03 '14 at 23:30
  • @Momo Thank you for explaining the basis for your opinion. Nowhere do I state hypotheses can be proved. I am not convinced by Dikran's argument here that further info is provided by rejecting the null than is already present in the effect size, and I repeat my emphasis **when the null hypothesis is a strawman**. Regarding asserting cause to a significant result, yes, that is why the null needs to be predicted by theory for it to be useful (If p then q, not q then not p, is valid). Read the links. Regarding evidence thrown away by outliers, perhaps hyperbole, but that is my personal experience. – Livid Apr 03 '14 at 23:31
  • @Momo Also I am well aware of the Duhem/Quine problem with falsification as well. If a theory is "disproved" by the evidence, this is not convincing without "thoughtful investigation of the data for evidence of confounds". You can find that comment in the original post. – Livid Apr 03 '14 at 23:51
  • 2
    Livid, I gave you a concrete example of where performing an appropriate NHST with a "straw man" H0 would be beneficial to the discussion of a scientific topic. That provides a clear counterexample that demonstrates your view to be incorrect - NHSTs, as flawed as they are, *do* nevertheless perform a useful function in science and statistics. Now if you can demonstrate that my counterexample is incorrect, that may go some way towards resolving the issue. – Dikran Marsupial Apr 04 '14 at 08:18
  • @DikranMarsupial What example??? The rate of warming? That null hypothesis is the epitome of what I am talking about. Disproving it serves no purpose. – Livid Apr 04 '14 at 14:10
  • 1
    Livid, perhaps you need to read what is written more carefully, the skeptics are claiming that there has been a pause in warming, i.e. a change in the rate of warming. Had they bothered to perform the basic NHST, with H0 being that the rate of warming has remained unchanged, they would see that the evidence supporting their hypothesis is not as strong as they think, and the discussion of climate change would benefit from the prevention of over-claiming the significance of the observations. That is the purpose of the NHST, it is a basic sanity check. – Dikran Marsupial Apr 04 '14 at 15:50
  • @DikranMarsupial Can I get an answer to this earlier question? If I am mischaracterizing your argument please explain where I have gone wrong because this is what I am reading: "your argument seems to be that strawman nil NHST performs a socially desirable function by setting up arbitrary obstacles for people to overcome before committing affirming the consequent fallacies. Is this accurate?" – Livid Apr 04 '14 at 17:21
  • 2
    @Livid, NHST performs a scientifically and statistically, not socially desirable function (though not optimally) and it doesn't set an arbitrary obstacle, the hurdle is generally defined by its opposition to H1 and it doesn't involve committing "affirming the consequent fallacies" as rejecting H0 does not imply that H1 is true. So no, it isn't accurate. – Dikran Marsupial Apr 04 '14 at 17:59
  • @DikranMarsupial "skeptics are claiming that there has been a pause in warming". So one day (we need to wait until sample size is large enough or we decide to move our significance level) we get to say "the warming rate is not exactly the same", therefore... what? Does this tell us whether there is a "pause"? If there is a pause the warming rate would change, the warming rate has changed, therefore there is a pause. This is textbook affirming the consequent, what other use for this could there be? – Livid Apr 04 '14 at 18:37
  • 3
    You are missing the point. If you have a low hurdle, then nobody is surprised if you can negotiate it successfully. However, if you have a low hurdle but you still can't get over it, *that* tells you something. As I have repeatedly said, rejecting the null does not imply that H1 is true, so rejecting H0 doesn't mean that there definitely is a pause, and it doesn't tell you why there has been a pause. But if you can't get over the hurdle of being able to reject H0, it suggests that perhaps there is insufficient evidence to assert H1 as fact (which is what is happening in this instance). – Dikran Marsupial Apr 04 '14 at 18:51
  • @DikranMarsupial "rejecting the null does not imply that H1 is true". Either the rate of warming is the same or its not the same. We reject the hypothesis that the rate of warming is the same, but DO NOT take this to imply that the rate of warming is the not same. LOL.This is why strawman NHST has less than the max of zero value, it has negative value for science. – Livid Apr 04 '14 at 19:07
  • 2
    Consider the example given in this question http://stats.stackexchange.com/questions/43339/whats-wrong-with-xkcds-frequentists-vs-bayesians-comic and you might see why rejecting the null doesn't mean that H1 is true. The "LOL" following as it does, rhetorical misrepresentation of what I have written, suggests that you are impervious to counterargument, so I will leave it there I think. – Dikran Marsupial Apr 04 '14 at 19:12
  • @DikranMarsupial 1) You may need to be made aware of the history of the NHST hybrid that you are describing (EF Lindquist, a non-statistician, accidentally combined Fisher and Neyman-Pearson into one approach while writing an introductory textbook). 2) You are confusing the statistical hypothesis and research hypothesis. 3) It is not possible for any sane person to "reject the hypothesis that the rate of warming is the same, but DO NOT take this to imply that the rate of warming is the not same". This is not even the complaint I had anymore, you are severely confused. – Livid Apr 04 '14 at 19:26
  • 1
    'It is not possible for any sane person to "reject the hypothesis that the rate of warming is the same, but DO NOT take this to imply that the rate of warming is not the same"' - obviously you didn't see the example then, which is a pity. As it happens, to conclude that H1 is true, you need to know p(X|H1), p(H1) and p(H0), as well as p(X|H0), none of which are involved in the NHST. Thus any sane Bayesian could refuse to accept H1 simply because H0 was rejected by the NHST. A moment's consideration of the XKCD example would have told you that. – Dikran Marsupial Apr 04 '14 at 19:33
  • @DikranMarsupial No, I am not talking about the fallacy of the transposed conditional. That is not the problem here. We have gone well astray from discussing the actual problems with strawman NHST and are now discussing your confusion which is not my purpose here. – Livid Apr 04 '14 at 19:39
  • @Livid Have a look at my paper that explores the evidential properties of P-values: arxiv.org/abs/1311.0081 It shows how P-values are actually useful when correctly interpreted and when the "hypothesis" axis is properly viewed as the set of values that can be taken by the parameter of interest. – Michael Lew Apr 18 '14 at 21:27