102

I'm very new to statistics, and I'm just learning to understand the basics, including $p$-values. But there is a huge question mark in my mind right now, and I kind of hope my understanding is wrong. Here's my thought process:

Aren't all researchers around the world somewhat like the monkeys in the "infinite monkey theorem"? Consider that there are 23,887 universities in the world. If each university has 1,000 students, that's about 23 million students each year.

Let's say that each year, each student does at least one piece of research, using hypothesis testing with $\alpha=0.05$.

Doesn't that mean that even if all the research samples were pulled from a random population, about 5% of them would "reject the null hypothesis as invalid"? Wow. Think about that. That's about a million research papers per year getting published due to "significant" results.

If this is how it works, this is scary. It means that a lot of the "scientific truth" we take for granted is based on pure randomness.

A simple chunk of R code seems to support my understanding:

library(data.table)
# 100,000 "studies", each a one-sample t-test of H0: mu = 0 on n = 10 draws from N(0, 1),
# so the null hypothesis is true in every single one of them.
dt <- data.table(p = sapply(1:100000, function(x) t.test(rnorm(10, 0, 1))$p.value))
# And yet roughly 5% of them come out "significant" at alpha = 0.05:
dt[p < 0.05, ]

So does this article on successful $p$-fishing: I Fooled Millions Into Thinking Chocolate Helps Weight Loss. Here's How.

Is this really all there is to it? Is this how "science" is supposed to work?

amoeba
n_mu_sigma
  • 31
    The true problem is potentially far worse than multiplying the number of true nulls by the significance level, due to pressure to find significance (if an important journal won't publish non-significant results, or a referee will reject a paper that doesn't have significant results, there's pressure to find a way to achieve significance ... and we do see 'significance hunting' expeditions in many questions here); this can lead to true significance levels that are quite a lot higher than they appear to be. – Glen_b Jul 19 '15 at 10:58
  • 5
    On the other hand, many null hypotheses are point nulls, and those are very rarely actually true. – Glen_b Jul 19 '15 at 11:35
  • 39
    Please do not conflate the scientific method with p-values. Among other things, science insists on *reproducibility*. That is how papers on, say, [cold fusion](https://en.wikipedia.org/wiki/Cold_fusion) could get published (in 1989) but cold fusion has not existed as a tenable scientific theory for the last quarter century. Note, too, that few scientists are interested in working in areas where the relevant null hypothesis actually is *true*. Thus, your hypothesis that "all the research samples were pulled from [a] random population" does not reflect anything realistic. – whuber Jul 19 '15 at 13:22
  • 13
    Compulsory reference to the [xkcd jelly beans cartoon](https://xkcd.com/882/). Short answer - this is unfortunately happening too often, and some journals are now insisting on having a statistician reviewing every publication to reduce the amount of "significant" research that makes its way into the public domain. Lots of relevant answers and comments [in this earlier discussion](http://stats.stackexchange.com/q/100151/45797) – Floris Jul 19 '15 at 16:54
  • 2
    I would like to point out that, while many of the answers give important corrections to the poster's idea (about the scientific process), his understanding is basically correct. It *is* the logic of null hypothesis tests to control the probability of false positives, and there's nothing wrong with considering this with respect to the total number of studies performed in a given period, leading to an expected number of false positives. This is why imho the whole business of "multiple comparisons correction" is unprincipled, because it is not specified what the relevant unit is. – A. Donda Jul 19 '15 at 20:29
  • And of course one can say, many if not all of these null hypotheses are false anyway – but then why test them? – A. Donda Jul 19 '15 at 20:32
  • 2
    What about the other 19 million papers? – Alecos Papadopoulos Jul 19 '15 at 23:52
  • 8
    Perhaps I don't get the complaint... "We successfully defeated 95% of bogus hypotheses. The remaining 5% were not so easy to defeat due to random fluctuations looking like meaningful effects. We should look at those more closely and ignore the other 95%." This sounds exactly like the right sort of behaviour for anything like "science". – Eric Towers Jul 20 '15 at 16:55
  • 1
    @whuber, just in case anyone interested, [here](https://www.amherst.edu/media/view/141864/original/FLEISCHMANN1989-1.pdf)'s the paper on cold fusion. It doesn't have any statistics whatsoever, no p-values. – Aksakal Jul 21 '15 at 18:41
  • Required reading http://library.mpib-berlin.mpg.de/ft/gg/GG_Null_2004.pdf 'The Null Ritual What You Always Wanted to Know About Significance Testing but Were Afraid to Ask', Gerd Gigerenzer, Stefan Krauss, and Oliver Vitouch – Dale M Jul 23 '15 at 04:00
  • 1
    @Dale: This paper by Gigerenzer (as well as many others by him on the same topic) I find incredibly annoying because he just goes on and on about how combining Fisher and Neyman-Pearson into one "hybrid" leads to an "incoherent mishmash" and hammers it in with his favourite Freudian analogy, but it remains not clear at all why this should be so incoherent. I once asked [a question about that](http://stats.stackexchange.com/questions/112769) and nobody could convince me. The significance testing "ritual" might have its flaws, but Gigerenzer's anti-"ritual" ritual is *at least* as annoying. – amoeba Jul 23 '15 at 11:18
  • 1
    The OP implicitly assumes all/most scientific publications are based on a significance test and associated p-value, which is *completely* incorrect. In areas such as experimental particle physics where statistical tests are important, you will see that they compute confidence intervals and do not base their conclusions on p-values. In other areas, there are no such statistical tests or the statistical tests are not of central importance. – Pete Jul 23 '15 at 17:12
  • The p value calculates P(O|H), ie the probability of the observation given the hypothesis. The real question you'd like to answer is P(H|O), the probability of the hypothesis given the observation. The latter CANNOT be calculated solely from the former! You must use Bayes' theorem, which requires an estimate of the probability of the hypothesis and the probability of the observation. Without these, the p value indicates nothing, so you are correct in questioning its importance in science. – Aleksandr Dubinsky Jul 24 '15 at 03:53
  • maybe this may help: http://stats.stackexchange.com/questions/166323/misunderstanding-a-p-value/166327#166327 –  Sep 06 '15 at 15:35

9 Answers

72

This is certainly a valid concern, but this isn't quite right.

If 1,000,000 studies are done and all the null hypotheses are true, then approximately 50,000 will have significant results at p < 0.05; that's what a p-value means (the short sketch after the list below makes this arithmetic concrete). However, the null is essentially never strictly true. But even if we loosen it to "almost true" or "about right" or some such, that would mean that the 1,000,000 studies would all have to be about things like

  • The relationship between social security number and IQ
  • Is the length of your toes related to the state of your birth?

and so on. Nonsense.
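To make the arithmetic in the first sentence concrete, here is a minimal R sketch (my own illustration, assuming 1,000,000 independent tests of true nulls):

n_studies <- 1e6
alpha <- 0.05
n_studies * alpha                                        # 50,000 false positives expected
qbinom(c(0.025, 0.975), size = n_studies, prob = alpha)  # and the realized count rarely strays far from that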

One trouble is, of course, that we don't know which nulls are true. Another problem is the one @Glen_b mentioned in his comment - the file drawer problem.

This is why I so much like Robert Abelson's ideas that he puts forth in Statistics as Principled Argument. That is, statistical evidence should be part of a principled argument as to why something is the case and should be judged on the MAGIC criteria:

  • Magnitude: How big is the effect?
  • Articulation: Is it full of "ifs", "ands", and "buts"? (That's bad.)
  • Generality: How widely does it apply?
  • Interestingness
  • Credibility: Incredible claims require a lot of evidence
Peter Flom
  • 4
    Could one even say "if 1M studies are done and _even_ if all the null hypotheses are true, then approximately 50,000 will commit a Type I error and incorrectly reject the null hypothesis"? If a researcher gets p<0.05 they only know that "h0 is correct and a rare event has occurred OR h1 is incorrect". There's no way of telling which it is by only looking at the results of this one study, is there? – n_mu_sigma Jul 19 '15 at 13:31
  • 1
    Besides, having alpha at 0.05, the "rare event" will not be that rare at all. It's 1 in 20. Meaning if I collect 40 independent variables in the same study, I have a good chance of getting one p<0.05 due to noise - and proceed to publish the result (see the chocolate story) – n_mu_sigma Jul 19 '15 at 13:32
  • 5
    You can only get a false positive if the positive is, in fact, false. If you picked 40 IVs that were all noise, then you would have a good chance of a type I error. But generally we pick IVs for a reason. And the null is false. You can't make a type I error if the null is false. – Peter Flom Jul 19 '15 at 14:24
  • 6
    I don't understand your second paragraph, including the bullet points, at all. Let's say for the sake of argument all 1 million studies were testing drug compounds for curing a specific condition. The null hypothesis for each of these studies is that the drug does not cure the condition. So, why must that be "essentially never strictly true"? Also, why do you say all the studies would have to be about nonsensical relationships, like ss# and IQ? Thanks for any additional explanation that can help me understand your point. – Chelonian Jul 19 '15 at 18:03
  • 3
    I say the null is never strictly true because - well, because it never is. There will be _some_ relationship between just about any treatment that might be tried and just about any condition it might be tried on, unless you deliberately choose nonsense. Heck, in the whole population, there is surely _some_ relationship between IQ and SSN - it's likely very very small, but it's there. With a large enough N, it would be significant. – Peter Flom Jul 19 '15 at 20:13
  • 11
    To make @PeterFlom's examples concrete: the first three digits of an SSN (used to) encode the applicant's zip code. Since the individual states have somewhat different demographics and toe size might be correlated with some demographic factors (age, race, etc), there is almost certainly a relationship between social security number and toe size--if one has enough data. – Matt Krause Jul 19 '15 at 22:59
  • The point of the scientific method is that we should never rely on our intuition to decide whether or not questions such as 'Is there a relationship between SSN and IQ?' are nonsense. It would fly in the face of all our scientific theories if such a relationship were to exist, but the same could be said about some of the greatest experimental breakthroughs of our age. – John Gowers Jul 20 '15 at 10:00
  • 2
    It's true that we should not rely just on intuition, but that has nothing to do with my point: We do not do experiments at random. We do not choose variables at random. We do not make theories at random. – Peter Flom Jul 20 '15 at 11:24
  • 6
    @MattKrause good example. I prefer finger count by gender. I am sure if I took a census of all men and all women, I would find that one gender has more fingers on average than the other. Without taking an extremely large sample, I have no idea which gender has more fingers. Furthermore, I doubt as a glove manufacturer I would use finger census data in glove design. – emory Jul 20 '15 at 12:55
  • I would urge you to phrase your point about "MAGIC criteria" in the concrete and infallible terms of Bayes' theorem. If our goal is to estimate P(H|O) ( the probability the hypothesis is true), we simply cannot rely on P(O|H) (the p value) on its own. We need to estimate P(H) and P(O). "Incredible claims," a hypothesis full of "ifs" and "buts," or a hypothesis that is "complex," translate to a low P(H). An observation that is "not interesting" maps to a high P(O). Surely, talking about P(H) and P(O), difficult as they are to assign numeric values, would nevertheless be more "scientific." – Aleksandr Dubinsky Jul 24 '15 at 04:07
  • 1
    If we cannot assign sensible values to P(H) then it is surely more scientific to refrain from doing so. The larger point Abelson is trying to make is that statistics has to be _part_ of a principled argument, not the whole of it. – Peter Flom Jul 24 '15 at 11:18
42

Aren't all researchers around the world somewhat like the "infinite monkey theorem" monkeys?

Remember, scientists are critically NOT like infinite monkeys, because their research behavior--particularly experimentation--is anything but random. Experiments are (at least supposed to be) incredibly carefully controlled manipulations and measurements that are based on mechanistically informed hypotheses that build on a large body of previous research. They are not just random shots in the dark (or monkey fingers on typewriters).

Consider that there are 23,887 universities in the world. If each university has 1,000 students, that's about 23 million students each year. Let's say that each year, each student does at least one piece of research,

That estimate for the number of published research findings has got to be way, way off. I don't know if there are 23 million "university students" (does that just include universities, or colleges too?) in the world, but I know that the vast majority of them never publish any scientific findings. I mean, most of them are not science majors, and even most science majors never publish findings.

A more likely estimate (some discussion) for the number of scientific publications each year is about 1–2 million.

Doesn't that mean that even if all the research samples were pulled from a random population, about 5% of them would "reject the null hypothesis as invalid"? Wow. Think about that. That's about a million research papers per year getting published due to "significant" results.

Keep in mind, not all published research has statistics where significance is right at the p = 0.05 value. Often one sees p values like p<0.01 or even p<0.001. I don't know what the "mean" p value is over a million papers, of course.

If this is how it works, this is scary. It means that a lot of the "scientific truth" we take for granted is based on pure randomness.

Also keep in mind, scientists are really not supposed to take a small number of results at p around 0.05 as "scientific truth". Not even close. Scientists are supposed to integrate over many studies, each of which has appropriate statistical power, plausible mechanism, reproducibility, magnitude of effect, etc., and incorporate that into a tentative model of how some phenomenon works.

But, does this mean that almost all of science is correct? No way. Scientists are human, and fall prey to biases, bad research methodology (including improper statistical approaches), fraud, simple human error, and bad luck. These factors, rather than the p < 0.05 convention, are probably the dominant reasons why a healthy portion of published science is wrong. In fact, let's just cut right to the chase, and make an even "scarier" statement than what you have put forth:

Why Most Published Research Findings Are False

Chelonian
  • 10
    I'd say that Ioannidis is making a rigorous argument that backs up the question. Science is not done anything like as well as the optimists answering here seem to think. And a lot of published research is never replicated. Moreover, when replication is attempted, the results tend to back up the Ioannidis argument that much published science is basically bollocks. – matt_black Jul 19 '15 at 18:40
  • 9
    It may be of interest that in particle physics our p-value threshold to claim a discovery is 0.00000057. – David Z Jul 20 '15 at 07:57
  • 2
    And in many cases, there are no p values at all. Mathematics and theoretical physics are common cases. – Davidmh Jul 22 '15 at 21:43
21

Your understanding of $p$-values seems to be correct.

Similar concerns are voiced quite often. What makes sense to compute in your example is not only the number of studies out of 23 mln that arrive at false positives, but also the proportion of studies with a significant result in which that result is in fact a false positive. This is called the "false discovery rate". It is not equal to $\alpha$ and depends on various other things, such as e.g. the proportion of true nulls across your 23 mln studies. This is of course impossible to know, but one can make guesses. Some people say that the false discovery rate is at least 30%.
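To see how this plays out, here is a minimal R sketch with made-up inputs (the proportion of true nulls and the average power are pure assumptions, not estimates of anything):

alpha   <- 0.05   # significance level
power   <- 0.80   # assumed average power when the null is false
pi_null <- 0.50   # assumed proportion of studies in which the null is actually true
false_pos <- pi_null * alpha          # expected share of all studies that are false positives
true_pos  <- (1 - pi_null) * power    # expected share that are true positives
false_pos / (false_pos + true_pos)    # false discovery rate: about 0.06 here, but about 0.69 if pi_null = 0.9 and power = 0.2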

See e.g. this recent discussion of a 2014 paper by David Colquhoun: Confusion with false discovery rate and multiple testing (on Colquhoun 2014). I have been arguing there against this "at least 30%" estimate, but I do agree that in some fields of research the false discovery rate can be a lot higher than 5%. This is indeed worrisome.

I don't think that saying that null is almost never true helps here; Type S and Type M errors (as introduced by Andrew Gelman) are not much better than type I/II errors.

I think what it really means is that one should never trust an isolated "significant" result.

This is even true in high energy physics with their super-stringent $\alpha\approx 10^{-7}$ criterion; we believe the discovery of the Higgs boson partially because it fits so well to the theory prediction. This is of course much much MUCH more so in some other disciplines with much lower conventional significance criteria ($\alpha=0.05$) and lack of very specific theoretical predictions.

Good studies, at least in my field, do not report an isolated $p<0.05$ result. Such a finding would need to be confirmed by another (at least partially independent) analysis, and by a couple of other independent experiments. If I look at the best studies in my field, I always see a whole bunch of experiments that together point at a particular result; their "cumulative" $p$-value (that is never explicitly computed) is very low.
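For the curious, one standard way such a combined $p$-value could be computed is Fisher's method; here is a minimal sketch with purely hypothetical $p$-values (an illustration of the idea, not something those papers actually report):

p_values <- c(0.04, 0.03, 0.07, 0.02)   # hypothetical p-values from four independent experiments
stat <- -2 * sum(log(p_values))         # under the joint null this is chi-squared with 2k degrees of freedom
pchisq(stat, df = 2 * length(p_values), lower.tail = FALSE)   # roughly 8e-4, far smaller than any single p-value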

To put it differently, I think that if a researcher gets some $p<0.05$ finding, it only means that he or she should go and investigate it further. It definitely does not mean that it should be regarded as "scientific truth".

amoeba
  • Re "cumulative p values": Can you just multiply the individual p values, or do you need to do some monstrous combinatorics to make it work? – Kevin Jul 20 '15 at 23:02
  • @Kevin: one can multiply individual $p$-values, but one needs to adapt the significance threshold $\alpha$. Think of 10 random $p$-values uniformly distributed on [0,1] (i.e. generated under null hypothesis); their product will most likely be below 0.05, but it would be nonsense to reject the null. Look for Fisher's method of combining p-values; there's a lot of threads about it here on CrossValidated too. – amoeba Jul 22 '15 at 09:44
17

Your concern is exactly the concern that underlies a great deal of the current discussion in science about reproducibility. However, the true state of affairs is a bit more complicated than you suggest.

First, let's establish some terminology. Null hypothesis significance testing can be understood as a signal detection problem -- the null hypothesis is either true or false, and you can either choose to reject or retain it. The combination of two decisions and two possible "true" states of affairs results in the following table, which most people see at some point when they're first learning statistics:

                 H0 is true                 H0 is false
Reject H0        Type I error               Correct decision
                 (false positive)           (true positive)
Retain H0        Correct decision           Type II error
                 (true negative)            (false negative)

Scientists who use null hypothesis significance testing are attempting to maximize the number of correct decisions (the true positives and true negatives in the table) and minimize the number of incorrect decisions (the Type I and Type II errors). Working scientists are also trying to publish their results so that they can get jobs and advance their careers.

Of course, bear in mind that, as many other answerers have already mentioned, the null hypothesis is not chosen at random -- instead, it is usually chosen specifically because, based on prior theory, the scientist believes it to be false. Unfortunately, it is hard to quantify the proportion of times that scientists are correct in their predictions, but bear in mind that, when scientists are dealing with the "$H_0$ is false" column, they should be worried about false negatives rather than false positives.


You, however, seem to be concerned about false positives, so let's focus on the "$H_0$ is true" column. In this situation, what is the probability of a scientist publishing a false result?

Publication bias

As long as the probability of publication does not depend on whether the result is "significant", then the probability is precisely $\alpha$ -- .05, and sometimes lower depending on the field. The problem is that there is good evidence that the probability of publication does depend on whether the result is significant (see, for example, Stern & Simes, 1997; Dwan et al., 2008), either because scientists only submit significant results for publication (the so-called file-drawer problem; Rosenthal, 1979) or because non-significant results are submitted for publication but don't make it through peer review.

The general issue of the probability of publication depending on the observed $p$-value is what is meant by publication bias. If we take a step back and think about the implications of publication bias for a broader research literature, a research literature affected by publication bias will still contain true results -- sometimes the null hypothesis that a scientist claims to be false really will be false, and, depending on the degree of publication bias, sometimes a scientist will correctly claim that a given null hypothesis is true. However, the research literature will also be cluttered up by too large a proportion of false positives (i.e., studies in which the researcher claims that the null hypothesis is false when really it's true).
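As a rough illustration of this clutter, here is a small simulation (my own sketch; the 10% publication rate for non-significant results is a made-up assumption, not an estimate from the papers cited above):

set.seed(1)
# 10,000 hypothetical studies of effects that do not exist (the null is true in every one)
p <- replicate(10000, t.test(rnorm(20))$p.value)
significant <- p < 0.05
mean(significant)                        # about 0.05 of all *conducted* studies are false positives

# Publication bias: significant results are always published, non-significant ones only 10% of the time
published <- significant | (runif(10000) < 0.10)
mean(significant[published])             # about a third of the *published* studies now report a false positive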

Researcher degrees of freedom

Publication bias is not the only way that, under the null hypothesis, the probability of publishing a significant result will be greater than $\alpha$. When used improperly, certain areas of flexibility in the design of studies and analysis of data, which are sometimes labeled researcher degrees of freedom (Simmons, Nelson, & Simonsohn, 2011), can increase the rate of false positives, even when there is no publication bias. For example, if we assume that, upon obtaining a non-significant result, all (or some) scientists will exclude one outlying data point if this exclusion will change the non-significant result into a significant one, the rate of false positives will be greater than $\alpha$. Given the presence of a large enough number of questionable research practices, the rate of false positives can go as high as .60 even if the nominal rate was set at .05 (Simmons, Nelson, & Simonsohn, 2011).
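Here is a minimal simulation of that single practice (my own sketch of the mechanism, not a reproduction of Simmons, Nelson, and Simonsohn's analyses): a true null is tested, and the most extreme observation is dropped only when doing so might rescue significance.

set.seed(1)
one_study <- function(n = 20) {
  x <- rnorm(n)                                  # the null hypothesis is true
  p <- t.test(x)$p.value
  if (p >= 0.05) {                               # "not significant? try excluding the most extreme point"
    p <- min(p, t.test(x[-which.max(abs(x - mean(x)))])$p.value)
  }
  p
}
mean(replicate(10000, one_study()) < 0.05)       # comes out above the nominal 0.05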

It's important to note that the improper use of researcher degrees of freedom (which is sometimes known as a questionable research practice; Martinson, Anderson, & de Vries, 2005) is not the same as making up data. In some cases, excluding outliers is the right thing to do, either because equipment fails or for some other reason. The key issue is that, in the presence of researcher degrees of freedom, the decisions made during analysis often depend on how the data turn out (Gelman & Loken, 2014), even if the researchers in question are not aware of this fact. As long as researchers use researcher degrees of freedom (consciously or unconsciously) to increase the probability of a significant result (perhaps because significant results are more "publishable"), the presence of researcher degrees of freedom will overpopulate a research literature with false positives in the same way as publication bias.


An important caveat to the above discussion is that scientific papers (at least in psychology, which is my field) seldom consist of single results. More common are multiple studies, each of which involves multiple tests -- the emphasis is on building a larger argument and ruling out alternative explanations for the presented evidence. However, the selective presentation of results (or the presence of researcher degrees of freedom) can produce bias in a set of results just as easily as in a single result. There is evidence that the results presented in multi-study papers are often much cleaner and stronger than one would expect even if all the predictions of these studies were true (Francis, 2013).
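A back-of-the-envelope version of that logic (the power value here is a made-up assumption, used only to show the shape of the argument behind Francis, 2013):

power <- 0.6    # assumed power of each individual study in a five-study paper
power^5         # about 0.08: even if every effect is real, 5 significant results out of 5 is surprisingly unlikely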


Conclusion

Fundamentally, I agree with your intuition that null hypothesis significance testing can go wrong. However, I would argue that the true culprits producing a high rate of false positives are processes like publication bias and the presence of researcher degrees of freedom. Indeed, many scientists are well aware of these problems, and improving scientific reproducibility is a very active current topic of discussion (e.g., Nosek & Bar-Anan, 2012; Nosek, Spies, & Motyl, 2012). So you are in good company with your concerns, but I think there are also reasons for some cautious optimism.


References

Stern, J. M., & Simes, R. J. (1997). Publication bias: Evidence of delayed publication in a cohort study of clinical research projects. BMJ, 315(7109), 640–645. http://doi.org/10.1136/bmj.315.7109.640

Dwan, K., Altman, D. G., Arnaiz, J. A., Bloom, J., Chan, A., Cronin, E., … Williamson, P. R. (2008). Systematic review of the empirical evidence of study publication bias and outcome reporting bias. PLoS ONE, 3(8), e3081. http://doi.org/10.1371/journal.pone.0003081

Rosenthal, R. (1979). The file drawer problem and tolerance for null results. Psychological Bulletin, 86(3), 638–641. http://doi.org/10.1037/0033-2909.86.3.638

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366. http://doi.org/10.1177/0956797611417632

Martinson, B. C., Anderson, M. S., & de Vries, R. (2005). Scientists behaving badly. Nature, 435, 737–738. http://doi.org/10.1038/435737a

Gelman, A., & Loken, E. (2014). The statistical crisis in science. American Scientist, 102, 460-465.

Francis, G. (2013). Replication, statistical consistency, and publication bias. Journal of Mathematical Psychology, 57(5), 153–169. http://doi.org/10.1016/j.jmp.2013.02.003

Nosek, B. A., & Bar-Anan, Y. (2012). Scientific utopia: I. Opening scientific communication. Psychological Inquiry, 23(3), 217–243. http://doi.org/10.1080/1047840X.2012.692215

Nosek, B. A., Spies, J. R., & Motyl, M. (2012). Scientific utopia: II. Restructuring incentives and practices to promote truth over publishability. Perspectives on Psychological Science, 7(6), 615–631. http://doi.org/10.1177/1745691612459058

Patrick S. Forscher
  • 1
    +1. Nice collection of links. Here is one very relevant paper for your "Researcher degrees of freedom" section: [The garden of forking paths: Why multiple comparisons can be a problem, even when there is no "fishing expedition" or "p-hacking" and the research hypothesis was posited ahead of time](http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf) by Andrew Gelman and Eric Loken (2013). – amoeba Jul 21 '15 at 18:11
  • Thanks, @amoeba, for that interesting reference. I especially like the point that Gelman and Loken (2013) make that capitalizing on researcher degrees of freedom need not be a conscious process. I've edited my answer to include that paper. – Patrick S. Forscher Jul 21 '15 at 18:39
  • I just found the published version of Gelman & Loken (2014) in American Scientist. – Patrick S. Forscher Jul 21 '15 at 18:47
10

A substantial check on the important issue raised in this question is that "scientific truth" is not based on individual, isolated publications. If a result is sufficiently interesting it will prompt other scientists to pursue the implications of the result. That work will tend to confirm or refute the original finding. There might be a 1/20 chance of rejecting a true null hypothesis in an individual study, but only a 1/400 chance of doing so twice in a row.
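A quick sketch of that arithmetic, assuming independent studies with a true null:

alpha <- 0.05
alpha^2      # 0.0025 = 1/400: the chance that two independent studies both falsely reject
1 / alpha    # about 20 attempts expected per spurious "significant" result if someone just keeps trying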

If scientists simply repeated experiments until they found "significance" and then published their results, the problem might be as large as the OP suggests. But that's not how science works, at least in my nearly 50 years of experience in biomedical research. Furthermore, a publication is seldom about a single "significant" experiment but rather is based on a set of inter-related experiments (each required to be "significant" on its own) that together provide support for a broader, substantive hypothesis.

A much larger problem comes from scientists who are too committed to their own hypotheses. They then may over-interpret the implications of individual experiments to support their hypotheses, engage in dubious data editing (like arbitrarily removing outliers), or (as I have seen and helped catch) just make up the data.

Science, however, is a highly social process, regardless of the mythology about mad scientists hiding high up in ivory towers. The give and take among thousands of scientists pursuing their interests, based on what they have learned from others' work, is the ultimate institutional protection from false positives. False findings can sometimes be perpetuated for years, but if an issue is sufficiently important the process will eventually identify the erroneous conclusions.

EdM
  • 7
    The $1/4000$ estimate may be misleading. If one is in the business of repeating experiments until achieving "significance" and then publishing, then the expected number of experiments needed to publish an initial "significant" result and to follow it up with a second "significant" result is only $40$. – whuber Jul 19 '15 at 13:25
  • 2
    Out of 23M studies, we still couldn't tell if 5,000 results reject the null hypothesis only due to noise, could we? It really is also a problem of scale. Once you have millions of studies, Type I errors will be common. – n_mu_sigma Jul 19 '15 at 13:34
  • 3
    If there were only 5000 erroneous conclusions out of 23,000,000 studies I would call that *uncommon* indeed! – whuber Jul 19 '15 at 14:39
  • 3
    In nearly 50 years of doing science and knowing other scientists, I can't think of any who repeated experiments until they achieved "significance." The theoretical possibility raised by @whuber is, in my experience, not a big practical problem. The much bigger practical problem is making up data, either indirectly by throwing away "outliers" that don't fit a preconception, or by just making up "data" to start with. Those behaviors I have seen first hand, and they can't be fixed by adjusting _p_-values. – EdM Jul 19 '15 at 14:53
  • 3
    @EdM "There might be a 1/20 chance of rejecting a true null hypothesis in an individual study, but only a 1/4000 of doing so twice in a row." How did you get the second number? – Aksakal Jul 21 '15 at 18:49
  • @whuber about *repeating experiments until achieving "significance"* oh yes, cf. http://www.explainxkcd.com/wiki/index.php/882 – Stéphane Gourichon Jul 22 '15 at 06:45
  • 1
    @Aksakal I got that second number by typographical error when trying to type an answer on an iPad. I meant 1/400 (1/20 x 1/20) and have now fixed that in my edited answer, which now incorporates some of the commentary – EdM Jul 22 '15 at 19:43
5

Just to add to the discussion, here is an interesting post and subsequent discussion about how people are commonly misunderstanding p-value.

What should be retained in any case is that a p-value is just a measure of the strength of evidence against a given hypothesis. A p-value is definitely not a hard threshold below which something is "true" and above which it is only due to chance. As explained in the post referenced above:

results are a combination of real effects and chance, it’s not either/or
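A tiny R sketch of that point (my own illustration, not taken from the linked post): even the pile of "significant" results is a mixture of real effects and pure chance.

set.seed(1)
p_real <- replicate(1000, t.test(rnorm(20, mean = 0.5))$p.value)   # studies of a real effect
p_null <- replicate(1000, t.test(rnorm(20, mean = 0.0))$p.value)   # studies of a true null
c(real = sum(p_real < 0.05), chance = sum(p_null < 0.05))          # both kinds contribute "significant" results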

Antoine
  • maybe this will contribute to the understanding of p-values: http://stats.stackexchange.com/questions/166323/misunderstanding-a-p-value/166327#166327 –  Sep 06 '15 at 15:36
4

As also pointed out in the other answers, this will only cause problems if you selectively consider the positive results where the null hypothesis is ruled out. This is why scientists write review articles, where they consider previously published research results and try to develop a better understanding of the subject based on them. However, there then still remains a problem due to the so-called "publication bias": scientists are more likely to write up an article about a positive result than about a negative result, and a paper on a negative result is also more likely to get rejected for publication than a paper on a positive result.

This is an especially big problem in fields where statistical tests are very important; medicine is a notorious example. This is why it was made compulsory to register clinical trials before they are conducted (e.g. here): you must explain the setup, how the statistical analysis is going to be performed, etc., before the trial gets underway. The leading medical journals will refuse to publish papers on trials that were not registered.

Unfortunately, despite this measure, the system isn't working all that well.

Count Iblis
  • maybe this will contribute to the understanding of p-values: http://stats.stackexchange.com/questions/166323/misunderstanding-a-p-value/166327#166327 –  Sep 06 '15 at 15:37
3

This is close to a very important fact about the scientific method: it emphasizes falsifiability. The philosophy of science that is most popular today has Karl Popper's concept of falsifiability as a cornerstone.

The basic scientific process is thus:

  • Anyone can claim any theory they want, at any time. Science will admit any theory which is "falsifiable." The most literal sense of that word is that, if anyone else doesn't like the claim, that person is free to spend the resources to disprove the claim. If you don't think argyle socks cure cancer, you are free to use your own medical ward to disprove it.

  • Because this bar for entry is monumentally low, it is traditional that "Science" as a cultural group will not really entertain any idea until you have done a "good effort" to falsify your own theory.

  • Acceptance of ideas tends to go in stages. You can get your concept into a journal article with one study and a rather low p-value. What that does buy you is publicity and some credibility. If someone is interested in your idea, such as if your science has engineering applications, they may want to use it. At that time, they are more likely to fund an additional round of falsification.

  • This process goes forward, always with the same attitude: believe what you want, but to call it science, I need to be able to disprove it later.

This low bar for entry is what allows it to be so innovative. So yes, there are a large number of theoretically "wrong" journal articles out there. However, the key is that every published article is in theory falsifiable, so at any point in time, someone could spend the money to test it.

This is the key: journals contain not only things which pass a reasonable p-test, but they also contain the keys for others to dismantle it if the results turn out to be false.

Cort Ammon
  • 1
    This is very idealistic. Some people are concerned that too many wrong papers can create too low signal-to-noise ratio in the literature and seriously slow down or misguide the scientific process. – amoeba Jul 20 '15 at 15:22
  • 1
    @amoeba You do bring up a good point. I certainly wanted to capture the ideal case because I find it is oft lost in the noise. Beyond that, I think the question of SNR in the literature is a valid question, but at least it is one that should be balancable. There's already concepts of good journals vs poor journals, so there's some hints that that balancing act has been underway for some time. – Cort Ammon Jul 20 '15 at 16:07
  • This grasp of the philosophy of science seems to be several decades out of date. Popperian falsifiability is only "popular" in the sense of being a *common* urban myth about how science happens. – 410 gone Jul 21 '15 at 17:24
  • @EnergyNumbers Could you enlighten me on the new way of thinking? The philosophy SE has a very different opinion from yours. If you look at the question history over there, Popperian falsifiability is *the* defining characteristic of science for the majority of those who spoke their voice. I'd love to learn a newer way of thinking and bring it over there! – Cort Ammon Jul 21 '15 at 17:56
  • New? Kuhn refuted Popper decades ago. If you've got no one post Popperian on philosophy.se, then updating it would seem to be a lost cause - just leave it in the 1950s. If you want to update yourself, then any undergraduate primer from the 21st-century on the philosophy of science should get you started. – 410 gone Jul 21 '15 at 18:32
  • @EnergyNumbers Thank you very much for that link. I was quoting the party line regarding falsifiability of science because it was the only consensus I had seen. As it turns out, I have been arguing Kuhn's position, to the letter, for the last 3 years in wide communities (including scientific) and usually found hostility towards my opinions. You are the first to point out that my ideas were not new! Thanks! – Cort Ammon Jul 21 '15 at 20:01
  • It's surprising how one can believe Kuhn completely refutes Popperian philosophy, and yet his views are still considered heretical in many encampments. – Cort Ammon Jul 21 '15 at 20:03
  • which just justifies what Kuhn said of course, in some ways. Lots of people came after Kuhn. Bas van Fraassen might be your next thing to read. Check in with Lakatos too (earlier but interesting). And Feyerabend. – 410 gone Jul 21 '15 at 21:21
  • This is all about how to distinguish good signal from noise. @Cort_Ammon raises a good point. I "up" this answer. This is so meta. – Stéphane Gourichon Jul 22 '15 at 06:49
  • @EnergyNumbers economists still namedrop Popper all the time. – shadowtalker Jul 22 '15 at 14:05
1

Is this how "science" is supposed to work?

That's how a lot of social science works. Not so much the physical sciences. Think of this: you typed your question on a computer. People were able to build these complicated beasts called computers using knowledge from physics, chemistry, and other physical sciences. If the situation were as bad as you describe, none of the electronics would work. Or think of things like the mass of an electron, which is known with insane precision. Electrons pass through billions of logic gates in a computer over and over, and your computer still works, and keeps working for years.

UPDATE: To respond to the down votes I received, I felt inspired to give you a couple of examples.

The first one is from physics: Bystritsky, V. M., et al. "Measuring the astrophysical S factors and the cross sections of the p (d, γ) 3He reaction in the ultralow energy region using a zirconium deuteride target." Physics of Particles and Nuclei Letters 10.7 (2013): 717-722.

As I wrote before, these physicists don't even pretend to do any statistics beyond computing standard errors. There are a bunch of graphs and tables, but not a single p-value or even a confidence interval. The only evidence of statistics is standard errors noted as $0.237 \pm 0.061$, for instance.

My next example is from... psychology: Paustian-Underdahl, Samantha C., Lisa Slattery Walker, and David J. Woehr. "Gender and perceptions of leadership effectiveness: A meta-analysis of contextual moderators." Journal of Applied Psychology, 2014, Vol. 99, No. 6, 1129 –1145.

These researchers have all the usual suspects: confidence intervals, p-values, $\chi^2$ etc.

Now, look at some tables from papers and guess which papers they are from:

[Images: a results table from each of the two papers above]

That's the answer to why in one case you need "cool" statistics and in the other you don't: because the data is either crappy or it isn't. When you have good data, you don't need much stats beyond standard errors.

UPDATE2: @PatrickS.Forscher made an interesting statement in the comment:

It is also true that social science theories are "softer" (less formal) than physics theories.

I must disagree. In Economics and Finance the theories are not "soft" at all. You can look up a paper in these fields at random and get something like this:

[Image: a block of formal definitions and theorems from the paper cited below]

and so on.

It's from Schervish, Mark J., Teddy Seidenfeld, and Joseph B. Kadane. "Extensions of expected utility theory and some limitations of pairwise comparisons." (2003). Does this look soft to you?

I'm reiterating my point here that when your theories are not good and the data is crappy, you can use the hardest math and still get a crappy result.

In this paper they're talking about utilities, concepts like happiness and satisfaction that are absolutely unobservable. What is the utility of owning a house vs. eating a cheeseburger? Presumably there's this function where you can plug in "eat cheeseburger" or "live in own house" and the function will spit out the answer in some units. As crazy as it sounds, this is what modern economics is built on, thanks to von Neumann.

Aksakal
  • 1
    +1 Not sure why this was downvoted twice. You are basically pointing out that discoveries in physics can be tested with experiments, and most "discoveries" in the social sciences can't be, which doesn't stop them getting plenty of media attention. – Flounderer Jul 20 '15 at 04:53
  • 6
    Most experiments ultimately involve some sort of statistical test and still leave room for type 1 errors and misbehaviours like p-value fishing. I think that singling out the social sciences is a bit off mark. – Kenji Jul 20 '15 at 11:23
  • 1
    Actually, most experiments don't have any statistical tests. They often mention the standard deviation of errors only. – Aksakal Jul 20 '15 at 11:42
  • 1
    @Flounderer, some areas in physical sciences suffer from the same issue, e.g. climate science. It's usually the case with observational studies or when the experiments are very expensive, such as in cosmology. – Aksakal Jul 20 '15 at 12:20
  • 4
    To amend a bit what @GuilhermeKenjiChihaya is saying, the standard deviation of the errors could presumably be used to perform a statistical test in physical experiments. Presumably this statistical test would come to the same conclusion that the authors reach upon viewing the graph with its error bars. The main difference with physics papers, then, is the underlying amount of noise in the experiment, a difference that is independent of whether the logic underlying the use of p-values is valid or invalid. – Patrick S. Forscher Jul 21 '15 at 20:02
  • 3
    Also, @Flounderer, you seem to be using the term "experiment" in a sense with which I am unfamiliar, as social scientists do "experiments" (i.e., randomization of units to conditions) all the time. It is true that social science experiments are difficult to control to the same degree that is present as in physics experiments. It is also true that social science theories are "softer" (less formal) than physics theories. But these factors are independent of whether a given study is an "experiment". – Patrick S. Forscher Jul 21 '15 at 20:06
  • @PatrickS.Forscher updated my answer – Aksakal Jul 22 '15 at 02:38
  • @PatrickS.Forscher You are right. I mean in the sense of "testing something out in the real world and seeing if it works", not in the sense of statistics. – Flounderer Jul 22 '15 at 03:29
  • 1
    It's unusual, I'd say, to use the difficulty of math being used as a criterion for hard vs soft. There's been a purposeful trend by economists towards using more advanced mathematics, and I wager part of the motivation is to be seen as "hard". I don't think that's worked out too well, since excessive use of mathematics is easily an indicator of bogosity or an intimidation tactic, as a way of obscuring that you aren't doing much. From that viewpoint, economics is not even soft, but a pseudo-science. – Chan-Ho Suh Jul 22 '15 at 04:45
  • 1
    Incidentally, I don't believe any of the authors of the paper you quote from self-identify as economists. Seidenfeld is in the philosophy and statistics departments and the other two are in the statistics department. That doesn't mean their work doesn't lie in the realm of economics, but I do find it interesting that your main example of how "hard" economics is relies on a paper by statisticians and a philosopher studying decision theory. I doubt critics of economics as a hard science are really attacking such areas heavily overlapping with computer science, math, and statistics. – Chan-Ho Suh Jul 22 '15 at 04:53
  • 2
    @Aksakal while I disagree with -1's, I also partly disagree with your critic of social sciences. Your example of economic paper is also not a good example of what social scientists do on daily basis because the utility theory is a strictly economical/mathematical/statistical concept (so it already *has* math in it) and it does not resemble e.g. psychological theories that are tested experimentally... However I agree that it is often the case that statistics are used loosely in many areas of research, including social sciences. – Tim Jul 22 '15 at 07:15
  • The fact that social science theories _aren't_ softer is something that has bothered me for years and ultimately got me to avoid the economics PhD I thought I had wanted. – shadowtalker Jul 22 '15 at 14:04
  • @ssdecontrol "The word “model” sounds more scientific than “fable” or “fairy tale”, but I don’t see much difference between them" from Ariel Rubinstein's [notes](http://arielrubinstein.tau.ac.il/Rubinstein2007.pdf) on micro – Aksakal Jul 22 '15 at 14:35
  • @Aksakal I've heard that sentiment expressed in a few places. I'm sure Rubinstein believes it, but a lot of the time it came across to as an instance of "do as I say, not as I do." – shadowtalker Jul 22 '15 at 14:45
  • 1
    Every interval is definable in terms of a test and a p-value and every test and p-value can be (and often is) used to define an interval. So the idea that one is somehow avoiding the 'statistical' problems of tests and p-values by looking at 'non-statistical' intervals instead is just silly. – conjugateprior Jul 26 '15 at 12:14
  • 1
    And the idea that "there's this function, where you can plug "eat cheeseburger" or "live in own house" and the function will spit out the answer in some units" is "what modern ecnomics is built on, thank to von Neuman." is just false. For von Neumann and Morgenstern each rational individual will act as though they *each* had such a function. But not only are those functions just analytic constructions from observed choices - so *not* necessarily anything like "happiness or satisfaction" - but the functions won't even be comparable between individuals. – conjugateprior Jul 26 '15 at 12:49
  • @conjugateprior, you're wrong to say that p-values are defined for every interval. You need a distributional assumption for that. In the example paper from physics there's nothing about probabilities, and that was my assertion: a vast majority of physics research doesn't bother with probabilities. – Aksakal Jul 26 '15 at 14:00
  • 1
    Accepting purely for the sake of argument that "the only evidence of statistics is the standard errors" do you really believe that no distributional assumption has been made when you compute a standard error? – conjugateprior Jul 26 '15 at 17:19
  • @conjugateprior, it's the square root of the variance. Do you have other definitions? – Aksakal Jul 26 '15 at 17:24
  • Oh, I don't know, maybe this one: http://mathworld.wolfram.com/StandardError.html or this one https://en.wikipedia.org/wiki/Standard_error – conjugateprior Jul 26 '15 at 17:29
  • Also, it's not "the square root of the variance". That's the standard deviation you're thinking of. – conjugateprior Jul 26 '15 at 17:33
  • @conjugateprior, the same thing. Social scientists love little details about statistics because their data is crappy and "experiments" are not repeatable. In the vast majority of physics research it doesn't matter whether you use a biased or unbiased variance estimator. There's never a discussion of these details; they talk about substance. The crappier the data, the more stats you need - that's why social scientists are so much more educated in stats. – Aksakal Jul 26 '15 at 17:37