Your question raises quite a few issues, so I will try to address each of them in turn. To frame these issues clearly, it is important to note at the outset that a p-value is a continuous measure of evidence against the null hypothesis (in favour of the stated alternative); when we compare it to a stipulated significance level to reach a conclusion of "statistical significance", we dichotomise that continuous measure of evidence into a binary one.
"It makes no sense to tell people that the result is not significant in a sample of 71, but it's significant in a sample of 77."
You need to decide which of those two is actually the appropriate sample; that is, is it appropriate to remove six data points from your data? For reasons explained many times on this site (e.g., here and here), it is a bad idea to remove "outliers" unless they arise from incorrect recording of observations. So, unless you have reason to believe the removed points were recorded incorrectly, it is probably appropriate to use all 77 data points, in which case it makes no sense to say anything about the cherry-picked subsample of 71 data points.
Note that the problem here has nothing to do with statistical significance. It makes perfect sense that the outcomes of different hypothesis tests (e.g., the same test on different data) can differ, so there is no reason to regard it as problematic that there is statistically significant evidence for the alternative hypothesis in one case but not in the other. This is a natural consequence of obtaining a binary outcome by drawing a line of "significance" through a continuous measure of evidence.
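To make the dichotomisation concrete, here is a minimal Python sketch (using simulated data, not your actual 77 observations) that runs the same Pearson correlation test on a full sample of 77 points and on a trimmed sample of 71 points, then compares each continuous p-value to a fixed significance level. The "drop the 6 most extreme points" rule and all variable names are purely illustrative assumptions.

```python
# Illustrative sketch only: simulated data, not the data from the question.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulate 77 (x, y) pairs with a weak true association (assumed, for illustration).
n_full = 77
x = rng.normal(size=n_full)
y = 0.25 * x + rng.normal(size=n_full)

# Hypothetical "outlier removal": drop the 6 points with the largest |y|.
keep = np.argsort(np.abs(y))[:-6]
x_trim, y_trim = x[keep], y[keep]

alpha = 0.05
for label, (xx, yy) in {"n = 77 (all data)": (x, y),
                        "n = 71 (6 removed)": (x_trim, y_trim)}.items():
    r, p = stats.pearsonr(xx, yy)
    verdict = "significant" if p < alpha else "not significant"
    print(f"{label}: r = {r:.3f}, p = {p:.3f} -> {verdict} at alpha = {alpha}")
```

Depending on the particular draw, the two p-values can easily land on opposite sides of alpha even though the underlying evidence barely changes, which is exactly the dichotomisation issue described above.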
"It is important to link the results to the findings in the literature when interpreting a trend. Although we find a weak trend here, this trend aligns with numerous studies in the literature that find significant correlations between these two variables."
If this is something you want to do, then the appropriate exercise is to do a meta-analysis to take account of all the data in the literature. The mere fact that there is other literature with other data/evidence is not a justification for treating the data in this paper any differently than you otherwise would. Do your data analysis on the data in your own paper. If you are concerned that your own result is an aberration from the literature, then note this other evidence. You can then either do a proper meta-analysis where all the data (yours and the other literature) is taken into account, or you can at least alert your reader to the scope of the available data.
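If you do want to combine your result with the literature, one standard approach is a fixed-effect meta-analysis of correlation coefficients via the Fisher z transform. Below is a minimal sketch of that calculation; the study correlations and sample sizes are hypothetical placeholders, not values from your data or from any particular paper.

```python
# Minimal fixed-effect meta-analysis sketch for correlations (Fisher z,
# inverse-variance weighting). Study values below are hypothetical.
import numpy as np
from scipy import stats

# (correlation r, sample size n) for each study; replace with real values.
studies = [(0.21, 77),   # e.g. your own data (placeholder value)
           (0.35, 120),  # hypothetical literature study 1
           (0.28, 95)]   # hypothetical literature study 2

r = np.array([s[0] for s in studies])
n = np.array([s[1] for s in studies])

z = np.arctanh(r)      # Fisher z transform of each correlation
var = 1.0 / (n - 3)    # approximate sampling variance of z
w = 1.0 / var          # inverse-variance weights

z_pooled = np.sum(w * z) / np.sum(w)
se_pooled = np.sqrt(1.0 / np.sum(w))

r_pooled = np.tanh(z_pooled)   # back-transform the pooled estimate to r
ci = np.tanh(z_pooled + np.array([-1, 1]) * 1.96 * se_pooled)
p_value = 2 * stats.norm.sf(abs(z_pooled / se_pooled))

print(f"pooled r = {r_pooled:.3f}, "
      f"95% CI = [{ci[0]:.3f}, {ci[1]:.3f}], p = {p_value:.4f}")
```

A proper meta-analysis would also consider between-study heterogeneity (e.g., a random-effects model), but even this simple pooling is a more defensible way to "link to the literature" than letting the literature decide whether your own result gets reported.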
"Here is my supervisor's reply: 'I would argue the other way: if it's no longer significant in the sample of 71, it's too weak to be reported. If there is a strong signal, we will see it in the smaller sample as well.' Shall I not report this 'not significant' result?"
Choosing not to report data because of the outcome of a statistical test (or because that outcome differs from the rest of the literature) is a terrible, horrible, statistically bankrupt practice. There is a ton of literature in statistical theory warning of the publication bias that occurs when researchers allow the outcome of their statistical tests to affect their decision to report or publish their data. Indeed, publication bias arising from publication decisions made on the basis of p-values is the bane of the scientific literature. It is probably one of the biggest problems in scientific and academic practice.
Regardless of how "weak" the evidence for the alternative hypothesis is, the data you have collected contains information that should be reported/published. It adds 77 data points to the literature, for whatever that is worth. You should report your data and report the p-value for your test. If this does not constitute statistically significant evidence of the effect under study, then so be it.