14

So, this may be a common question, but I’ve never found a satisfactory answer.

How do you determine the probability that the null hypothesis is true (or false)?

Let’s say you give students two different versions of a test and want to see if the versions were equivalent. You perform a t-Test and it gives a p-value of .02. What a nice p-value! That must mean it’s unlikely that the tests are equivalent, right? No. Unfortunately, it appears that P(results|null) doesn’t tell you P(null|results). The normal thing to do is to reject the null hypothesis when we encounter a low p-value, but how do we know that we are not rejecting a null hypothesis that is very likely true? To give a silly example, I can design a test for ebola with a false positive rate of .02: put 50 balls in a bucket and write “ebola” on one. If I test someone with this and they pick the “ebola” ball, the p-value (P(picking the ball|they don’t have ebola)) is .02, but I definitely shouldn’t reject the null hypothesis that they are ebola-free.
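
To make the bucket example concrete, here is a minimal sketch of the Bayes arithmetic (the 1-in-a-million prevalence is an invented number; note that the ball draw ignores disease status, so P(picking the ball|ebola) is also 1/50):

```python
# Minimal sketch of the bucket "ebola test", with an invented prevalence.
prior_ebola = 1e-6             # hypothetical prior P(ebola)
p_ball_given_healthy = 1 / 50  # the "p-value": P(pick the ball | no ebola) = .02
p_ball_given_ebola = 1 / 50    # the draw ignores disease status entirely

# Bayes' theorem: P(no ebola | picked the "ebola" ball)
p_ball = (p_ball_given_ebola * prior_ebola
          + p_ball_given_healthy * (1 - prior_ebola))
p_healthy_given_ball = p_ball_given_healthy * (1 - prior_ebola) / p_ball
print(p_healthy_given_ball)    # ~0.999999: the "significant" draw changes nothing
```

The posterior equals the prior because the likelihood of picking the ball is the same under both hypotheses, so the small p-value carries no evidence at all here.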

Things I’ve considered so far:

  1. Assuming P(null|results)~=P(results|null) – clearly false for some important applications.
  2. Accept or reject hypotheses without knowing P(null|results) – Why are we accepting or rejecting them then? Isn’t the whole point that we reject what we think is LIKELY false and accept what is LIKELY true?
  3. Use Bayes’ Theorem – But how do you get your priors? Don’t you end up back in the same place trying to determine them experimentally? And picking them a priori seems very arbitrary.
  4. I found a very similar question here: stats.stackexchange.com/questions/231580/. The one answer there seems to basically say that it doesn't make sense to ask about the probability of a null hypothesis being true since that's a Bayesian question. Maybe I'm a Bayesian at heart, but I can't imagine not asking that question. In fact, it seems that the most common misunderstanding of p-values is that they are the probability of a true null hypothesis. If you really can't ask this question as a frequentist, then my main question is #3: how do you get your priors without getting stuck in a loop?

Edit: Thank you for all the thoughtful replies. I want to address a couple common themes.

  1. Definition of probability: I'm sure there is a lot of literature on this, but my naive conception is something like "the belief that a perfectly rational being would have, given the information" or "the betting odds that would maximize profit if the situation were repeated and unknowns were allowed to vary".
  2. Can we ever know P(H0|results)? Certainly, this seems to be a tough question. I believe, though, that every probability is theoretically knowable, since probability is always conditional on the given information. Every event will either happen or not happen, so probability doesn't exist under full information. It only exists when there is insufficient information, so it should be knowable. For example, if I am told that someone has a coin and asked the probability of heads, I would say 50%. It may happen that the coin is weighted 70% to heads, but I wasn't given that information, so the probability WAS 50% for the info I had; likewise, even if it happens to land on tails, the probability WAS 70% heads once I learned about the weighting. Since probability is always conditional on a set of (insufficient) data, one can never lack the data needed to determine it, and so it should always be (theoretically) knowable.
    Edit: "Always" may be a little too strong. There may be some philosophical questions for which we can't determine probability. Still, in real-world situations, while we can "almost never" have absolute certainty, there should "almost always" be a best estimate.
Kalev Maricq
  • 253
  • 2
  • 7
  • 1
    If your 'null hypothesis' is something like $H_{0}: \theta = 0$, that is, that some difference is zero, then rejecting it means you have found strong enough evidence for $H_{A}: \theta \ne 0$. You could instead test a null hypothesis like $H_{0}: |\theta| \ge \Delta$, that is, that some difference is at least as big as $\Delta$ (where $\Delta$ is what the researcher deems the smallest difference they care about), and rejecting it means that you found $H_{A}: |\theta| < \Delta$ (i.e. $-\Delta < \theta < \Delta$). See tests for equivalence https://stats.stackexchange.com/tags/tost/info – Alexis Sep 27 '17 at 21:01
  • The power of an experiment (and of the statistical test analyzing the experiment's outcomes) is the probability that, if there were an effect of a given size or larger, the experiment would detect it at a given threshold of significance. https://www.statisticsdonewrong.com/power.html – Bennett Brown Sep 28 '17 at 06:13
  • see https://stats.stackexchange.com/questions/166323/misunderstanding-a-p-value/166327#166327 –  Sep 28 '17 at 12:58
  • Your coin example is a good one. It shows that you can never know P(H0|results) if you only know the results *and make no further assumptions*. Do you *know* the probability of heads in a given throw 'assuming' a certain fairness of the coin? Yes. (but this is hypothetical, given the assumptions, and you will never know if your assumptions are true) Do you *know* the probability of heads in a given throw while knowing a number of previous outcomes? No! And it does not matter how large the number of previous outcomes you know is. You cannot *exactly* know the probability of heads in the next throw. – Sextus Empiricus Sep 28 '17 at 15:02

4 Answers

13

You have certainly identified an important problem and Bayesianism is one attempt at solving it. You can choose an uninformative prior if you wish. I will let others fill in more about the Bayes approach.

However, in the vast majority of circumstances, you know the null is false in the population; you just don't know how big the effect is. For example, if you make up a totally ludicrous hypothesis - e.g. that a person's weight is related to whether their SSN is odd or even - and you somehow manage to get accurate information from the entire population, the two means will not be exactly equal. They will (probably) differ by some insignificant amount, but they won't match exactly. If you go this route, you will deemphasize p values and significance tests and spend more time looking at the estimate of effect size and its accuracy. So, if you have a very large sample, you might find that people with an odd SSN weigh 0.001 pounds more than people with an even SSN, and that the standard error for this estimate is 0.000001 pounds, so p < 0.05, but no one should care.
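
To make the arithmetic of that example explicit, here is a short sketch (the numbers are the ones quoted above; treating it as a two-sided z-test is my assumption, since the exact test isn't specified):

```python
# Sketch: a 0.001-pound difference with a 0.000001-pound standard error is
# overwhelmingly "significant" even though the effect itself is trivial.
from scipy import stats

estimate, se = 0.001, 0.000001  # pounds, as in the example above
z = estimate / se               # z = 1000
p = 2 * stats.norm.sf(abs(z))   # two-sided p-value
print(z, p)                     # 1000.0, 0.0 (underflows to zero)
```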

Peter Flom
  • 94,055
  • 35
  • 143
  • 276
  • 1
    Not that I disagree with you, but don't you think that when he worries about p(data|H0) or p(H0|data) he's talking about studies with low $n$? The example you give is easy in both frameworks, Bayesian and frequentist, because their respective weaknesses/subjectivity don't matter in light of abundant data. The only error you can still make in this situation that would matter is to confuse significance with effect size. – David Ernst Sep 27 '17 at 20:45
  • Well, we can never really know p(H0|data). Not with a small sample and not with a large sample and I don't think Bayesians really get around this, but I'm no expert on Bayesianism. But even with a small amount of data, effect size estimates are important. – Peter Flom Sep 28 '17 at 10:57
  • 1
    Good point about effect size. Is there an analogue to situations like testing for a disease, where the question is Boolean in nature? – Kalev Maricq Sep 28 '17 at 13:57
  • 1
    FWIW, I'm perfectly willing to believe that there is no relationship between a person's weight & whether their SSN is odd or even. In an observational study, these variables will be correlated w/ some other variables, etc, such that there is ultimately a non-0 marginal association. I think the valid point is that, for most things researchers invest their time to investigate, there is some decent reason to suspect that there is a real non-0 effect. – gung - Reinstate Monica Sep 28 '17 at 14:53
  • 1
    @gung you can believe whatever you want, but there is definitely a non-zero relationship between weight and SSN. We don't know anything more about the relationship other than its existence and that it is probably small. – emory Sep 28 '17 at 15:34
  • That's surprising @emory. How do you know that? Has it been studied? Can you cite the paper? Are you referring to a direct relationship, or a marginal one? (Nb, I stated that there would be a marginal association.) – gung - Reinstate Monica Sep 28 '17 at 15:44
  • 1
    I know that weight is a continuous variable. Although we might record it as an integer number of kilograms. Your comment was about an observational study (drawing inferences about a population based on a sample). Since my study is funded by hypothetical dollars it is a population study using infinite precision scales - no need for statistical inference. – emory Sep 28 '17 at 15:57
  • Say you have $X \sim \mathcal{N}(\mu,\sigma)$; then the density $f(X)$ may be non-zero, but $P(X=0) \equiv \lim_{d \to 0} P(0-d < X < 0+d) = 0$. – Sextus Empiricus Sep 28 '17 at 19:25
  • 1
    @KalevMaricq If it's just the prevalence of a disease, then you can use the prevalence. If it is something else (e.g. a logistic regression) you can use the effect size from whatever else it is. – Peter Flom Sep 29 '17 at 12:12
  • @gung I suppose that, since individuals' weights are recorded to the nearest pound (or kg), you *could* get no effect at all in the population, but it seems so unlikely that it can be considered impossible. Divide 300 million integers into two piles of 150 million each. Make the integers rounded from a normal distribution if you like. There are a LOT of ways to do this and a vanishingly small number where the piles will be equal. – Peter Flom Sep 29 '17 at 12:16
  • @emory, so you "know" this because you proved it in a hypothetical study? That's [begging the question](https://en.wikipedia.org/wiki/Begging_the_question). The population, for standard statistical inference, is infinite; it's all the people who *could* be under the given conditions, not the people who *happen* to be alive today. The question of rounding is also a red herring; weights vary by enough for that to be inconsequential & w/ enough data, an arbitrarily small difference could be resolved, even w/ rounding. I also can't tell if you understand my distinction about marginal associations. – gung - Reinstate Monica Sep 29 '17 at 12:53
  • @MartijnWeterings, I don't follow your argument. Are you trying to prove that there has to be a direct relationship b/t all possible variables from first principles? What is "X" supposed to be here? There either is a relationship, or there isn't. The probability can only be either $1$ or $0$. I don't see what the density of a normal distribution has to do with that, but then, I don't follow your argument & you may not even be trying to address my comments. – gung - Reinstate Monica Sep 29 '17 at 13:00
  • Gung, if the sample size approaches the population size, then the SEM becomes incredibly small and we will see, with high probability, a deviation from some theoretic thought. This can be understood by the population itself being a sample from a theoretic distribution. And at this point my thought comes along (which was more like addressing the comments from @emory, or at least, I was trying to fill a gap that I may not have understood): it is very likely that a population is different from a theoretic $H_0$, because these hypotheses are often defined as a point rather than an interval. – Sextus Empiricus Sep 29 '17 at 13:30
  • ...Or in other words. We use a sample to estimate a population and compare the population with theory. The observed population may be "proof" (e.g. p<0.05, as in Peter's example) that the population mean is different from zero. But the observation of the population being different from zero (with high precision) is not necessarily strong proof that the theory is wrong, and actually it is very likely that we do not observe the mean value of some model: P(population mean = theoretic mean) = 0 (in many cases). And the error of the population estimate is not the error that the theory allows. – Sextus Empiricus Sep 29 '17 at 13:40
  • @gung the population - the set of people who have ever been assigned a social security number and are still alive - is very large but definitely finite. It is in theory possible to weigh them all and produce descriptive (not inferential) population statistics. We commonly think of it as infinite b/c it is easier computationally and the difference is negligible, but it is not. – emory Sep 29 '17 at 13:57
  • @emory, that's not the population, that's a sample from an infinite theoretical population. We don't ultimately care only about these exact people, but anyone who could be assigned an SSN. In theory, we could get measures of everyone today & find weights for evens are +1x10^-200, wait 10 years (so that there's a slightly different set of existing people) & find that weights of evens are -1x10^-200. Do we believe that even SSNs have *both* a positive *and* a negative effect on weight? That isn't what people are really trying to infer & neither would tell us if there is a direct relationship. – gung - Reinstate Monica Sep 29 '17 at 14:08
  • 1
    @gung I implicitly took the population as the set of people who currently have an SSN and you are taking the population as the set of people who have ever had or ever will have an SSN. Even then the population is still finite - much less than 2^64. The weigh-ins will take much longer and pose some interesting research questions - like at what life point do we weigh individuals - birth, SSN issuance, or death; and how to obtain weights for the deceased. The measurements will be of no practical value b/c they will not be available until after our present society has radically changed. – emory Sep 29 '17 at 16:21
  • A lot of the confusion here would be clarified if we abandon the finite population reference. Finite populations are not the scientifically interesting targets of inference. Say the even SSNs have higher mean weight in some existing population than the odd SSNs. Is that a scientifically interesting fact? Does it generalize? Is there some causal explanation? Obviously no, no, and no. Except in demography studies, finite populations are not interesting as targets of inference. Replace "population" with "data generating process," and then you have something that is scientifically interesting. – BigBendRegion Dec 06 '19 at 13:55
3

In order to answer this question, you need to define probability. This is because the null hypothesis is either true (except that it almost never is when you consider point null hypotheses) or false. One definition is that my probability describes my personal belief about how likely it is that my data arose from that hypothesis compared to how likely it is that my data arose from the other hypotheses I'm considering. If you start from this framework, then your prior is merely your belief based on all your previous information but excluding the data at hand.

jaradniemi
  • 4,451
  • 13
  • 25
  • Good point. I think my idea of probability is something like "the perfectly rational belief" instead of my personal one. I edited my question to address your points. – Kalev Maricq Sep 28 '17 at 14:07
2

The key idea is that, loosely speaking, you can empirically show something is false (just provide a counterexample), but you cannot show that something is definitely true (you would need to test "everything" to show there are no counterexamples).

Falsifiability is the basis of the scientific method: you assume a theory is correct, and you compare its predictions to what you observe in the real world (e.g. Newton's gravitational theory was believed to be "true", until it was found out that it did not work too well in extreme circumstances).

This is also what happens in hypothesis testing: when P(results|null) is low, the data contradict the theory (or you were unlucky), so it makes sense to reject the null hypothesis. In fact, suppose the null is true; then P(null)=P(null|results)=1, so the only way that P(results|null) can be low is that P(results) is low (tough luck).
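
Spelling out that last step with Bayes' theorem: if $P(\text{null}) = 1$, then also $P(\text{null}\mid\text{results}) = 1$, and

$$P(\text{results}\mid\text{null}) = \frac{P(\text{null}\mid\text{results})\,P(\text{results})}{P(\text{null})} = \frac{1 \cdot P(\text{results})}{1} = P(\text{results}),$$

so a low $P(\text{results}\mid\text{null})$ forces $P(\text{results})$ to be low as well.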

On the other hand, when P(results|null) is high, who knows. Maybe the null is false but P(results) is high, in which case you cannot really do anything, besides designing a better experiment.

Let me reiterate: you can only show that the null hypothesis is (likely) false. So I would say the answer is half of your second point: you don't need to know P(null|results) when P(results|null) is low in order to reject the null, but you cannot say the null is true if P(results|null) is high.

This is also why reproducibility is very important: it would be suspicious to be unlucky five times out of five.
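
As a quick back-of-the-envelope check (assuming independent replications, each tested at an assumed level of $\alpha = 0.05$):

$$P(\text{five rejections in five replications} \mid \text{null}) = 0.05^5 \approx 3 \times 10^{-7}.$$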

BlackBear
  • 173
  • 9
  • "you can empirically show something is false" I believe that rejection of a hypothesis is similarly problematic as acception. A p-value is not equal to the probability that the null hypothesis is false. Otherwise, in the sense of Alexis' comment on the OP we could define $H_0:$ |result|>a. And proof it to be false by finding counter examples when observing result – Sextus Empiricus Sep 28 '17 at 12:24
  • I agree with Martijn. If you can tell me how to determine the probability that the null hypothesis is false, I would consider that a successful answer to my question. – Kalev Maricq Sep 28 '17 at 14:17
  • also note that P(result|null) being small can be normal even if the null is true. For instance if we observe the average in 1000 dice rolls, $\mu_{1000}$, then $P(\mu_{1000}=3.50)$ is small even for a fair die. p-values are constructed differently than P(result|null), and are more precisely made to define the type I error, by describing 'result' as 'the result at which we reject'. In that way we have the type I error as P(null rejected | null true) = P(rejection result|null). So imagine the null is true (hypothetically); then we have a probability P(rejection result|null) of making a type I error. – Sextus Empiricus Sep 28 '17 at 19:45
2

-----------------------------------------------------------------------

(edit: I think it would be useful to put a version of my comment on this question at the top of this answer, as it is much shorter)

The non-symmetric treatment of p(a|b) occurs when it is seen as a causal relationship, like p(result|hypothesis). This computation does not work in both directions: a hypothesis causes a distribution of possible results, but a result does not cause a distribution of hypotheses.

P(result|hypothesis) is a theoretical value based on the causal relationship hypothesis → result.

If p(a|b) expresses a correlation, or an observed frequency (not necessarily a causal relation), then it becomes symmetric. For instance, suppose we write down, in a contingency table, the number of games a sports team wins/loses and the number of games in which the team scores at most / more than 2 goals. Then P(win|score>2) and P(score>2|win) are similar experimental/observational (not theoretic) objects.
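
A toy version of this symmetric case, with invented match records, might look as follows; both conditional frequencies are read off the same table of counts:

```python
# Invented match records: (won, scored_more_than_2) for ten hypothetical games.
games = [
    (True, True), (True, True), (True, False), (False, False), (False, True),
    (True, True), (False, False), (False, False), (True, False), (False, True),
]
wins_and_high = sum(w and s for w, s in games)  # won AND scored > 2
high_scoring = sum(s for _, s in games)         # scored > 2
wins = sum(w for w, _ in games)                 # won

print("P(win | score>2) =", wins_and_high / high_scoring)
print("P(score>2 | win) =", wins_and_high / wins)
```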

---------------------------------------------------------------------

Very simplistic

The expression P(result|hypothesis) seems so simple that it makes one easily think that you can simply reverse the terms. However, 'result' is a stochastic variable, with a probability distribution (given the hypothesis). And 'hypothesis' is not (typically) a stochastic variable. If we make 'hypothesis' a stochastic variable, then it implies a probability distribution of different possible hypotheses, in the same way as we have a probability distribution of different results. (But a result does not give us this probability distribution of hypotheses; it merely changes the distribution, by means of Bayes' theorem.)


An example

Say you have a vase with red/blue marbles in a 50/50 ratio from which you draw 10 marbles. Then you can easily express something like P(outcome|vase experiment), but it makes little sense to express P(vase experiment|outcome). An outcome is (on its own) not the probability distribution of different possible vase experiments.
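
As a sketch, the forward direction is a routine computation (assuming draws with replacement, which the example leaves unspecified):

```python
# P(outcome | vase experiment): for a 50/50 vase, the number of red marbles
# in 10 draws with replacement is Binomial(10, 0.5).
from scipy.stats import binom

print(binom.pmf(4, n=10, p=0.5))  # P(exactly 4 red in 10 draws) ≈ 0.205
```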

If you have multiple possible types of vase experiments, then it is possible to express something like P(type of vase experiment) and use Bayes' rule to get a P(type of vase experiment|outcome), because now the type of vase experiment is a stochastic variable. (Note: more precisely it is P(type of vase experiment|outcome & distribution of types of vase experiments).)

Still, this P(type of vase experiment|outcome) requires a (meta-)hypothesis about a given initial distribution P(type of vase experiment).
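
A minimal sketch of that meta-hypothesis step; the two vase types and the 50/50 prior over them are invented for illustration:

```python
# Bayes' rule over two hypothetical vase types, given an assumed prior.
from scipy.stats import binom

p_red = {"50/50 vase": 0.5, "70/30 red vase": 0.7}  # P(red) under each type
prior = {"50/50 vase": 0.5, "70/30 red vase": 0.5}  # assumed P(type of vase)

k, n = 4, 10  # observed outcome: 4 red marbles in 10 draws
likelihood = {t: binom.pmf(k, n, p) for t, p in p_red.items()}
evidence = sum(likelihood[t] * prior[t] for t in prior)
posterior = {t: likelihood[t] * prior[t] / evidence for t in prior}
print(posterior)  # P(type of vase experiment | outcome), given the assumed prior
```

Without the assumed prior line, the computation cannot even start, which is exactly the point above.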


Intuition

Maybe the expressions below help in understanding the one-directional nature:

X) We can express the probability of X given a hypothesis about X.

thus

1) We can express the probability for results given a hypothesis about the results.

and

2) We can express the probability of a hypothesis given a (meta-)hypothesis about these hypotheses.

It is Bayes' rule that allows us to express an inverse of (1), but we need (2) for this: the hypothesis needs to be a stochastic variable.


Rejection as solution

So we cannot obtain an absolute probability for a hypothesis given the results. That is a fact of life; trying to fight this fact seems to be the origin of not finding a satisfactory answer. The way to find a satisfactory answer is to accept that you cannot get an (absolute) probability for a hypothesis.


Frequentists

In the same way that we cannot accept a hypothesis, we should not (automatically) reject the hypothesis when P(result|hypothesis) is close to zero either. It only means that there is evidence that supports a change of our beliefs, and how we should express our new beliefs depends also on P(result) and P(hypothesis).

When frequentists have some rejection scheme, then that is fine. What they express is not whether a hypothesis is true or false, or the probability for such cases. They are not able to do that (without priors). What they express instead is something about the failure rate (confidence) of their method (given certain assumptions are true).
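
That failure-rate reading can be checked by simulation; a small sketch (the sample sizes and the $\alpha = 0.05$ level are my choices):

```python
# Under a true null, a t-test at alpha = 0.05 rejects in about 5% of repeated
# experiments: the frequentist statement is about this long-run error rate.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
trials, rejections = 10_000, 0
for _ in range(trials):
    a = rng.normal(0, 1, 30)  # both samples come from the same distribution,
    b = rng.normal(0, 1, 30)  # so the null hypothesis is true by construction
    if stats.ttest_ind(a, b).pvalue < 0.05:
        rejections += 1
print(rejections / trials)    # ≈ 0.05
```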


Omniscient

One way to get out of all of this is to eliminate the concept of probability. If you observe the entire population of 100 marbles in the vase, then you can express certain statements about a hypothesis. So, if you become omniscient, the concept of probability is irrelevant and you can state whether a hypothesis is true or not (although probability is then also out of the equation).

Sextus Empiricus
  • 43,080
  • 1
  • 72
  • 161
  • Your vase example makes sense. However, in real life, we almost never know how many marbles of each color are in the vase. I always find myself with a question more like "Are there more red marbles than blue" and my data is that I drew 4 red marbles and 1 blue marble from the vase. Now, I can make assumptions like "there are probably ~100 marbles and each marble is either red or blue with 50% probability" but in real life, I often find myself at a loss for how to non-arbitrarily and non-circularly get these priors. – Kalev Maricq Sep 28 '17 at 14:04
  • That is more an epistemological question than a problem about probability. An expression like P(result|hypothesis) is in a similar way "false"; I mean, it is a hypothetical expression. You can express the probability for a result, given a certain **hypothetical** belief about 'reality'. In the same way as a probability for an experimental outcome is hypothetical, an expression for the probability of some theory (with or without some observation of a result) requires a certain hypothetical belief about 'reality'. Yes, priors are somewhat arbitrary. But so is a hypothesis. – Sextus Empiricus Sep 28 '17 at 14:34
  • Talking about the probabilities. Note that Bayes' rule is about two stochastic variables: P(a|b) P(b) = P(b|a) P(a). You can relate the conditional probabilities. If one of those, P(b|a), is a *causal* relationship, as in 'theory leads to distribution of outcomes', then you can calculate it exactly. Such a case works only because of the one-directional causality. The hypothesis allows you to know (hypothetically) everything you need: the marbles in the vase. The other way around does not work. An experimental outcome, 4 red vs 1 blue, does not *cause* the probability distribution of marbles in the vase. – Sextus Empiricus Sep 28 '17 at 14:44