
I learnt that a p-value lower than the alpha value of 0.05 leads us to reject the null hypothesis, and vice versa. It seems to be the opposite for critical values. What, then, is this critical value - how is it calculated, and what does it signify?

Jahanara

2 Answers


Broadly, hypothesis testing (in a more or less Neyman-Pearson framework) works as follows.

You state some null and alternative hypothesis. I'll start by treating the null as a point null (most nulls you'll deal with will be of this form) but the ideas can be adapted to composite nulls with some modest modifications.

You choose a test statistic - which acts as a summary of the sample - and some assumptions (which will enable you to compute the distribution of the test statistic when the null is true). The test statistic should be chosen such that the behaviour of the test statistic is different under the null and alternative.

The task now is to identify a rejection region for the test statistic: the set of values most strongly suggestive of the alternative. As we move away from what's typical under the null toward what's more indicative of the alternative, we regard the values of the test statistic as 'more extreme'. For example, if we were comparing two means, a larger absolute difference in sample means would be more extreme in this sense. We wish to reject the null for the most extreme cases - the samples having the most extreme test statistics. Those form the rejection region.

We now need a border between rejection and non-rejection - the least extreme test statistic we'd still reject for. We choose it so that if the null is true, we won't reject for more than a selected fraction of all possible samples from the population.

When it's a single value, that border value is the critical value. The selected fraction is the significance level, conventionally denoted as $\alpha$.

So if you adopt the rejection rule "reject when the test statistic is at least as extreme as the critical value", you will reject a true null hypothesis no more than the chosen fraction $\alpha$ of the time. If we chose our test statistic well (in that it discriminates well between the null being true and the alternative being true), we should also have good power.
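For concreteness, here's a minimal sketch in R of how a critical value falls out of a chosen significance level, assuming a one-sided test whose statistic is standard normal under the null (the numbers are hypothetical):

alpha <- 0.05
crit <- qnorm(1 - alpha)   # one-sided critical value, about 1.645
z <- 2.1                   # a hypothetical observed test statistic
z >= crit                  # TRUE: the statistic is in the rejection region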

Note that there is no need for p-values anywhere in the above - they weren't even mentioned. Critical values are more 'fundamental' than p-values. However, p-values are quite useful.

What's a p-value? It's the probability of getting a test statistic at least as extreme as the one from our sample if the null hypothesis were true. As such, it represents the smallest significance level at which we could have rejected the null.

Imagine you choose some significance level and reject, but someone reading your report may want to use a smaller significance level. Do you need to list all possible significance levels and the corresponding decision?

No - you can just give a p-value. If it's small, you're far into the most extreme cases; if it's large, you're looking at just the kind of value for your test statistic you'd expect to see if the null were the case rather than the alternative.

Each person reading your report can immediately see whether your p-value is smaller than their own significance level, and so whether your result falls in their own rejection region for your test.
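Continuing the hypothetical z-test sketch from above, the p-value makes that comparison a one-liner:

p <- pnorm(z, lower.tail = FALSE)   # upper-tail p-value for z = 2.1, about 0.018
p <= alpha                          # TRUE: any reader using alpha >= 0.018 would reject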

Let's consider a very simple example. We wish to test whether two samples have been drawn from a common distribution, against the alternative that the population the second sample was drawn from typically tends to produce larger values than the first.

We have no specific distributional model in mind but we're broadly interested in alternatives similar to location-shifts.

[Note that a likelihood-based argument doesn't help us at all here because we don't know enough to impose a specific distributional model (and so we can't specify a likelihood). Nevertheless all the above discussion works. Clearly seeking to motivate everything in terms of likelihood is not going to be adequate for tests in general.]

When one sample is shifted relative to another, you will tend to see one sample "stick out the high end" and the other will tend to "stick out the low end". So one possible test statistic is the number of values from sample 2 higher than any value from sample 1 plus the number of values from sample 1 lower than any value from sample 2.

[If the values at one end are from "the wrong sample" compared to what we would expect with the alternative, we add nothing.]

This is purely a rank-based test and will be nonparametric.
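A minimal sketch of this statistic in R (the function name stick_out is just for illustration):

# count of sample-2 values above the sample-1 maximum, plus
# count of sample-1 values below the sample-2 minimum;
# each count is 0 when that end is from 'the wrong sample'
stick_out <- function(x1, x2) {
  sum(x2 > max(x1)) + sum(x1 < min(x2))
}
stick_out(c(1, 2, 4), c(3, 5, 6, 7))   # ordering A A B A B B B: statistic is 5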

Imagine we have 4 values in sample 2 and 3 values in sample 1. If they were from the same distribution, what would be the chance that all of the largest values would be in sample 2? (Note that our 'sticking out the ends' test statistic would be 7.)

It's a simple matter of counting: under the null, all ${7\choose 3}=35$ ways of assigning 3 of the 7 sorted positions to sample 1 are equally likely, and exactly one of them puts all four sample-2 values on top, so the probability is $1/{7\choose 3} = 1/35$.

In this case there are no more extreme test statistics. The next less extreme possible test statistic is 5 (which also has a 1/35 probability), which comes with the sample ordering A A B A B B B (where A represents an observation from sample 1, B represents an observation from sample 2, and the ordering runs from smallest to largest).

In this case we are free to choose a significance level of 1/35 (2.86%) or 2/35 (5.71%)* - or something higher.

If we choose to use $\alpha=0.0286$, we would only reject when the test statistic was 7 - when all values from sample 2 exceed all values from sample 1. The critical value is 7.

If instead we choose to use $\alpha=0.0571$ we would reject when the test statistic was 5 or 7, when at least 5 values in total "stuck out the ends" in the direction we expected under the alternative. The critical value is 5.

To compute a p-value we need to know the possible statistics; it's not difficult to check all 35 possible sample arrangements (I got R to do this for me):
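A sketch of one way to do that enumeration (the original code wasn't shown; this version uses combn to list which 3 of the 7 sorted positions hold the sample-1 values):

# each column of pos is one arrangement: the positions of sample 1's
# three values among the seven sorted observations
pos <- combn(7, 3)
ts <- apply(pos, 2, function(a) {
    x <- rep("B", 7)
    x[a] <- "A"
    leadA  <- match("B", x) - 1         # sample-1 values below every sample-2 value
    trailB <- 7 - max(which(x == "A"))  # sample-2 values above every sample-1 value
    leadA + trailB
})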

> table(ts)
ts
 0  1  2  3  4  5  7 
10 10  7  4  2  1  1 

And so here are the upper-tail probabilities, which are the p-values for each possible value of the test statistic:

> print(rev(cumsum(rev(table(ts))))/35,3)
     0      1      2      3      4      5      7 
1.0000 0.7143 0.4286 0.2286 0.1143 0.0571 0.0286 
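So, for example, a reader who sees the observed statistic 5 reported with its p-value of 0.0571 can reject precisely when their own significance level is at least 0.0571.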

* ignoring randomized tests.

Glen_b

When your test statistic exceeds the critical value, there is not much null probability left in the more extreme direction beyond it; hence the inverse relationship between the critical value and alpha.

The critical value is the value of your test statistic, say the t-statistic for a t-test of the mean, that you must exceed in order to reject the null hypothesis at your specified alpha level (often 0.05, but not necessarily). If the critical value is 2 and we calculate $t=2.1$, we know we will reject the null. We still have some more work to do to get the p-value (pt is the relevant R function), but we already know the decision.

(Technical point: I say exceed, but we also reject if we get exactly the critical value, equivalent to $p=\alpha$, no matter how unlikely this is.)

The critical value depends on alpha! As we decrease alpha, the critical value will increase because we demand stronger evidence to reject the null hypothesis.
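A small illustration in R, assuming a two-sided one-sample t-test with 20 degrees of freedom (all numbers hypothetical):

alpha <- 0.05
df <- 20
crit <- qt(1 - alpha/2, df)                   # critical value, about 2.086
t_stat <- 2.1                                 # hypothetical observed t-statistic
abs(t_stat) >= crit                           # TRUE: reject at the 5% level
2 * pt(abs(t_stat), df, lower.tail = FALSE)   # the p-value, about 0.049
qt(1 - 0.01/2, df)                            # smaller alpha, larger critical value: about 2.845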

Dave