Let's say, for whatever ungodly reason, I want to precisely calculate p-values by hand, without looking up tables for approximate values.

What is the exact specific formula for arriving at a p-value, given a t-stat and degrees of freedom?

  • There's a previous, largely relevant answer here: [How does one generate the table mapping t-test values to p values?](https://stats.stackexchange.com/a/73515/805), though my present answer gives more details. – Glen_b Feb 14 '22 at 07:47
  • What do you mean by 'the exact mathematical relationship'? Are you looking for some small formula or some algorithm to compute a value given some input? – Sextus Empiricus Feb 14 '22 at 16:50
  • The t-distribution is some multidimensional equivalent of a tangent (https://stats.stackexchange.com/a/365070/164061). In the case of small degrees of freedom this is even very direct; for instance, when $\nu = 1$, $$F(t) = \frac{1}{2} + \frac{1}{\pi} \arctan(t).$$ So you are sort of asking something like 'what is the exact formula for the tangent function?'. At which level do you want to get a formulation for the t-distribution? – Sextus Empiricus Feb 14 '22 at 17:13

3 Answers

7

Accurate p-values for t-tests are not really feasible "by hand". Since any p-value can be worked out from the cdf, I will focus on the cdf.

At very low integer degrees of freedom, the cdf can be computed explicitly. Integration by parts gives a reduction that steps down 2 d.f. at a time, so the expression grows with each step and ends as a finite sum of terms terminating at either the 2 d.f. case or the 1 d.f. case, both of which can be integrated directly.

Explicit expressions are given for the first few degrees of freedom in the t-distribution page at Wikipedia (see the link below).
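For integer degrees of freedom, the reduction described above can be written as a finite trigonometric sum in $\theta = \arctan(t/\sqrt{\nu})$. The following is a minimal Python sketch, assuming the classical odd/even finite sums found in standard handbooks; the function name `t_cdf_integer_df` is made up for illustration. For $\nu = 1$ and $\nu = 2$ it reduces to the closed forms on the Wikipedia page.

```python
import math

def t_cdf_integer_df(t, nu):
    """CDF of Student's t for a positive integer nu, via a finite sum.

    With theta = atan(t / sqrt(nu)), the central probability A = P(|T| < |t|)
    (signed like t) is a finite sum in sin(theta) and cos(theta); then
    F(t) = 1/2 + A/2.
    """
    if nu < 1 or int(nu) != nu:
        raise ValueError("nu must be a positive integer")
    nu = int(nu)
    theta = math.atan(t / math.sqrt(nu))
    s, c = math.sin(theta), math.cos(theta)

    if nu % 2 == 1:
        # Odd nu: A = (2/pi) * [theta + sin * (cos + (2/3) cos^3 + ...)],
        # with powers of cos running up to nu - 2 (empty sum when nu = 1).
        total, term, k = 0.0, c, 1
        while k <= nu - 2:
            total += term
            term *= (k + 1) / (k + 2) * c * c
            k += 2
        a = (2.0 / math.pi) * (theta + s * total)
    else:
        # Even nu: A = sin * [1 + (1/2) cos^2 + (1*3)/(2*4) cos^4 + ...],
        # with powers of cos running up to nu - 2.
        total, term, k = 0.0, 1.0, 0
        while k <= nu - 2:
            total += term
            term *= (k + 1) / (k + 2) * c * c
            k += 2
        a = s * total
    return 0.5 + 0.5 * a

# One-sided (lower-tail) probability for t = -1.9126 with 39 d.f.
p_lower = t_cdf_integer_df(-1.9126, 39)
```

The number of terms grows linearly with $\nu$, which is why this is only pleasant by hand for small degrees of freedom.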

Outside that - non-integer d.f. or large d.f. - there's no practical, simple expression for it (in the sense of something you can expect to easily write down as a short closed form expression or do on a cheap calculator; there are some calculator-type approximations but typically not of great accuracy).

More generally you need to be able to evaluate something equivalent to the regularized incomplete beta function, which mathematical libraries for a variety of computer languages will have built in, or the hypergeometric function $_2F_1$.
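For reference, the connection to the regularized incomplete beta function $I_x(a,b)$ can be written out explicitly. This is the form given on the Wikipedia page, stated for $t > 0$; the lower half of the distribution follows by symmetry:

$$F(t) = 1 - \tfrac{1}{2}\, I_{x(t)}\!\left(\tfrac{\nu}{2}, \tfrac{1}{2}\right), \qquad x(t) = \frac{\nu}{t^2 + \nu}, \qquad t > 0, \qquad F(-t) = 1 - F(t).$$

In particular, the two-sided p-value for an observed $|t|$ is just $2\,(1 - F(|t|)) = I_{x(t)}\!\left(\tfrac{\nu}{2}, \tfrac{1}{2}\right)$.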

See https://en.wikipedia.org/wiki/Student%27s_t-distribution#Cumulative_distribution_function

Alternatively, you can use a function specifically designed to numerically approximate the t-cdf. Computer functions may use a variety of methods to approximate the cdf of the t at almost any required degrees of freedom, typically up to many thousands of d.f. at least (beyond which the normal approximation will usually be easily sufficient). They will usually give quite a few figures of accuracy (e.g. 6-8 decimal digits, say) over almost all of the range, aside from the most extreme tails.

There's some mention of specific examples of such approximations here: Mathematically, how are the critical t values calculated?

Per the above Wikipedia link, the cdf for the $t_\nu$ can be written as:

$$\int_{-\infty}^t f(u)\,du = \tfrac{1}{2} + t\frac{\Gamma \left( \tfrac{1}{2}(\nu+1) \right)} {\sqrt{\pi\nu}\,\Gamma \left(\tfrac{\nu}{2}\right)} \, {}_2F_1 \left( \tfrac{1}{2}, \tfrac{1}{2}(\nu+1); \tfrac{3}{2}; -\tfrac{t^2}{\nu} \right),$$

where, for $|z| < 1$,

$${}_2F_1(a,b;c;z) = 1 + \frac{ab}{c}\frac{z}{1!} + \frac{a(a+1)b(b+1)}{c(c+1)}\frac{z^2}{2!} + \cdots.$$
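As a rough illustration of the two formulas above, here is a Python sketch that sums the power series directly. It is only a sketch under the assumption $t^2 < \nu$ (so that $|z| < 1$ and the series converges); the names `hyp2f1_series` and `t_cdf_series` are invented here and are not library functions.

```python
import math

def hyp2f1_series(a, b, c, z, rel_tol=1e-16, max_terms=100_000):
    """Sum the Gauss hypergeometric power series; only valid for |z| < 1."""
    if abs(z) >= 1.0:
        raise ValueError("the power series requires |z| < 1")
    term = 1.0   # current term of the series, starting with the constant 1
    total = 1.0
    for n in range(max_terms):
        # Ratio of consecutive terms: (a+n)(b+n) / ((c+n)(n+1)) * z
        term *= (a + n) * (b + n) / ((c + n) * (n + 1)) * z
        total += term
        if abs(term) <= rel_tol * abs(total):
            break
    return total

def t_cdf_series(t, nu):
    """t cdf via 1/2 + t * Gamma ratio * 2F1(...); assumes t^2 < nu."""
    if t * t >= nu:
        raise ValueError("this sketch only handles t^2 < nu")
    # Gamma((nu+1)/2) / Gamma(nu/2), computed on the log scale to avoid overflow
    log_ratio = math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
    coef = math.exp(log_ratio) / math.sqrt(math.pi * nu)
    return 0.5 + t * coef * hyp2f1_series(0.5, (nu + 1) / 2, 1.5, -t * t / nu)

# Lower-tail probability for t = -1.9126 with 39 d.f.
p_lower = t_cdf_series(-1.9126, 39)
```

Convergence slows as $t^2/\nu$ approaches 1, so for small $\nu$ and moderately large $|t|$ a different representation is needed.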

Glen_b
  • I'm just looking for the exact formula, regardless as to the feasibility of calculating it by hand. – Johansson McFleppers Feb 14 '22 at 05:29
  • There is no closed-form formula. You want an infinite series? To what end? What can you do with it? This *really* sounds like an X-Y problem. You want a formula for some purpose, but if you explain the purpose we can probably tell you a way to solve your problem that isn't infeasible/useless/impractical. – Glen_b Feb 14 '22 at 05:30
  • There isn't anything I can do with it. I'm not asking because I expect to be calculating it by hand instead of using software. I'm just curious what the formula is. – Johansson McFleppers Feb 14 '22 at 05:32
  • You know that the normal distribution's cdf doesn't have a closed form either, right? There are many ways to represent it: infinite series, infinite continued fractions, etc. Take a look at the Wikipedia page for the [normal distribution](https://en.wikipedia.org/wiki/Normal_distribution#Cumulative_distribution_function). Similarly you can write [infinite series](https://en.wikipedia.org/wiki/Hypergeometric_function#The_hypergeometric_series) for the hypergeometric function, of which the t-cdf is a special case. (That link gives an explicit power series in terms of the arguments, have fun.) – Glen_b Feb 14 '22 at 05:37
  • (Well, a special case of $_2F_1$ plus a scale and shift) – Glen_b Feb 14 '22 at 07:22
  • I have added the $t$ cdf in terms of $_2F_1$ and its power series details in turn, which come directly from the links I had previously offered. It's an example of an "exact" mathematical expression in the general case. Some exact small integer df results can be found in the links as well. – Glen_b Feb 14 '22 at 07:30
  • If I had to compute this by hand I would first select the least painful approach for the specific parameters and apply it. In many cases, it would likely be numerical integration of the density function using Simpson's Rule. I do this sort of thing all the time for quick one-off mental calculations of approximate p-values: it works very well for smooth, nearly-symmetric unimodal distributions. The hypergeometric function power series would usually be a poor choice because, unless $\nu$ is large, it needs many terms. – whuber Feb 14 '22 at 14:40
  • Yeah, it's a good thing to mention; I also sometimes use Simpson's rule (or other such numerical integration, like Weddle's rule) when working by hand, particularly if I have nothing better than a calculator or pencil and paper. But without a lot of effort it's not "accurate" in the way that a computer implementation is. A few significant digits is often sufficient for a single use, but it is not a general substitute for a computer implementation where you don't know how it might be used. I may add a small note to my answer. – Glen_b Feb 15 '22 at 21:14
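The Simpson's rule approach mentioned in these comments might be sketched as follows in Python. This is an illustrative sketch, not anyone's actual method: it assumes the standard $t_\nu$ density, integrates from $0$ to $|t|$ and uses symmetry, and the names `t_pdf` and `t_cdf_simpson` and the default panel count are arbitrary choices.

```python
import math

def t_pdf(u, nu):
    """Density of Student's t with nu degrees of freedom."""
    log_norm = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
                - 0.5 * math.log(nu * math.pi))
    return math.exp(log_norm - (nu + 1) / 2 * math.log1p(u * u / nu))

def t_cdf_simpson(t, nu, n=200):
    """Approximate the t cdf with composite Simpson's rule on [0, |t|]."""
    if n <= 0 or n % 2:
        raise ValueError("n must be a positive even number of subintervals")
    b = abs(t)
    h = b / n
    total = t_pdf(0.0, nu) + t_pdf(b, nu)       # endpoints, weight 1
    for i in range(1, n):
        weight = 4 if i % 2 else 2              # interior weights 4, 2, 4, ...
        total += weight * t_pdf(i * h, nu)
    area = total * h / 3                        # integral of the pdf over [0, |t|]
    return 0.5 + math.copysign(area, t)         # symmetry: F(t) = 1/2 +/- area

# By hand one would use far fewer panels, e.g. n=4, and accept a few digits.
p_rough = t_cdf_simpson(-1.9126, 39, n=4)
p_fine = t_cdf_simpson(-1.9126, 39)
```

Integrating over the finite interval $[0, |t|]$ rather than the infinite tail keeps the integrand well behaved, at the cost of computing a small p-value as the difference of two numbers close to $0.5$.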
2

Let $X\sim t_{df}$ be a t-distributed random variable with degrees of freedom $df$, and let $F_X(x)$ be the CDF of $X$.

A p-value comes from evaluating $F_X$ at the t-stat $t$, giving $F_X(t)$. Depending on the alternative hypothesis, we then either take that value as the p-value (lower-tailed), subtract it from $1$ (upper-tailed), double it (two-sided, when $t < 0$), or subtract it from $1$ and then double the difference (two-sided, when $t > 0$).

$$F_X(t)\\ 1-F_X(t)\\ 2F_X(t)\\ 2(1-F_X(t)) $$

Unfortunately, there is no simple closed-form expression for $F_X(x)$ at general degrees of freedom, so we would obtain these values by calculating them in software (such as the pt function in R), consulting a table, or doing numerical integration with pencil and paper.
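Once a cdf value $F_X(t)$ is in hand from any of those sources, the four cases above collapse to three p-values, since the two-sided one is twice the smaller tail. A small Python sketch (the function name and dictionary keys are just illustrative):

```python
def p_values_from_cdf(F):
    """Turn a cdf value F = F_X(t) into lower, upper, and two-sided p-values."""
    if not 0.0 <= F <= 1.0:
        raise ValueError("F must be a probability")
    lower = F                  # H_a: mu < mu_0
    upper = 1.0 - F            # H_a: mu > mu_0
    # Doubling the smaller tail covers both 2F (t < 0) and 2(1 - F) (t > 0).
    two_sided = 2.0 * min(lower, upper)
    return {"less": lower, "greater": upper, "two_sided": two_sided}

# Example with a cdf value of 0.03158175 (a t-stat in the lower tail)
p = p_values_from_cdf(0.03158175)
```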

Dave
1

Suppose you have $n=40$ observations sampled at random from $\mathsf{Norm}(\mu=50, \sigma=10).$ You know the data are normal, but do not know $\mu$ or $\sigma.$

Sampling in R and data summary:

set.seed(2022)
x = rnorm(40, 50, 10)
summary(x);  sd(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  20.99   43.05   50.20   49.29   56.75   62.12 
[1] 8.960198   # sample SD

You use a t test in R to test $H_0: \mu = 52$ against $H_a: \mu < 52.$

t.test(x, mu=52, alt="less")

        One Sample t-test

data:  x
t = -1.9126, df = 39, p-value = 0.03158
alternative hypothesis: true mean is less than 52
95 percent confidence interval:
      -Inf 51.67738
sample estimates:
mean of x 
 49.29037 

You can get the 5% critical value $c = -1.685$ from a printed table of t distributions (using symmetry). So, because $T= -1.9126 < c,$ you know that the null hypothesis is rejected at the 5% level. Such tables were originally computed (mainly) by numerical integration from the densities of Student's t distribution with various degrees of freedom. [Depending on where you find them, recent tables may be from statistical software, such as R or SAS.]

qt(0.05, 39)   # Inverse CDF of Student's t
[1] -1.684875
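A critical value is just the cdf inverted, so any accurate cdf routine can be turned into a quantile function by root finding. The Python sketch below uses plain bisection; it is not how R's `qt` is implemented, and the names `quantile_by_bisection` and `cauchy_cdf` are hypothetical. To stay self-contained it is demonstrated on the 1 d.f. case, where the cdf is $\tfrac12 + \arctan(t)/\pi$ and the answer can be checked against $\tan(\pi(p - \tfrac12))$.

```python
import math

def quantile_by_bisection(cdf, p, lo=-1e6, hi=1e6, tol=1e-12, max_iter=200):
    """Solve cdf(q) = p for q by bisection, for a continuous increasing cdf."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if not cdf(lo) < p < cdf(hi):
        raise ValueError("initial bracket [lo, hi] does not contain the quantile")
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if cdf(mid) < p:
            lo = mid          # quantile lies to the right of mid
        else:
            hi = mid          # quantile lies at or to the left of mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

def cauchy_cdf(t):
    """cdf of Student's t with 1 degree of freedom."""
    return 0.5 + math.atan(t) / math.pi

# 5% lower critical value for 1 d.f.; compare with tan(pi * (0.05 - 0.5))
q = quantile_by_bisection(cauchy_cdf, 0.05)
```

Passing a 39 d.f. cdf in place of `cauchy_cdf` would give the analogue of the `qt(0.05, 39)` call above.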

Ordinarily, you can bracket the P-value of such a one sided test between tabled values of the tail probability, here maybe between $0.025$ and $0.05,$ depending on the level of detail of the table.

In R, you could compute the exact P-value as shown below. The value $0.03158$ in the output of t.test is rounded to five places.

pt(-1.9126, 39)  # CDF of Student's t
[1] 0.03158175

The exact value above may be computed by R, using a suitable rational approximation of the CDF of $\mathsf{T}(\nu=39)$ or by numerical integration of the PDF; I suspect that details may depend on the degrees of freedom. [See current R documentation.]

BruceET
  • Maybe see @whuber's answer on [this page](https://stats.stackexchange.com/questions/52341/formula-to-generate-critical-t-values-for-t-test-instead-of-using-a-look-up-arr). – BruceET Feb 14 '22 at 07:46