Let's say, for whatever ungodly reason, I want to precisely calculate p-values by hand, without looking up tables for approximate values.
What is the exact specific formula for arriving at a p-value, given a t-stat and degrees of freedom?
Accurate p-values for t-tests are not really feasible "by hand"; note that you can work out p-values from the cdf, so I will focus on that.
At low, integer degrees of freedom, the cdf can be computed explicitly: integration by parts lets you descend 2 d.f. at a time (the expression growing in length at each step), ending in a finite series of terms that terminates at either the 2 d.f. or the 1 d.f. case, both of which are directly integrable.
Explicit expressions are given for the first few degrees of freedom in the t-distribution page at Wikipedia (see the link below).
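For instance, the closed forms for $\nu=1$ (the Cauchy case) and $\nu=2$ listed there are simple enough to check directly in R; the function names pt1 and pt2 below are just for this illustration:

# Closed-form t cdfs for nu = 1 and nu = 2, per the Wikipedia table
pt1 <- function(t) 1/2 + atan(t)/pi              # nu = 1: the Cauchy cdf
pt2 <- function(t) 1/2 + t/(2*sqrt(t^2 + 2))     # nu = 2
pt1(1.5); pt(1.5, df = 1)    # both about 0.813
pt2(1.5); pt(1.5, df = 2)    # both about 0.864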
Outside that - non-integer d.f. or large d.f. - there's no practical, simple expression for it: nothing you could expect to write down as a short closed-form expression or evaluate on a cheap calculator. (There are some calculator-type approximations, but typically not of great accuracy.)
More generally you need to be able to evaluate something equivalent to the regularized incomplete beta function, which mathematical libraries for a variety of computer languages will have built in, or the hypergeometric function $_2F_1$.
See https://en.wikipedia.org/wiki/Student%27s_t-distribution#Cumulative_distribution_function
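For example, R exposes the regularized incomplete beta function as pbeta, so a sketch of the t cdf via the standard identity looks like this (the name pt_beta is made up for this illustration):

# t cdf via the regularized incomplete beta function I_x(a, b) = pbeta(x, a, b)
pt_beta <- function(t, nu) {
  x <- nu/(nu + t^2)
  tail <- 0.5 * pbeta(x, nu/2, 1/2)   # one-tail probability beyond |t|
  ifelse(t > 0, 1 - tail, tail)
}
pt_beta(-1.9126, 39); pt(-1.9126, 39)   # both about 0.0316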
Alternatively, you can use a function specifically designed to numerically approximate the t-cdf. Such routines use a variety of methods to approximate the cdf of the t at almost any required degrees of freedom, typically for d.f. up to many thousands at least (beyond which the normal approximation will usually be easily sufficient), and will usually give quite a few figures of accuracy over almost all of the range (e.g. 6-8 decimal digits, say), except in the most extreme tails.
There's some mention of specific examples of such approximations here: Mathematically, how are the critical t values calculated?
Per the Wikipedia link above, the cdf of the $t_\nu$ distribution can be written as:
$$\int_{-\infty}^t f(u)\,du = \tfrac{1}{2} + t\frac{\Gamma \left( \tfrac{1}{2}(\nu+1) \right)} {\sqrt{\pi\nu}\,\Gamma \left(\tfrac{\nu}{2}\right)} \, {}_2F_1 \left( \tfrac{1}{2}, \tfrac{1}{2}(\nu+1); \tfrac{3}{2}; -\tfrac{t^2}{\nu} \right),$$
where
$${}_2F_1(a,b;c;z) = 1 + \frac{ab}{c}\frac{z}{1!} + \frac{a(a+1)b(b+1)}{c(c+1)}\frac{z^2}{2!} + \cdots.$$
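As a rough illustration (not how library routines actually work), you can truncate that series numerically and compare against R's pt; hyp2f1 and pt_2f1 below are names made up for this sketch, and the partial sum is only sensible when $|{-t^2/\nu}| < 1$:

# Naive partial sum of 2F1(a, b; c; z); only for modest |z|, not production code
hyp2f1 <- function(a, b, c, z, terms = 200) {
  k <- 0:terms
  ratios <- (a + k[-1] - 1) * (b + k[-1] - 1) / ((c + k[-1] - 1) * k[-1])
  sum(cumprod(c(1, ratios)) * z^k)
}
pt_2f1 <- function(t, nu) {
  const <- exp(lgamma((nu + 1)/2) - lgamma(nu/2)) / sqrt(pi * nu)
  0.5 + t * const * hyp2f1(1/2, (nu + 1)/2, 3/2, -t^2/nu)
}
pt_2f1(-1.9126, 39); pt(-1.9126, 39)   # both about 0.0316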
Let $X\sim t_{df}$ be a t-distributed random variable with degrees of freedom $df$, and let $F_X(x)$ be the CDF of $X$.
A p-value comes from evaluating $F_X$ at the t-stat $t$. We then either take that value as the p-value (one-sided, lower tail), subtract it from $1$ (one-sided, upper tail), double it (two-sided, $t$ in the lower tail), or subtract it from $1$ and then double the difference (two-sided, $t$ in the upper tail):
$$F_X(t), \qquad 1-F_X(t), \qquad 2F_X(t), \qquad 2\bigl(1-F_X(t)\bigr).$$
Unfortunately, there is no elementary closed-form expression for $F_X(x)$, so we would obtain these values by computing them in software (such as the pt function in R), consulting a table, or doing numerical integration with pencil and paper.
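For example, with R's pt and an arbitrary t-stat and d.f. (values chosen only for illustration):

t_stat <- -1.9126; df <- 39       # example values
pt(t_stat, df)                    # F_X(t): one-sided, lower tail
1 - pt(t_stat, df)                # 1 - F_X(t): one-sided, upper tail
2 * pt(t_stat, df)                # 2 F_X(t): two-sided, t in the lower tail
2 * (1 - pt(t_stat, df))          # 2 (1 - F_X(t)): two-sided, t in the upper tail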
Suppose you have $n=40$ observations sampled at random from $\mathsf{Norm}(\mu=50, \sigma=10).$ You know the data are normal, but do not know $\mu$ or $\sigma.$
Sampling in R and data summary:
set.seed(2022)
x = rnorm(40, 50, 10)
summary(x); sd(x)
Min. 1st Qu. Median Mean 3rd Qu. Max.
20.99 43.05 50.20 49.29 56.75 62.12
[1] 8.960198 # sample SD
You use a t test in R to test $H_0: \mu = 52$ against $H_a: \mu < 52.$
t.test(x, mu=52, alt="less")
One Sample t-test
data: x
t = -1.9126, df = 39, p-value = 0.03158
alternative hypothesis: true mean is less than 52
95 percent confidence interval:
-Inf 51.67738
sample estimates:
mean of x
49.29037
You can get the 5% critical value $c = -1.685$ from a printed table of t distributions (using symmetry). So, because $T= -1.9126 < c,$ you know that the null hypothesis is rejected at the 5% level. Such tables were originally computed (mainly) by numerical integration from the densities of Student's t distribution with various degrees of freedom. [Depending on where you find them, recent tables may be from statistical software, such as R or SAS.]
qt(0.05, 39) # Inverse CDF of Student's t
[1] -1.684875
Ordinarily, you can bracket the P-value of such a one sided test between tabled values of the tail probability, here maybe between $0.025$ and $0.05,$ depending on the level of detail of the table.
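For example, the relevant one-sided critical values for 39 d.f. (obtained here from qt, though a printed table would serve equally well) are roughly:

qt(c(0.05, 0.025), 39)   # about -1.685 and -2.02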
In R, you could compute the exact P-value as shown below.
The value $0.03158$ in the output of t.test above is rounded to five places.
pt(-1.9126, 39) # CDF of Student's t
[1] 0.03158175
The exact value above may be computed by R, using a suitable rational approximation of the CDF of $\mathsf{T}(\nu=39)$ or by numerical integration of the PDF; I suspect that details may depend on the degrees of freedom. [See current R documentation.]
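As a sanity check on the numerical-integration route (just a sketch, not how pt is actually implemented), you can integrate the density dt directly:

integrate(dt, lower = -Inf, upper = -1.9126, df = 39)
# about 0.03158, in agreement with pt above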