The essential features of this question are:
- It does not make strong distributional assumptions, lending it a non-parametric flavor.
- It concerns only tail behavior, not the entire distribution.
With some diffidence--because I have not studied my proposal theoretically to fully understand its performance--I will outline an approach that might be practicable. It borrows from the concepts behind the Kolmogorov-Smirnov test, familiar rank-based non-parametric tests, and exploratory data analysis methods.
Let's begin by visualizing the problem. We may plot the empirical distribution functions of the datasets on common axes to compare them:

The black curve shows dataset $A$ (here with $m=50$ values) and the red curve shows dataset $B$ (here with $n=100$ values). The height of a curve at a value $x$ shows the proportion of the dataset with values less than or equal to $x.$
This is a situation where data in the upper half of $A$ consistently exceed the data in the upper half of $B.$ We can see that because, scanning from left to right (low values to high values), the curves last cross around a height of $0.5$ and after that, the curve for $A$ (black) remains to the right of -- that is, at higher values than -- the curve for $B$ (red). That's evidence for a heavier right tail in the distribution from which data $A$ are drawn.
We need a test statistic. It must somehow quantify whether, and by how much, $A$ has a "heavier right tail" than $B.$ My proposal is this:
- Combine the two datasets into a single dataset of $n+m$ values.
- Rank them: this assigns the value $n+m$ to the highest, $n+m-1$ to the next highest, and so on down to the value $1$ for the lowest.
- Weight the ranks as follows:
  - Divide the ranks for $A$ by $m$ and the ranks for $B$ by $n.$
  - Negate the results for $B.$
- Accumulate these weighted ranks in a cumulative sum, beginning with the largest rank and moving down.
- Optionally, normalize the cumulative sum by multiplying all its values by some constant.
Using the ranks (rather than constant values of $1,$ which is another option) weights the highest values, which is where we want to focus attention. This algorithm creates a running sum that goes up when a value from $A$ appears and (due to the negation) goes down when a value from $B$ appears. If there is no real difference in their tails, this random walk should bounce up and down around zero. (This is a consequence of the weighting by $1/m$ and $1/n.$) If one of the tails is heavier, the random walk should initially trend upward when the $A$ tail is heavier and downward when the $B$ tail is heavier.
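To make these steps concrete, here is a minimal sketch in R that carries them out by hand on a tiny made-up dataset (the values of `a` and `b` are arbitrary and serve only as an illustration; the optional normalization is omitted):

#
# Hand computation of the (un-normalized) cranksum for a toy example.
#
a <- c(5, 3)                             # Plays the role of dataset A (m = 2)
b <- c(4, 2, 1)                          # Plays the role of dataset B (n = 3)
m <- length(a); n <- length(b)
pooled <- c(a, b)
r <- rank(pooled)                        # 1 for the lowest up to m+n for the highest
w <- r * c(rep(1/m, m), rep(-1/n, n))    # Divide A's ranks by m; divide B's by n and negate
o <- order(pooled, decreasing=TRUE)      # Positions of the pooled values, largest first
cumsum(w[o])                             # 2.500 1.167 2.667 2.000 1.667 (rounded)

The first entry is positive because the largest pooled value comes from $A$; the cranksum function in the code appended below performs the same computation and adds the $1/\sqrt{n+m+1}$ normalization used in the figures.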
This provides a nice diagnostic plot. In the figure I have normalized the cumulative sum by multiplying all values by $1/\sqrt{n+m+1}$ and indexing them by the numbers $q = 0/(m+n), 1/(m+n), \ldots, (m+n-1)/(m+n).$ I call this the "cranksum" (cumulative rank sum). Here is the first half, corresponding to the upper half of all the data:

There is a clear upward trend, consistent with what we saw in the previous figure. But is it significant?
A simulation of the cranksums under the null hypothesis (of equally heavy tails) will settle this question. Such a simulation creates many datasets of the same sizes as the original $A$ and $B$ (or, almost equivalently, creates many random permutations of the combined dataset) drawn from a common distribution (it does not matter which distribution, provided it is continuous); computes their cranksums; and plots them. Here are the first thousand out of the 40,000 that I made for datasets of sizes $50$ and $100:$

The faint gray jagged curves in the middle form the assemblage of a thousand cranksum plots. The yellow area, bounded by bold curves (the "envelope"), outlines, at each $q,$ the lower $0.0075/2$ and upper $1-0.0075/2$ empirical quantiles of all 40,000 simulated values (the level $0.0075$ is the `alpha` set in the code below). Why these quantiles? Because some analysis of these simulated data showed that only 5% of the simulated curves ever, at some point, go past these boundaries. Thus, because the cranksum plot for the actual data does exceed the upper boundary for some of the initial (low) values of $q,$ it constitutes significant evidence at the $\alpha=0.05$ level that (1) the tails differ and (2) the tail of $A$ is heavier than the tail of $B.$
Of course you can see much more in the plot: the cranksum for our data is extremely high for all values of $q$ between $0$ and approximately $0.23,$ and only then starts dropping, eventually reaching a height of $0$ around $q=0.5.$ Thus it is apparent that at least the upper $23\%$ of the underlying distribution for dataset $A$ consistently exceeds the upper $23\%$ of the underlying distribution for dataset $B,$ and likely the upper $50\%$ of the distribution for $A$ exceeds the upper $50\%$ of the distribution for $B.$
(Because these are synthetic data, I know their underlying distributions, so I can compute that for this example the CDFs cross at $x=1.2149$ at a height of $0.6515,$ implying the upper $34.85\%$ of the distribution for $A$ exceeds that of $B,$ quite in line with what the cranksum analysis is telling us based on the samples.)
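As a check on those numbers, here is one way to locate the crossing point numerically from the two model CDFs used in the code below (a sketch: the bracketing interval $(1, 2)$ is my own choice, based on inspecting the ECDF plot):

#
# Difference between the lognormal and Gamma CDFs that generated the data.
#
delta <- function(x) plnorm(x, meanlog=0, sdlog=1/2) -
                     pgamma(x, shape=20, rate=20/exp(0 + (1/2)^2/2))
x.cross <- uniroot(delta, c(1, 2))$root   # Where the two CDFs cross
c(x.cross, plnorm(x.cross, 0, 1/2))       # Compare with 1.2149 and 0.6515 quoted above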
Evidently it takes a little work to compute the cranksum and run the simulation, but it can be done efficiently: this simulation took two seconds, for instance. To get you started, I have appended the R code used to make the figures.
#
# Testing whether one tail is longer than another.
# The return value is the cranksum, a vector of length m+n.
#
cranksum <- function(x, y) {
m <- length(x)
n <- length(y)
i <- order(c(x,y))
scores <- c(rep(1/m, m), rep(-1/n, n)) * rank(c(x,y))
cumsum(scores[rev(i)]) / sqrt(n + m + 1)
}
#
# Create two datasets from two different distributions with the same means.
#
mu <- 0 # Logmean of `x`
sigma <- 1/2 # Log sd of `x`
k <- 20 # Gamma parameter of `y`
set.seed(17)
y <- rgamma(100, k, k/exp(mu + sigma^2/2)) # Gamma data
x <- exp(rnorm(50, mu, sigma)) # Lognormal data.
#
# Plot their ECDFs.
#
plot(ecdf(c(x,y)), cex=0, col="#00000000", main="Empirical CDFs") # Invisible: sets up the axes only
e.x <- ecdf(x)
curve(e.x(x), add=TRUE, lwd=2, n=1001)
e.y <- ecdf(y)
curve(e.y(x), add=TRUE, col="Red", lwd=2, n=1001)
#
# Simulate the null distribution (assuming no ties).
# Each simulated cranksum is in a column.
#
system.time(sim <- replicate(4e4, cranksum(runif(length(x)), runif(length(y)))))
#
# This alpha was found by trial and error, but that needs to be done only
# once for any given pair of dataset sizes.
#
alpha <- 0.0075
tl <- apply(sim, 1, quantile, probs=c(alpha/2, 1-alpha/2)) # Cranksum envelope
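#
# A sketch of how the trial and error could be automated: for a candidate
# level `a`, compute the fraction of simulated curves that escape the pointwise
# (a/2, 1-a/2) envelope at least once, then pick the candidate whose rate is
# closest to the desired 5%. (`escape.rate` is an illustrative helper only.)
#
escape.rate <- function(a, s) {
  env <- apply(s, 1, quantile, probs=c(a/2, 1 - a/2))
  mean(apply(s > env[2,] | s < env[1,], 2, max))
}
sapply(c(0.005, 0.0075, 0.01), escape.rate, s=sim)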
#
# Compute the chances of exceeding the upper envelope or falling beneath the lower.
#
p.upper <- mean(apply(sim > tl[2,], 2, max))
p.lower <- mean(apply(sim < tl[1,], 2, max))
#
# Include the data with the simulation for the purpose of plotting everything together.
#
sim <- cbind(cranksum(x, y), sim)
#
# Plot.
#
q <- seq(0, 1, length.out=dim(sim)[1])
# The plot region:
plot(0:1/2, range(sim), type="n", xlab = "q", ylab = "Value", main="Cranksum Plot")
# The region between the envelopes:
polygon(c(q, rev(q)), c(tl[1,], rev(tl[2,])), border="Black", lwd=2, col="#f8f8e8")
# The cranksum curves themselves:
invisible(apply(sim[, seq.int(min(dim(sim)[2], 1e3))], 2,
                function(y) lines(q, y, col="#00000004")))
# The cranksum for the data:
lines(q, sim[,1], col="#e01010", lwd=2)
# A reference axis at y=0:
abline(h=0, col="White")