
I'm trying to evaluate the performance of a metric learning model. The model takes labelled image inputs and maps them to vectors on an N-dimensional unit sphere. The goal is for same-labelled images to map to vectors that are close together (high dot product) and for different-labelled images to map to vectors that are far apart (low dot product).

To evaluate the performance of this model, I'm running the model over a corpus of held-out data, and I'm interested in quantile statistics on the distribution of dot products between pairs of output vectors. Questions of interest are things like:

  • What is the threshold $t$ such that 99% of equal-labelled pairs of vectors have a dot product greater than $t$?
  • What is the threshold $t$ such that 99% of unequal-labelled pairs of vectors have a dot product less than $t$?

My evaluation dataset is small enough that I can directly compute dot products for all matching pairs of inputs, but there are enough non-matching pairs of inputs that it's very expensive to compute dot products for all non-matching pairs. Instead, for non-matches I'm estimating by taking a random sample from the space of non-matching pairs. This obviously introduces some error, but that's acceptable as long as the error bounds are well-understood. The problem is that it's not clear how to choose an appropriate sample size, nor is it clear how to characterize the error bounds for a given sample size.
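For concreteness, here is roughly what the non-match sampling looks like (a minimal sketch; `emb` and `labels` are placeholder names for the embedding matrix and label array):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_nonmatch_dots(emb, labels, n_samples):
    """Dot products for a random sample of non-matching pairs.

    emb: (n, d) array of unit-norm embeddings; labels: length-n array.
    """
    n = len(labels)
    dots = np.empty(n_samples)
    k = 0
    while k < n_samples:
        i, j = rng.integers(0, n, size=2)
        if labels[i] != labels[j]:  # reject matching (or identical) pairs
            dots[k] = emb[i] @ emb[j]
            k += 1
    return dots

# Empirical threshold t: 99% of sampled non-matching dot products lie below t.
# dots = sample_nonmatch_dots(emb, labels, 100_000)
# t = np.quantile(dots, 0.99)
```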

Thus, my main question is this: If I want to estimate a quantile $q$ of a population by taking a sample of size $N$, is there a simple way to compute $N$ such that my quantile estimate is within a desired error bound? Alternatively, is there a better way to estimate (error bounds of) quantile statistics via sampling?

For my specific application, it seems likely that there are ways to take advantage of the specific structure of the samples (e.g., the fact that they're dot products between pairs of vectors from a well-known set). I'm interested in answers that take advantage of that structure, but I'm also interested in the more general question of how to think about estimating quantile statistics from an unknown distribution.

ssanderson
  • The analysis of the median in my post at https://stats.stackexchange.com/a/86804/919 applies *mutatis mutandis* to give exact distributions of any sample quantile. Invert these to find minimal sample sizes. The result is known as a *nonparametric tolerance limit.* – whuber Dec 27 '21 at 20:58
  • @whuber thanks for the pointer. I think I understand how the argument at the top of stats.stackexchange.com/a/86804/919, suitably modified, would show that the distribution of the sample $q$-quantile should look like $\mathrm{Beta}(nq + 1, n(1-q) + 1)$. Is that right? If so, I think that implies that for large $n$ the sample quantile should be approximately normal, with variance approximately $q(1-q)/(n f(\mu)^2)$ (agreeing with Avraham's answer below). If I knew $f(\mu)$, then I think I could use that to estimate an appropriate $n$, but isn't $f(\mu)$ the thing I'm trying to estimate here already? – ssanderson Dec 28 '21 at 15:20
  • You only need to estimate $f$ in a vicinity of its $q^\text{th}$ quantile. That can be done by means of the order statistics near that quantile. – whuber Dec 28 '21 at 15:28
  • Doesn't that introduce a chicken-and-egg problem though? My goal is to pick a sample size to use to estimate the quantile, but it seems like I need an estimate of the quantile in order to pick the sample size. Would your suggestion be to use an iterative process? E.g., pick a sample size, use it to make an estimate of f, use that to estimate the error, and then repeat with a larger sample if the error estimate is higher than desired? – ssanderson Dec 28 '21 at 15:37
  • As in *all* sample size estimation problems, you have to make guesses and you have the same chicken-and-egg situation. Most of them turn out to be iterative for just that reason. What we have accomplished so far, though, is to identify *what property is important to guess at* and *how to measure it.* Those are important considerations and getting clear about them reflects a useful advance in your thinking. – whuber Dec 29 '21 at 20:22

2 Answers


In finance, quantiles are used to calculate a metric called value-at-risk (VaR). A common technique for backtesting VaR is a coverage test, such as the Kupiec test. These tests rely on confidence-interval estimates for quantiles, which you can apply to your problem. The simplest formulas are based on the binomial distribution; see the description at https://www.value-at-risk.net/backtesting-coverage-tests/
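Not the Kupiec test itself, but a sketch of the underlying binomial idea applied to the question's setting: a distribution-free confidence interval for the $q$-quantile built from order statistics (`dots` is assumed to hold the sampled dot products):

```python
import numpy as np
from scipy.stats import binom

def quantile_ci(samples, q, conf=0.95):
    """Distribution-free CI for the q-quantile from order statistics.

    The count of samples below the true q-quantile is Binomial(n, q),
    so order-statistic ranks bracketing that count give a CI.
    """
    x = np.sort(samples)
    n = len(x)
    alpha = (1 - conf) / 2
    lo = int(binom.ppf(alpha, n, q))          # lower rank (1-based)
    hi = int(binom.ppf(1 - alpha, n, q)) + 1  # upper rank (1-based)
    return x[max(lo - 1, 0)], x[min(hi - 1, n - 1)]

# e.g. lo, hi = quantile_ci(dots, 0.99)
# If the interval is too wide, enlarge the sample and repeat.
```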

Aksakal

Let $n$ be the number of observations and let $p$ be the percentile of interest. Let $X_{[np]}$ be the order statistic in the $np$th slot; in other words, the empirical quantile value. Provided the density is positive and continuous at the true quantile, it has been shown (1, 2) that the empirical quantile converges in distribution to a normal distribution whose mean is the true quantile value and whose variance is $\frac{p(1-p)}{n\left[f_X(x)\right]^2}$. In other words:

$$ \sqrt{n}\left(X_{[np]} - x\right) \xrightarrow{d} X \sim \mathcal{N}\left(0, \frac{p(1-p)}{\left(f_X(x)\right)^2}\right) $$

where $f_X(x)$ is the true density evaluated at the quantile. If you know the underlying distributional form, you can calculate it using the empirical quantile as an estimator of the true quantile. If not, I'm not sure how kosher it is to plug in a density estimate (e.g., a kernel density estimate), but I don't see a better choice.

That said, this gives you an estimate of the variance of the limiting normal, which you can shrink as needed by increasing $n$.

This may be somewhat easier than implementing Dr. Huber's complete method.
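A minimal sketch of this plug-in calculation, using a kernel density estimate in place of the unknown density (`dots` is a placeholder for the sampled dot products):

```python
import numpy as np
from scipy.stats import gaussian_kde

def quantile_stderr(samples, p):
    """Asymptotic standard error of the empirical p-quantile."""
    n = len(samples)
    x_p = np.quantile(samples, p)           # empirical quantile
    f_hat = gaussian_kde(samples)(x_p)[0]   # density estimate at x_p
    return np.sqrt(p * (1 - p) / n) / f_hat

# e.g. se = quantile_stderr(dots, 0.99)
# Sample size needed for a target standard error, holding f_hat fixed:
# n_needed = p * (1 - p) / (target_se * f_hat) ** 2
```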

Avraham
  • The nonparametric solution is more accurate, general, and simpler than this, because the count of sample values exceeding any given quantile of the underlying distribution has a Binomial distribution, thereby reducing this question to that of sampling a Binomial variable, *provided the error bound is framed in terms of the percentage and not the value.* – whuber Dec 27 '21 at 22:02
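To make that reduction concrete, here is a sketch of a sample-size search based on it, with the error bound framed in quantile level (the empirical $q$-quantile should fall between the true $(q-\delta)$- and $(q+\delta)$-quantiles with a given confidence). It assumes a continuous distribution of dot products, and the numbers are illustrative:

```python
from math import ceil
from scipy.stats import binom

def coverage(n, q, delta):
    """P(empirical q-quantile lies between the true (q-delta)- and
    (q+delta)-quantiles). Requires 0 < q - delta and q + delta < 1."""
    k = ceil(n * q)  # 1-based rank of the order statistic used as estimate
    # X_(k) <= x_p iff at least k samples fall below x_p, a Binomial(n, p)
    # event, so the coverage is a difference of binomial CDFs.
    return binom.cdf(k - 1, n, q - delta) - binom.cdf(k - 1, n, q + delta)

def min_sample_size(q, delta, conf=0.95, n_max=10**7):
    n = 100
    while n <= n_max:
        if coverage(n, q, delta) >= conf:
            return n  # coarse doubling search; refine by bisection if needed
        n *= 2
    raise ValueError("no n <= n_max achieves the requested coverage")

# e.g. min_sample_size(0.99, 0.005) finds an n for which the empirical
# 99th percentile lies between the true 98.5th and 99.5th percentiles
# with 95% confidence.
```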