
I have been given quantiles (min, 25%, med, 75%, max) for items of data, along with the size of the data n. From these pieces of information I would like to obtain a random sample of data points.

Apart from the trivial solutions where n ≤ 5, is there any way of doing this?

My attempt at a solution:

After some research, I believe my best option is to obtain a distribution from these quantiles and then use inverse transform sampling to draw n items from it. This would give me n random data points that roughly agree with the given quantiles.

However, I am struggling to find digestible reading material on how to obtain this distribution. From domain knowledge, I suspect it will be highly negatively skewed (Gumbel minimum / minimum extreme value distribution).

Here are some related threads:

Estimating a distribution based on three percentiles

Estimate distribution from 4 quantiles

https://www.johndcook.com/blog/2010/01/31/parameters-from-percentiles/

Stephan Kolassa
JDraper

1 Answer


The raw quantiles do not uniquely define a distribution, unless you have additional information, such as knowing that the data are normal. In that case, the question becomes whether the quantiles are actually consistent with a normal distribution.

I would recommend that you draw

  • $\frac{n}{4}$ data points that are uniformly distributed in $[q_0, q_{.25}]$
  • $\frac{n}{4}$ data points that are uniformly distributed in $[q_{.25}, q_{.5}]$
  • $\frac{n}{4}$ data points that are uniformly distributed in $[q_{.5}, q_{.75}]$
  • $\frac{n}{4}$ data points that are uniformly distributed in $[q_{.75}, q_1]$
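The four uniform draws above amount to inverse transform sampling from a piecewise-linear quantile function. Here is a minimal sketch of that reading, assuming linear interpolation between the five given quantiles; the names `probs`, `u` and `x` are only illustrative. Unlike drawing exactly $\frac{n}{4}$ points per interval, the number of points falling in each interval is random here, which also means $n$ need not be divisible by 4.

```r
# Five-number summary (min, 25%, median, 75%, max) and matching probabilities
quantiles <- c(0, 2, 6, 12, 20)
probs <- c(0, 0.25, 0.5, 0.75, 1)

n <- 1003  # deliberately not a multiple of 4

set.seed(1)

# Inverse transform sampling: draw uniform probabilities, then map them
# through the linearly interpolated quantile function
u <- runif(n)
x <- approx(x = probs, y = quantiles, xout = u)$y

# Sample quantiles should lie roughly at the values we started from
quantile(x, probs)
```

Within each interval the points are still uniformly distributed, so for large $n$ the histogram has the same step shape as with the four separate draws.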

If your $n$ is large and the distances between the quantiles vary a lot, then this may yield a somewhat "unnatural" histogram:

[Figure: histogram of the simulated sample]

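nn <- 1e6                          # total sample size (divisible by 4)
quantiles <- c(0, 2, 6, 12, 20)    # min, 25%, median, 75%, max

set.seed(1)

# Draw nn/4 uniform points between each pair of adjacent quantiles
xx <- c(
    runif(nn/4, quantiles[1], quantiles[2]),
    runif(nn/4, quantiles[2], quantiles[3]),
    runif(nn/4, quantiles[3], quantiles[4]),
    runif(nn/4, quantiles[4], quantiles[5]))

hist(xx)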

If this is a problem for you, then you may want to prespecify a distribution family, fit it to the quantiles provided, and sample from the fitted distribution, as you outline in your question. Alternatively, try fitting a kernel density estimate to your quantiles and sampling from that.
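As one possible illustration of the "prespecify and fit" route, here is a rough sketch using the Gumbel minimum family mentioned in the question. Its quantile function is $Q(p) = \mu + \beta \log(-\log(1 - p))$, which is linear in $\mu$ and $\beta$, so a least-squares fit to the three interior quantiles reduces to a simple regression. The values in `q_given` are hypothetical (a mirrored, negatively skewed version of the example above), and ignoring the minimum and maximum in the fit is an assumption of this sketch, not a recommendation.

```r
# Hypothetical negatively skewed summary: min, 25%, median, 75%, max
q_given <- c(0, 8, 14, 18, 20)
p_inner <- c(0.25, 0.5, 0.75)

# Gumbel-minimum quantile function: Q(p) = mu + beta * log(-log(1 - p)).
# Regressing the interior quantiles on z gives mu (intercept) and beta (slope).
z <- log(-log(1 - p_inner))
fit <- lm(q_given[2:4] ~ z)
mu <- unname(coef(fit)[1])
beta <- unname(coef(fit)[2])

set.seed(1)
n <- 1000

# Inverse transform sampling from the fitted distribution
x_gumbel <- mu + beta * log(-log(1 - runif(n)))

# Compare sample quantiles with the given interior quantiles
quantile(x_gumbel, p_inner)
```

Note that the fitted distribution is unbounded, so some sampled values can fall outside the given minimum and maximum. If that matters, you could truncate or redraw such values, at the cost of shifting the quantiles slightly.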

Stephan Kolassa