I have the 10th, 25th, 50th, 75th, and 90th percentiles of a probability distribution, together with its mean and standard deviation. I would like to recover a continuous distribution, drawn from a flexible family, that matches or closely approximates these data points.

One possibility is to let the density be flat between the percentiles and choose the 0th and 100th percentiles to match the mean and variance. Unfortunately, the 0th and 100th percentiles obtained this way are not reasonable.
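
For concreteness, the flat-density procedure can be sketched as follows. This is only an illustration under stated assumptions: the density is uniform on each interval between consecutive percentiles, and the two unknown endpoints are solved from the mean and second-moment equations (the mean equation is linear, so the second-moment equation reduces to a quadratic). The function name `flat_density_endpoints` is hypothetical, and the inputs are the example values given further down in this question.

```python
import math

# Percentile levels and values from the example in the question.
levels = [0.10, 0.25, 0.50, 0.75, 0.90]
values = [-0.89, -0.20, 0.08, 0.33, 0.71]
target_mean, target_sd = -0.21, 2.25


def flat_density_endpoints(levels, values, mean, sd):
    """Solve for (p0, p100) of a piecewise-uniform density matching mean and sd."""
    # Raw moments contributed by the interior segments, which are fully known.
    core_m1 = core_m2 = 0.0
    for k in range(len(values) - 1):
        mass = levels[k + 1] - levels[k]
        lo, hi = values[k], values[k + 1]
        core_m1 += mass * (lo + hi) / 2.0
        core_m2 += mass * (lo * lo + lo * hi + hi * hi) / 3.0

    w_lo, w_hi = levels[0], 1.0 - levels[-1]  # mass below p10 and above p90
    q_lo, q_hi = values[0], values[-1]
    second_moment = sd * sd + mean * mean

    # Mean equation is linear in the endpoints: p100 = c0 + c1 * p0.
    c1 = -w_lo / w_hi
    c0 = (2.0 * (mean - core_m1) - w_lo * q_lo - w_hi * q_hi) / w_hi

    # Substituting into the second-moment equation gives a quadratic in p0.
    qa = w_lo / 3.0 + w_hi * c1 * c1 / 3.0
    qb = w_lo * q_lo / 3.0 + w_hi * (q_hi * c1 + 2.0 * c0 * c1) / 3.0
    qc = (w_lo * q_lo * q_lo / 3.0
          + w_hi * (q_hi * q_hi + q_hi * c0 + c0 * c0) / 3.0
          + core_m2 - second_moment)
    disc = qb * qb - 4.0 * qa * qc
    if disc < 0:
        raise ValueError("no real endpoints match these moments")

    p0 = (-qb - math.sqrt(disc)) / (2.0 * qa)  # smaller root: lower endpoint
    p100 = c0 + c1 * p0
    if not (p0 < q_lo and p100 > q_hi):
        raise ValueError("endpoints fall inside the known percentiles")
    return p0, p100


p0, p100 = flat_density_endpoints(levels, values, target_mean, target_sd)
print(f"p0 = {p0:.3f}, p100 = {p100:.3f}")
```

For the example values, this puts the endpoints at roughly $-10.2$ and $5.6$, far outside the range spanned by $p_{10}$ and $p_{90}$.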

I have to do this many times, but just for concreteness here is one example:

$$p_{10}=-0.89, \quad p_{25}= -0.20, \quad p_{50}= 0.08, \quad p_{75}= 0.33, \quad p_{90}= 0.71,$$

with

$$ \text{mean} = -0.21 \quad\text{and}\quad\text{std dev}=2.25.$$
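
As a rough check of how far this example is from a normal shape, the sketch below compares the observed percentiles with those of a normal distribution that has the same mean and standard deviation, and computes the standard deviation a normal would need in order to reproduce the observed interquartile range. It uses only `statistics.NormalDist` from the Python standard library; the variable names are illustrative.

```python
from statistics import NormalDist

levels = [0.10, 0.25, 0.50, 0.75, 0.90]
observed = [-0.89, -0.20, 0.08, 0.33, 0.71]
mean, sd = -0.21, 2.25

# Percentiles implied by a normal with the reported mean and sd.
normal = NormalDist(mu=mean, sigma=sd)
implied = [normal.inv_cdf(p) for p in levels]

# For a normal, IQR = 2 * z_0.75 * sd, so this is the sd that the
# observed interquartile range alone would suggest.
iqr = observed[3] - observed[1]
z75 = NormalDist().inv_cdf(0.75)
sd_from_iqr = iqr / (2.0 * z75)

for p, obs, imp in zip(levels, observed, implied):
    print(f"p{int(round(p * 100))}: observed {obs:+.2f}, normal {imp:+.2f}")
print(f"sd implied by IQR: {sd_from_iqr:.3f} (reported sd: {sd})")
```

The interquartile range alone would suggest a standard deviation of about 0.39, far below the reported 2.25, and the mean ($-0.21$) lies just below the 25th percentile ($-0.20$). Both point to a distribution with very heavy tails and a long left tail.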

I have tried the generalized normal distribution and the exponentially modified Gaussian distribution, but neither seems able to approximate the percentiles well enough.

I have also tried the method proposed in the answer to this question, using many different distributions besides the standard normal. This approximates the percentiles well, but the standard deviation is then always underestimated.

Any suggestions would be very welcome!

mzp
  • Can the 0% and 100% percentiles be $-\infty$ and $\infty$ respectively? I wouldn't try to rebuild those percentiles, as many well-known distributions behave this way – David Jul 04 '19 at 07:28
  • I am afraid that so little data will not give you much information unless you restrict your choices to a very particular set of possible distributions – David Jul 04 '19 at 07:30
  • Yes, in principle the support could be the real line, but I think it would be sensible to impose bounds if that is going to help to match the moments. I am not sure I understand what you mean by your second comment. Could you elaborate? – mzp Jul 04 '19 at 13:30
  • But the 0% and 100% percentiles are often hard to estimate. With my second comment, I mean that, unless you already have a pretty clear idea of what distribution family the data could come from, you won't be able to guess it from so little information – David Jul 04 '19 at 13:38
  • My prior was the same: many distributions should be able to deliver these numbers, so they don't identify the distribution. But my prior has shifted substantially: I've tried pretty much all suitable distributions I could find and none are able to get close. I either get the percentiles or the standard deviation right, never both to a reasonable degree. – mzp Jul 04 '19 at 13:42
  • Well, we don't know the sample size, so it's hard to know "how far is too far". Or do you need the quantiles to match EXACTLY the values you've given? – David Jul 04 '19 at 13:44
  • I have not been able to get a distribution to be less than 10% away from all these numbers. [With a step function I can choose $p_0$ and $p_{100}$ to match all exactly, but then $p_0$ and $p_{100}$ are not reasonable, so I am trying to not choose $p_0$ and $p_{100}$ that way and instead impose some functional form on the distribution.] My guess is that I need a lot of kurtosis and none of the functions I used are able to deliver. – mzp Jul 04 '19 at 13:50
  • By the way, I actually know the sample size, is that helpful? – mzp Jul 04 '19 at 13:54
  • Bigger sample sizes give you reasons to be more precise about the real location of quantiles. For example, if the sample size were small, you could say that a normal $N(-0.21, 2.25^2)$ does well enough. If the sample size is big, the differences start to seem more important – David Jul 04 '19 at 14:04
  • I see, thanks. The sample size is of the order of 100 thousand. – mzp Jul 04 '19 at 14:13
  • As [@David](https://stats.stackexchange.com/users/238499/david) mentioned, having an idea of the distribution family is going to be the first step. I'd then look for Quantile-matching estimation methods for those probable distributions. – dangiankit Jul 08 '19 at 02:35
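
Following the comments about letting the support be the whole real line, one possible construction (not proposed by anyone in this thread, and only a sketch) keeps the density flat between the known percentiles and attaches exponential tails below $p_{10}$ and above $p_{90}$. The two tail scales are then solved from the mean and second-moment equations, so the five percentiles are matched exactly and the mean and standard deviation are matched as well. The function names `fit_exponential_tails` and `quantile` are hypothetical, and the exponential tail shape is an assumption chosen for tractability. The resulting density is generally discontinuous at the percentiles.

```python
import math

levels = [0.10, 0.25, 0.50, 0.75, 0.90]
values = [-0.89, -0.20, 0.08, 0.33, 0.71]
target_mean, target_sd = -0.21, 2.25


def fit_exponential_tails(levels, values, mean, sd):
    """Return (m_lo, m_hi): mean excess of the lower and upper exponential tails."""
    # Raw moments of the flat core between the outer percentiles.
    core_m1 = core_m2 = 0.0
    for k in range(len(values) - 1):
        mass = levels[k + 1] - levels[k]
        lo, hi = values[k], values[k + 1]
        core_m1 += mass * (lo + hi) / 2.0
        core_m2 += mass * (lo * lo + lo * hi + hi * hi) / 3.0

    w_lo, w_hi = levels[0], 1.0 - levels[-1]
    q_lo, q_hi = values[0], values[-1]
    second_moment = sd * sd + mean * mean

    # Mean equation is linear: m_hi = d0 + d1 * m_lo.
    d1 = w_lo / w_hi
    d0 = (mean - core_m1 - w_lo * q_lo - w_hi * q_hi) / w_hi

    # Second-moment equation becomes a quadratic in m_lo.
    qa = 2.0 * w_lo + 2.0 * w_hi * d1 * d1
    qb = -2.0 * w_lo * q_lo + w_hi * (2.0 * q_hi * d1 + 4.0 * d0 * d1)
    qc = (w_lo * q_lo * q_lo
          + w_hi * (q_hi * q_hi + 2.0 * q_hi * d0 + 2.0 * d0 * d0)
          + core_m2 - second_moment)
    disc = qb * qb - 4.0 * qa * qc
    if disc < 0:
        raise ValueError("no real solution for these targets")
    m_lo = (-qb + math.sqrt(disc)) / (2.0 * qa)  # larger root is used
    m_hi = d0 + d1 * m_lo
    if m_lo <= 0 or m_hi <= 0:
        raise ValueError("tail scales must be positive")
    return m_lo, m_hi


def quantile(u, levels, values, m_lo, m_hi):
    """Quantile function of the fitted distribution for 0 < u < 1."""
    if u < levels[0]:  # lower exponential tail
        return values[0] + m_lo * math.log(u / levels[0])
    if u > levels[-1]:  # upper exponential tail
        return values[-1] - m_hi * math.log((1.0 - u) / (1.0 - levels[-1]))
    for k in range(len(values) - 1):  # flat core: linear interpolation
        if u <= levels[k + 1]:
            frac = (u - levels[k]) / (levels[k + 1] - levels[k])
            return values[k] + frac * (values[k + 1] - values[k])


m_lo, m_hi = fit_exponential_tails(levels, values, target_mean, target_sd)
print(f"lower tail scale = {m_lo:.3f}, upper tail scale = {m_hi:.3f}")
print([round(quantile(p, levels, values, m_lo, m_hi), 2) for p in levels])
```

Samples can be drawn by passing uniform random numbers through `quantile`. Heavier tails (for example Pareto-type) could be substituted in the same way if more kurtosis is needed, at the cost of a less simple moment calculation.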
