Density function from percentiles (P10, P25, P75, P90, mean and median)?

Question

I have percentile data (P10, P25, P75 and P90) for a variable.
I also have the mean and median for each group:

group    mean   median  P10     P25     P75     P90
1        30100  26200   19900   22500   32800   44200
2        38700  36600   28000   31500   44000   52100

How do I:

1. Create a probability density function based on these variables?
2. Use that function to give me the % in specific step intervals? (I.e. answering the question: how many out of 100 are in the 30000-31000 interval for group 2?)

Thanks.

Do you assume any underlying distribution of your variable -- like a normal distribution? — krlmlr, Feb 23 '12 at 10:42
This is income data per occupation in a single country so I'm not sure what to assume? Thanks. — dani, Feb 23 '12 at 11:49
Then you'll have to assume a distribution in the first place. Otherwise there's no way to tell which PDF fits the data best. -- Also, isn't P50 supposed to be the same as the median? — krlmlr, Feb 23 '12 at 12:09
OK, I guess I should assume a normal distribution then? Does that make sense looking at the data? And of course you're right, it should be P75 instead of P50!! — dani, Feb 23 '12 at 12:20
Then: What is P100, the maximum in each group? Do you have this data? — krlmlr, Feb 23 '12 at 12:27
Nope, this is absolutely the only data I have, from which I want to draw a curve. — dani, Feb 23 '12 at 12:40
A very closely related thread appears at http://stats.stackexchange.com/q/6022: it concerns the same problem where only three percentiles are known for data about amounts of government funding, which also has a distinctly skewed distribution. One answer includes clearly-explained working `R` code. — whuber, Feb 23 '12 at 15:08

onestop · Accepted Answer · 2012-02-23T13:56:39.993

The distributions are clearly positively skewed (in both groups the mean exceeds the median, and P90 lies much further above the median than P10 lies below it), so a normal distribution wouldn't be appropriate. Economists often assume that income has a log-normal distribution, so that would probably be a good choice if it fits OK. To check that, you could take logs of the percentiles and then construct a normal probability plot for each group by plotting the logged percentiles (ignore the mean but include the median as the 50th percentile) against the corresponding percentiles of a standard normal distribution. If the points lie roughly on a straight line then the log-normal distribution is a reasonable fit. You could then estimate its parameters by fitting a straight line by least squares: the intercept estimates the mean of log income and the slope estimates its standard deviation. That's not the optimal method, but it's simple and probably good enough.
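
As a rough illustration of the procedure just described, here is a minimal Python sketch (standard library only, Python 3.8+). The function name `fit_lognormal` and the variable names are my own choices for this example, not part of the original answer, and the least-squares line is the simple, non-optimal fit mentioned above.

```python
import math
from statistics import NormalDist

# Percentile levels and values from the question; the median is used as P50
# and the mean is ignored, as suggested in the answer.
levels = [0.10, 0.25, 0.50, 0.75, 0.90]
groups = {
    1: [19900, 22500, 26200, 32800, 44200],
    2: [28000, 31500, 36600, 44000, 52100],
}


def fit_lognormal(levels, values):
    """Fit log(value) = mu + sigma * z by ordinary least squares.

    z are the standard normal quantiles at the given percentile levels.
    Returns (mu, sigma, r_squared). Hypothetical helper for illustration.
    """
    z = [NormalDist().inv_cdf(p) for p in levels]  # standard normal quantiles
    y = [math.log(v) for v in values]              # logged percentiles
    n = len(z)
    z_mean = sum(z) / n
    y_mean = sum(y) / n
    sxx = sum((a - z_mean) ** 2 for a in z)
    syy = sum((b - y_mean) ** 2 for b in y)
    sxy = sum((a - z_mean) * (b - y_mean) for a, b in zip(z, y))
    sigma = sxy / sxx               # slope: sd of log income
    mu = y_mean - sigma * z_mean    # intercept: mean of log income
    r_squared = sxy ** 2 / (sxx * syy)
    return mu, sigma, r_squared


fits = {g: fit_lognormal(levels, v) for g, v in groups.items()}
for g, (mu, sigma, r2) in fits.items():
    print(f"group {g}: mu={mu:.4f} sigma={sigma:.4f} R^2={r2:.4f}")
```

Plotting the logged percentiles against `z` gives the normal probability plot. With only five points, R² is a crude guide to linearity rather than a formal test of fit.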

Update: Just tried that myself: [image: normal probability plot of the logged percentiles for each group]

Log-normal seems a reasonable fit in group 2, but not so good in group 1. I don't know if it might still be good enough for your purposes. If not, you might need to go to some three-parameter distribution, but that could get a fair bit more complicated.
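
If the log-normal fit is judged good enough, the two parameters give both the density and the share in any interval. The sketch below (Python standard library, 3.8+) assumes mu ≈ 10.53 and sigma ≈ 0.243 for group 2, which is roughly what a least-squares line through the logged percentiles gives. These numbers are illustrative only, so substitute your own estimates. The function names are hypothetical.

```python
import math
from statistics import NormalDist

# ASSUMED parameters for group 2 (approximate least-squares estimates on the
# log scale); replace with the values from your own fit.
mu, sigma = 10.53, 0.243


def lognormal_pdf(x, mu, sigma):
    """Density of a log-normal variable at x > 0."""
    return NormalDist(mu, sigma).pdf(math.log(x)) / x


def lognormal_interval_prob(lo, hi, mu, sigma):
    """P(lo < X <= hi) for a log-normal variable, with 0 < lo < hi."""
    log_dist = NormalDist(mu, sigma)  # distribution of log(X)
    return log_dist.cdf(math.log(hi)) - log_dist.cdf(math.log(lo))


share = lognormal_interval_prob(30000, 31000, mu, sigma)
print(f"Estimated share in 30000-31000: {100 * share:.1f} out of 100")
print(f"Density at 30500: {lognormal_pdf(30500, mu, sigma):.3e}")
```

Because the interval is narrow, the share is close to the density at the midpoint times the interval width (1000), which is a quick sanity check on the result.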

edited Feb 23 '12 at 13:56

answered Feb 23 '12 at 13:21


I couldn't have written this better :-) See [also](http://stats.stackexchange.com/q/6022/6432) [these](http://stats.stackexchange.com/q/12397/6432) [questions](http://stats.stackexchange.com/q/12930/6432). – krlmlr Feb 23 '12 at 13:47
Thanks! Great explanation! I think the log-normal will do (is R2 > 0.9 a good indication of this?) Two follow up questions: How do you construct the standard normal quantiles from -2 to 2? And how do I go from these parameters to a density function? Thanks. – dani Feb 23 '12 at 14:08
How do you know it's clearly positively skewed without plotting it? – rasen58 Jun 06 '17 at 19:00
