
Let's say we have the following dataframe:

       TY_MAX
141  1.004622
142  1.004645
143  1.004660
144  1.004672
145  1.004773
146  1.004820
147  1.004814
148  1.004807
149  1.004773
150  1.004820
151  1.004814
152  1.004834
153  1.005117
154  1.005023
155  1.004928
156  1.004834
157  1.004827
158  1.005023
159  1.005248
160  1.005355

25th: 1.0031185409705132
50th: 1.004634349800723
75th: 1.0046683578907745
Calculated 50th: 1.003893449430644

I am a bit confused here. If we take the 75th percentile, 75% of the data should be below it. And if we take the 25th percentile, 25% of the data should be below that. So I am thinking that 50% of the data should be between the 25th and 75th percentiles. The 50th percentile also gives me a different value; fair enough, which means 50% of the data should be below that value. But my question is: is my approach correct?

EDIT: Also, can we say that 98% of the data will be between the 1st and 99th percentiles?
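
For concreteness, here is a minimal sketch (hypothetical data standing in for the full TY_MAX column; assuming pandas' `Series.quantile` with its default linear interpolation) that checks these statements empirically. It also shows that the "Calculated 50th" above is exactly the midpoint of the reported 25th and 75th percentiles, which is not the median in general:

```python
import numpy as np
import pandas as pd

# The "Calculated 50th" above is the midpoint of the reported 25th and
# 75th percentiles -- generally not the same thing as the actual median.
q25, q75 = 1.0031185409705132, 1.0046683578907745
print((q25 + q75) / 2)  # ~1.003893449430644, the "Calculated 50th" above

# Hypothetical data standing in for the full TY_MAX column (not the original).
rng = np.random.default_rng(0)
s = pd.Series(rng.normal(loc=1.0047, scale=0.0005, size=1000), name="TY_MAX")

p01, p25, p50, p75, p99 = s.quantile([0.01, 0.25, 0.50, 0.75, 0.99])

# About 50% of the observations fall between the 25th and 75th percentiles...
print(((s >= p25) & (s <= p75)).mean())
# ...and about 98% between the 1st and 99th percentiles.
print(((s >= p01) & (s <= p99)).mean())
```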

Don Coder

2 Answers


Yes.

  • 75% of your data are below the 75th percentile.
  • 25% of your data are below the 25th percentile.
  • Therefore, 50% (=75%-25%) of your data are between the two, i.e., between the 25th and the 75th percentile.
  • Completely analogously, 98% of your data are between the 1st and the 99th percentile.
  • And the bottom half of your data, again 50%, are below the 50th percentile.

These numbers may not be completely correct, especially if you have only a few data points. Note also that there are different conventions for how quantiles and percentiles are actually computed.
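
To illustrate the convention point, here is a minimal sketch (assuming numpy ≥ 1.22, where `np.quantile` exposes several of the Hyndman & Fan quantile types via its `method` argument); on a small sample, the choice of convention visibly changes the result:

```python
import numpy as np

# A deliberately small sample, where the choice of convention matters most.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])

# 'linear' is Hyndman & Fan type 7 (R's default);
# 'median_unbiased' is type 8, the one Hyndman & Fan recommend.
for method in ("linear", "median_unbiased", "lower", "higher", "nearest"):
    print(f"{method:16s} {np.quantile(x, 0.25, method=method)}")
```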

Stephan Kolassa
  • Another reason why your numbers may be off is when you have a lot of ties (observations with the same value). – Maarten Buis Jul 31 '18 at 13:00
  • Thank you very much. May I also ask what the most commonly used percentile is? And for the most accurate method, how much data do you think is required? – Don Coder Jul 31 '18 at 13:01
  • @StephanKolassa beat me with a quick answer yet again! Sorry for double-posting after you had already submitted an answer. – ERT Jul 31 '18 at 13:06
  • "Most commonly used percentile" - do you mean which *type* as per the `type` argument in [R's `quantile()`](https://stat.ethz.ch/R-manual/R-patched/library/stats/html/quantile.html)? [Hyndman & Fan](http://doi.org/10.2307/2684934) recommend type 7, which is also the default. To be quite honest, the differences are minor. Or do you mean what percentage is commonly used? That will depend on your application; we can't help you with that. And of course, the more data you get, the more accurate you will be. What level of accuracy is enough will depend on your data and your application. – Stephan Kolassa Jul 31 '18 at 13:18
  • Actually, I am asking what percentile is commonly used. What I am planning to do is get the 25th-75th percentiles, then calculate, over the last 100 rows, how many times the data was between the 1st and 99th. If it is less than 50%, then that means there is a good chance the data will be between the 25th and 75th percentiles. Maybe I am wrong, but this is what I was hoping to sort out. – Don Coder Jul 31 '18 at 13:23
  • What level you need will depend on what you will use your analysis for. – Stephan Kolassa Jul 31 '18 at 13:25
  • "Not completely correct, especially if you have only a few data points" - might be worth clarifying this, as there are at least two factors I can see at play: (1) the sample size may not be exactly divisible by 4 or 100 or whatever is needed for the quantile in question; (2) data points may not be unique (e.g. for data on a whole-number, 1-to-5 scale, you can expect many repeated values; quartiles in that case can behave *very* badly with respect to properties like "50% of data lie above the median" or "between Q1 and Q3", and percentiles are often a waste of time). – Silverfish Jul 31 '18 at 16:32
  • That Hyndman & Fan paper is seriously fan! (+1 obviously) – usεr11852 Jul 31 '18 at 23:36
  • @StephanKolassa, it seems Hyndman & Fan recommended type 8. (Which is also mentioned in `?quantile`.) – Axeman Aug 01 '18 at 12:00
  • @Axeman: good catch! I seem to have misread. Quote: "Further details are provided in Hyndman and Fan (1996) who recommended type 8. The default method is type 7, as used by S and by R < 2.0.0. " – Stephan Kolassa Aug 01 '18 at 16:14

Ideally, yes.

Percentiles are usually interpreted in terms of the normal distribution (as normality is often an underlying, sometimes unstated, assumption when computing any sort of elementary statistical measure). The distribution does not have to be normal, however.

According to this website...

The standard normal distribution can also be useful for computing percentiles. For example, the median is the 50th percentile, the first quartile is the 25th percentile, and the third quartile is the 75th percentile. In some instances it may be of interest to compute other percentiles, for example the 5th or 95th. The formula below is used to compute percentiles of a normal distribution: $X = \mu + Z \sigma$

So, if we assume normality, we can easily compute any percentile we are looking for. Percentiles require no distributional assumptions, however, and are bound to the data from which they are computed. This means that percentiles can provide meaningful benchmarks for both normal and non-normal distributions. You may also use percentiles in a probability interpretation, of course based on the measurements you currently have, which could be good or bad indicators of the true underlying distribution.
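
As a minimal sketch of that formula (with illustrative `mu` and `sigma`, not fitted to the question's data; assuming scipy is available), where `norm.ppf` supplies the standard normal quantile $Z$:

```python
from scipy.stats import norm

mu, sigma = 1.0047, 0.0005  # illustrative parameters only

# X = mu + Z * sigma, where Z is the standard normal quantile for level p.
for p in (0.05, 0.25, 0.50, 0.75, 0.95):
    z = norm.ppf(p)  # standard normal quantile Z
    print(f"{100 * p:.0f}th percentile: {mu + z * sigma:.6f}")

# Equivalently, let scipy apply the location/scale shift directly:
print(norm.ppf(0.95, loc=mu, scale=sigma))
```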

According to this site...

Direct interpretation: consider the 10th ($P_{10}$) and 90th ($P_{90}$) percentiles: "given the available data, we know that soil property $p < P_{10}$ 10% of the time, and $p < P_{90}$ 90% of the time". This same statement can be framed using probabilities or proportions: "given the available data, soil property $p$ is within the range $[P_{10}, P_{90}]$ 80% of the time".

ERT