30

Are there 99 percentiles, or 100 percentiles? And are they groups of numbers, or divider lines, or pointers to individual numbers?

I suppose the same question would apply for quartiles or any quantile.

I have read that the index of a number at a particular percentile (p), given n items, is i = (p / 100) * n.

That suggests to me that there are 100 percentiles, because supposing you have 100 numbers (i = 1 to i = 100), each would have an index (1 to 100).

If you had 200 numbers, there'd be 100 percentiles, but would each refer to a group of two numbers? Or would they be 100 dividers, excluding either the far-left or the far-right divider, because otherwise you'd get 101 dividers? Or pointers to individual numbers, so that the first percentile would refer to the second number, (1/100)*200 = 2, and the hundredth percentile would refer to the 200th number, (100/100)*200 = 200?
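The "pointer" reading above can be checked with a small sketch. The function name `percentile_index` is hypothetical, and the rule is only the one quoted in the question, not the only convention in use.

```python
def percentile_index(p, n):
    """1-based index i = (p / 100) * n for percentile p among n sorted items."""
    return p * n / 100  # multiply first so whole-number cases stay exact

# With 200 items, this rule maps percentiles 1..100 onto every second item.
print(percentile_index(1, 200))    # 2.0
print(percentile_index(100, 200))  # 200.0
```

Under this rule there is no percentile that points at the first item, which is one reason the three readings (groups, dividers, pointers) give different counts.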

I have sometimes heard of there being 99 percentiles, though.

Google shows the Oxford dictionary, which says of percentile: "each of the 100 equal groups into which a population can be divided according to the distribution of values of a particular variable." and "each of the 99 intermediate values of a random variable which divide a frequency distribution into 100 such groups."

Wikipedia says "the 20th percentile is the value below which 20% of the observations may be found". But does it actually mean "the value below or equal to which 20% of the observations may be found", i.e. "the value for which 20% of the values are <= it"? If it were just < and not <=, then by that reasoning the 100th percentile would be the value below which 100% of the values may be found. I have heard that used as an argument that there can be no 100th percentile, because you can't have a number with 100% of the numbers below it. But I think that argument may be incorrect: it rests on the error of reading the definition with < when it actually involves <= (or > when it involves >=). So the hundredth percentile would be the final number, and it would be >= 100% of the numbers.
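The difference between the < and <= readings can be illustrated with a short sketch. The helper `percent_below` and the sample of 100 distinct values are assumptions made up for this illustration.

```python
def percent_below(data, x, inclusive):
    """Percentage of values in data that are < x (or <= x when inclusive)."""
    if inclusive:
        count = sum(1 for v in data if v <= x)
    else:
        count = sum(1 for v in data if v < x)
    return 100 * count / len(data)

data = list(range(1, 101))  # hypothetical sample: 100 distinct values, 1..100
largest = max(data)

print(percent_below(data, largest, inclusive=False))  # 99.0
print(percent_below(data, largest, inclusive=True))   # 100.0
```

With the strict reading no observed value reaches 100%, while with the <= reading the largest value does.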

kjetil b halvorsen
barlop
  • 5
    I think it unlikely 100 would be a reasonable answer due to its asymmetric treatment of the extremes. Cases can be made for either 99 (as in the definition you quote) or 101. – whuber Oct 07 '19 at 18:50
  • 5
    Historically quantiles — as we now say generically — were first summary points, and then by extension the bins, classes or intervals they delimit. So three quartiles, including the median, define four bins, and so forth. – Nick Cox Oct 07 '19 at 19:40
  • @NickCox do you have a source for that? – barlop Oct 07 '19 at 21:46
  • 1
    @whuber You write "I think it unlikely 100 would be a reasonable answer due to its asymmetric treatment of the extremes." – barlop Oct 07 '19 at 21:47
  • @whuber You write "Cases can be made for either 99 (as in the definition you quote) or 101" – barlop Oct 07 '19 at 21:49
  • Related: [What are some examples of reversed usage of “percentiles”?](https://stats.stackexchange.com/questions/416804/what-are-some-examples-of-reversed-usage-of-percentiles) – Glen_b Oct 07 '19 at 23:07
  • 3
    I list early uses of various quantile terms at https://stats.stackexchange.com/questions/235330/iles-terminology-for-the-top-half-a-percent/235334#235334. If you look within the OED or jstor you will get examples of historical usage. – Nick Cox Oct 07 '19 at 23:45
  • There is a case, perhaps as much facetious as serious, for referring to sample minimum and maximum as e.g. 0 and 100th percentiles, but I don't recommend it. Note that if you are working with percentiles, nothing obliges you to calculate all 99 from 1% to 99% and that would usually only make sense if you had a very large sample, and wanted a fine-grained description, and ties did not bite hard. – Nick Cox Oct 08 '19 at 08:04
  • Search the forum for mentions of a paper by Hyndman and Fan e.g. https://stats.stackexchange.com/questions/367467/is-there-more-than-one-median-formula In practice there are many slightly different definitions of quantiles, and thus of the bins they delimit. – Nick Cox Oct 08 '19 at 08:08
  • There is no one "Oxford dictionary". The Oxford University Press publishes several. Which did you consult? – TRiG Oct 08 '19 at 10:19
  • 1
    There can't be 100. Either 99 or 101, depending on whether you count maximum and minimum – David Oct 08 '19 at 10:38
  • The simple answer is: there are 100 percentiles, 0th thru 99th. Yes there is ambiguous terminology between percentile boundaries and percentile groups. Being in the nth percentile **group** means being **above** the nth percentile **boundary** and **at or below** the (n+1)st percentile **boundary**. Note that there well could be no population member **at** a given percentile **boundary**, so your "pointers to individual numbers" guess is wrong. – Jeff Y Oct 08 '19 at 13:04
  • As such, I suppose one could say that there **is** a 100th-percentile **boundary** but there is **no** 100th-percentile **group**. No wonder there is confusion. Ambiguity does that. – Jeff Y Oct 08 '19 at 13:06
  • @JeffY That is unhelpful, not least in tone. To peel off just one misleading assertion: the first quartile bin lies below the first or lower quartile, if only as conventional terminology. – Nick Cox Oct 08 '19 at 13:10
  • @JeffY do you agree that if you have quartiles then you have four groups. see this diagram https://i.imgur.com/dN3hSwq.png You have 0-25, 25-50, 50-75, 75-100 (and the interquartile range is 25-75. The range between the two quarters in the middle). So I guess you agree that with quartiles you have 4 groups? So why would you not then agree / how can you not then agree, that when dealing with percentiles you have 100 groups? – barlop Oct 08 '19 at 13:33
  • @barlop I **do** agree that there are 100 groups, the 0th through the 99th. I am only saying there is no 100th group -- no population member has 100% of the population **below** them. There seems to be some additional possible terminology disagreement/ambiguity between quartiles and percentiles. – Jeff Y Oct 08 '19 at 13:40
  • A bin doesn't have to be defined by an upper limit. Here is one: "greater than 42". Definitions just have to make unambiguous which values go in which bin. The general idea that $k$ points define $k + 1$ bins rules here, while leaving ties on one side together with some convention about points at each limit going up or down. – Nick Cox Oct 08 '19 at 13:55
  • @NickCox that is a good point that a bin doesn't have to have an upper limit.. it could have a lower limit and no upper limit, or I suppose vice versa, an upper limit and no lower limit. I suppose a point defines one boundary of a bin, rather than the bin / whole bin. You need more than the point to define the bin e.g. not just 42 but >=42 then you have the whole bin. – barlop Oct 08 '19 at 14:15
  • @JeffY also if you agree that quartile groups are going to be the same size, just as percentile groups are, then you'd agree that between the lower and upper quartiles are "two middle quartiles". Hence the IQR is the range between two quartiles, not a quartile in itself. It's the size of two quartiles (50%). – barlop Oct 08 '19 at 14:16
  • I don't think anyone is arguing about the IQR. Best leave it on one side. – Nick Cox Oct 08 '19 at 14:29
  • @NickCox Well, my point is that even though we don't normally hear about 2 quartiles in the middle, between the upper and lower quartiles, they're there, and the same principle applies to quartiles as to percentiles: if talking about groups, then just as there are 4 quartile groups, there are 100 percentile groups. So that totally counters Jeff's statement about there being no 100th group. – barlop Oct 08 '19 at 14:31
  • @NickCox Quantiles are not usually that general are they (no upper limit, not indexed to a fraction of population size)? Or did you mean "greater than 42%"? (Which entails <=100% so there is in fact an upper limit?) – Jeff Y Oct 08 '19 at 14:31
  • @JeffY I am just rebutting an assumption of yours that an upper group can't be identified if an upper limit can't be identified. "greater than 42" is just an example with 42 as an arbitrary number: nothing to do with quantiles or percents as such. The point is mathematical: $(42, \infty)$ is a bin. – Nick Cox Oct 08 '19 at 14:35
  • barlop: I agree that the logic here is general, regardless of which number of quantiles is being discussed. They don't even need to be equally spaced on a cumulative probability scale. So the 10, 25, 50, 75, 90% points are 5 points defining 6 bins. And so on and so forth. – Nick Cox Oct 08 '19 at 14:39
  • @NickCox Ahh. I think you misunderstand me then. There is an upper group, it's just its name we are disagreeing on I think. I am saying that the "99th-percentile group" is the upper group (and the "0th-percentile group" is the lower group). I.e. above the 99%-percentile boundary and at or below the 100%-percentile "boundary". – Jeff Y Oct 08 '19 at 14:43
  • @JeffY You are making an error that some with an IT background make. If you have an array/list of 5 items, it'd be an error to say there is no 5th item. The indexing is 0..4 but if you are going to look at the count of the number of items, it's 5, and the 1st item is the item with index 0. The 5th item is the last one, whose index is 4. You can't say that an array of 5 items has no 5th item. The phrase "0th item" is a bit of a nonsense and has confused you into thinking there is no 5th item in an array with 5 items. And you've applied that same error here in this statistics case. – barlop Oct 08 '19 at 14:49
  • You may have a consistent way to think about this but it would be better spelled out in an answer. Your opening comment seemed clear enough but your terminology there I can only regard as non-standard. – Nick Cox Oct 08 '19 at 14:49
  • @barlop I am going by the definition of "being in the nth percentile" as meaning "(just) above n% of the population". By that definition, there is no "being in the 100% percentile". I.e. there is no "100th percentile group/bin". – Jeff Y Oct 08 '19 at 14:55
  • 1
    @Jeff As I requested of mkt (who posted the first answer here), I ask of you: can you cite an authoritative source for that definition? There very well may be, but I suspect it is not to be found in the statistical literature, because the vast majority of that literature adopts a different convention for describing the distribution: namely, we prefer to work with the chance that a random variable is *less than or equal* to some quantity. – whuber Oct 08 '19 at 14:58
  • 2
    @whuber Yes, it appears that what I am referencing is properly called "percentile rank", used in test-score reports & c.: https://en.wikipedia.org/wiki/Percentile, https://en.wikipedia.org/wiki/Percentile_rank, http://www.ncme.org/resources/glossary. Apologies for adding to confusion. In my defense, the difference appears to hinge on usage of the prepositions "at" vs. "in" (see 1st link). – Jeff Y Oct 08 '19 at 15:46
  • @barlop I see most sites / images speak of quartiles as points. 3 quartiles. Q1 Q2 Q3. The terms "upper quartile" and "lower quartile" are used, but it's rarer for the middle two regions to be referred to as quartiles; they are sometimes referred to as "lower middle quartile" and "upper middle quartile". And sometimes each group is referred to as each of four quarters. – barlop Oct 08 '19 at 22:45
  • @JeffY here they say "in" but mean "at". https://www.statisticshowto.datasciencecentral.com/probability-and-statistics/percentiles-rank-range/ "If you know that your score is in the 90th percentile, that means you scored better than 90% of people who took the test." Interestingly https://en.wikipedia.org/wiki/Percentile says "every score is in the 100th percentile" so sees percentiles as overlapping, smaller within larger. Unlike "lower quartile" and "upper quartile" that are equal sized regions. – barlop Oct 08 '19 at 22:45
  • @JeffY got it, above. As a student, there were standardized tests that showed percentile. On math, I was devastated to have a score showing 99th percentile. Until it was explained to me that this actually was the highest one could get. (Other scores were 'normal', English was 75-80, who even remembers?) – JTP - Apologise to Monica Oct 09 '19 at 00:50
  • 1
    @JoeTaxpayer Those who told you that were being consistent if and only if they told people in the bottom bin that they were in the 0th percentile. I can't know if they did, and presumably you didn't hang out with any likely candidates. – Nick Cox Oct 09 '19 at 07:18

5 Answers

34

Both of these senses of percentile, quartile, and so on are in widespread use. It’s easiest to illustrate the difference with quartiles:

  1. the “divider” sense — there are 3 quartiles, which are the values dividing the distribution (or sample) into 4 equal parts:

       1   2   3
    ---|---|---|---
    

    (Sometimes this is used with max and min values included, so there are 5 quartiles numbered 0–4; note this doesn’t conflict with the numbering above, it just extends it.)

  2. the “bin” sense: there are 4 quartiles, the subsets into which those 3 values divide the distribution (or sample)

     1   2   3   4
    ---|---|---|---
    

Neither usage can reasonably be called “wrong”: both are used by many experienced practitioners, and both appear in plenty of authoritative sources (textbooks, technical dictionaries, and the like).

With quartiles, the sense being used is usually clear from context: speaking of a value in the third quartile can only be the “bin” sense, while speaking of all values below the third quartile most likely means the “divider” sense. With percentiles, the distinction is more often unclear, but it’s also not so significant for most purposes, since 1% of a distribution is so small: a narrow strip is approximately a line. Speaking of everyone above the 80th percentile might mean the top 20% or the top 19%, but in an informal context that’s not a major difference, and in rigorous work, the meaning needed should presumably be clarified by the rest of the context.
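The two senses can be seen side by side in a minimal Python sketch using the standard library's `statistics.quantiles`. The sample `data` and the helper `bin_number` are hypothetical, and sending a value equal to a cut point to the lower bin is just one possible convention.

```python
import statistics
from bisect import bisect_left

data = list(range(1, 101))  # hypothetical sample: 1..100

# "Divider" sense: n=4 yields 3 cut points, n=100 yields 99.
quartile_cuts = statistics.quantiles(data, n=4)
percentile_cuts = statistics.quantiles(data, n=100)
print(len(quartile_cuts), len(percentile_cuts))  # 3 99

def bin_number(x, cuts):
    """1-based bin of x; a value equal to a cut point goes in the lower bin."""
    return bisect_left(cuts, x) + 1

# "Bin" sense: the 3 quartile cut points delimit 4 bins.
bins = {bin_number(x, quartile_cuts) for x in data}
print(sorted(bins))  # [1, 2, 3, 4]
```

The same $k$ cut points always give $k + 1$ bins, so 99 percentile dividers correspond to 100 percentile bins.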

(Parts of this answer are adapted from https://math.stackexchange.com/questions/1419609/are-there-3-or-4-quartiles-99-or-100-percentiles, which also gives quotations + references.)

PLL
  • 3
    (+1) This late answer nicely gets to the heart of the matter. – Nick Cox Oct 09 '19 at 08:55
  • what about https://en.wikipedia.org/wiki/Percentile says "every score is in the 100th percentile" – barlop Oct 10 '19 at 02:49
  • 3
    The Wikipedia entry does say that. I can't think of a defence for such wording. Wikipedia is wonderful, except when it is misleading or wrong. That will sound flippant, but all that I can do is encourage anyone watching who is active on Wikipedia to improve the entry. Everyone has to have rules for what they do and don't do, and being active here and in a few other places is my personal limit. – Nick Cox Oct 10 '19 at 10:22
5

Take this answer with a grain of salt -- it started out fairly wrong and I am still deciding what to do with it.

The question is partly about language and usage, whereas this answer focuses on mathematics. I hope that the mathematics will provide a framework for understanding different usages.

One nice way to treat this is to start with simple math and work backwards to the more complicated case of real data. Let's start with PDFs, CDFs, and inverse CDFs (also known as quantile functions). The $x$th quantile of a distribution with PDF $f$ and CDF $F$ is $F^{-1}(x)$. Suppose the $z$th percentile is $F^{-1}(z/100)$. This provides a way to pin down the ambiguity you identify: we can look at situations where $F$ is 1) not invertible, 2) only invertible on a certain domain, or 3) invertible but never attains certain values, so that its inverse is undefined there.

Example of 1): I'll leave this for last; keep reading.

Example of 2): For a uniform distribution on $[0, 1]$, the CDF is invertible when restricted to $[0, 1]$, so the 100th and 0th percentiles could be defined as $F^{-1}(1)$ and $F^{-1}(0)$ given that caveat. Otherwise, they are ill-defined, since $F(-0.5)$ (for example) is also 0.

Another example of 2): For a uniform distribution on the two disjoint intervals from 0 to 1 and 2 to 3, the CDF looks like this.

[Figure: CDF of the uniform distribution on the two intervals from 0 to 1 and 2 to 3]

Most quantiles of this distribution exist and are unique, but the median (50th percentile) is inherently ambiguous. In R, they go halfway: `quantile(c(runif(100), runif(100) + 2), 0.5)` returns about 1.5.

Example of 3): For a normal distribution, the 100th and 0th percentiles do not exist (or they "are" $\pm \infty$). This is because the normal CDF never attains 0 or 1.

Discussion of 1): For "nice" CDFs, such as with non-extreme quantiles or continuous distributions, the percentiles exist and are unique. But for a discrete distribution such as the Poisson distribution, my definition is ambiguous because for most $z/100$, there is no $y$ with $F(y) = z/100$. For a Poisson distribution with expectation 1, the CDF looks like this.

[Figure: CDF of the Poisson distribution with expectation 1]

For the 60th percentile, R returns 1 (`quantile(c(rpois(lambda = 1, n = 1000)), 0.60)`). For the 65th percentile, R also returns 1. You can think of this as drawing 100 observations, ranking them low to high, and returning the 60th or 65th item. If you do this, you will most often get 1.
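One common way to resolve the ambiguity for a discrete distribution is the generalized inverse: take the smallest $k$ with $F(k) \ge p$. The sketch below applies that convention to the exact Poisson CDF rather than to a simulated sample, so it illustrates the idea and does not reproduce what R's `quantile` does. The function names are hypothetical, and `p` is assumed to lie strictly between 0 and 1.

```python
import math

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam)."""
    return sum(math.exp(-lam) * lam**j / math.factorial(j) for j in range(k + 1))

def poisson_quantile(p, lam):
    """Smallest integer k with P(X <= k) >= p (generalized inverse CDF)."""
    k = 0
    while poisson_cdf(k, lam) < p:
        k += 1
    return k

# F(0) is about 0.368 and F(1) is about 0.736, so both levels land on 1.
print(poisson_quantile(0.60, 1))  # 1
print(poisson_quantile(0.65, 1))  # 1
```

Every level between roughly 0.368 and 0.736 maps to the same value 1, which is why many different percentiles of a discrete distribution can coincide.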

When it comes to real data, all distributions are discrete. (The empirical CDF of `runif(100)` or `np.random.random(100)` has 100 jumps scattered across the interval from 0 to 1.) But, rather than treating them as discrete, R's `quantile` function seems to treat them as samples from continuous distributions. For example, the median (the 50th percentile or 0.5 quantile) of the sample 3, 4, 5, 6, 7, 8 is given as 5.5. If you draw 2n samples from a unif(3,8) distribution and take any number between the nth and (n+1)th sample, you will converge on 5.5 as n increases.

It's interesting to also consider the discrete uniform distribution with equal probability of hitting 3,4,5,6,7,8. (A die roll plus two.) If you take the sample-and-rank approach outlined above for the Poisson distribution, you will usually get 5 or 6. As the samples get bigger, the distribution for the number halfway up will converge on half fives and half sixes. 5.5 seems like a reasonable compromise here too.
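The compromise for the sample 3, 4, 5, 6, 7, 8 can be seen with Python's standard `statistics` module, which offers both the interpolated median and the two "pick an observed value" alternatives. This is a sketch of the conventions in Python, not of R's behaviour.

```python
import statistics

sample = [3, 4, 5, 6, 7, 8]

print(statistics.median(sample))       # 5.5 (average of the two middle values)
print(statistics.median_low(sample))   # 5   (lower of the two middle values)
print(statistics.median_high(sample))  # 6   (higher of the two middle values)
```

All three are defensible answers to "what is the 50th percentile?", which is the ambiguity the discussion above is pointing at.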

eric_kernfeld
  • 2
    Your first paragraph has some incorrect information: $F^{-1}$ is indeed unique in many cases, *including* for the uniform distribution on $[0,1]$ (when $F$ is restricted to $[0,1]$ itself). This has little to do with $F$ being "constant." I think you are making misleading arguments that mix up the roles of *continuity,* *invertibility,* and *boundedness of support* of distributions. Introducing estimators and referring to them also as "quantiles" is interesting but threatens to make things even more confusing. – whuber Oct 07 '19 at 19:46
  • Good point. I have tried to separate out some cases to clarify that. How would you improve the discussion of continuity? The interpretation of quantiles as estimators is the central point of my answer; they don't really make sense to me without that. – eric_kernfeld Oct 08 '19 at 13:08
  • Re the latter: quantiles don't need to estimate anything. They are useful in their own right for describing and visualizing data (and often are used only as descriptive statistics). Re continuity: I think most authorities would say that all percentiles exist for discrete distributions. Insisting otherwise is an unnecessary complication. It would also render the results of most software calculations utterly mysterious, which happily provide all quantiles from 0 through 1 (*inclusive*) for any dataset. In `R`, for instance, type `quantile(0)`. – whuber Oct 08 '19 at 13:12
  • This discussion has made me realize that I do not understand quantiles of discrete distributions. I think I should delete this answer. – eric_kernfeld Oct 08 '19 at 13:30
  • What's the best etiquette here? Delete the answer, or put a disclaimer and leave a record of it? – eric_kernfeld Oct 08 '19 at 13:50
  • 1
    People vary about this, Eric. When my answers are so wrong as to be misleading, I first delete them. If I see some potential value in part of the answer I edit it to remove (or explain) the misleading part and then undelete it. Others just let things stand and take their lumps in the voting; others add an edit suggesting there may be value in readers seeing where some misunderstanding might have occurred; yet others just delete. You can even completely change the answer if you like, as is sometimes done. – whuber Oct 08 '19 at 13:55
  • I'd suggest leaving it as it is for now (you've mentioned that it has some issues and that you realised you don't fully understand it, so it's good that you mentioned that), and if over time your understanding improves, then come back and improve it, and in time it could evolve into a better answer. No need to delete it or rush it. – barlop Oct 09 '19 at 20:47
2

I was taught that an observation in the nth percentile is greater than n% of the observations in the dataset under consideration, which to me implies that there is no 0th or 100th percentile. No observation can be greater than 100% of observations because it forms part of that 100% (and a similar logic applies in the case of 0).

Edit: For what it's worth, this is also consistent with non-academic usage of the term that I've encountered: "X is in the nth percentile" implies that the percentile is the group, not a boundary.

I unfortunately have no source for this that I can point you to.

mkt
  • 6
    Do you have an authoritative reference for what you remember being taught? Note that you are implicitly adopting a definition of "percentile" as being a *group* of numbers. The other definition quoted in the question is that the percentile is a *boundary* between such groups. – whuber Oct 07 '19 at 18:51
  • @whuber Unfortunately not. And yes, I see the distinction. – mkt Oct 07 '19 at 18:51
  • 1
    That doesn't make sense to me because suppose your data is 2,2,2,2,2,2,2,2,2,2,2 so an item in one quantile is equal to an item to its left in a prior quantile. So an item in the nth quantile is not greater than all quantiles left of it. So an item in the nth percentile is not greater than n% of observations in the dataset. It's >= n% of observations in the dataset, but not simply >. And hence you can have a 100th percentile. What do you make of that logic? – barlop Oct 07 '19 at 21:45
  • 4
    Many definitions come under strain if all values are identical! – Nick Cox Oct 07 '19 at 23:46
  • 1
    @NickCox (one could improve on such flawed definitions then, and I don't see a flaw with the <= or >= definition, even for identical numbers, but that aside). I only made all values identical for ease of coming up with an example, but a good example illustrating my point needn't have that. For example, you could have a list of 16 values such as these: |1,2,3,4|4,5,6,7|8,9,10,11|12,13,14,15| Notice that not all the values in the first quartile group are less than the values in the second quartile group. But all values in the first quartile group are <= all values in later quartile groups. – barlop Oct 08 '19 at 06:50
  • 1
    @barlop I agree that there's ambiguity and like your question because it raises a point of reasonable disagreement. It is a bit strange if a basic concept is not understood to mean the same thing. Regarding your examples, I agree with Nick Cox for the first one and disagree with your grouping of the quartiles in the second - I would put both 4s in the same quartile. – mkt Oct 08 '19 at 07:18
  • 2
    Those of mathematical bent abstract and idealise while those who write software need to deal with the messiness of data. Your example of 16 values would be treated differently by software I know which follows a rule that identical values must be binned identically (and I agree). I am surprised that you did not agonise over data with 15 or 17 values where even if all values are distinct no rule can divide data into 4 bins of equal size. – Nick Cox Oct 08 '19 at 07:49
  • 3
    What's the similar logic for zero? Doesn't "greater than zero percent of the observations" mean "equal to or smaller than all the observations", i.e. the 0th percentile would be the lowest observed value? – ilkkachu Oct 08 '19 at 12:35
  • @ilkkachu I guess you're right, it's not that similar. It still seems like a meaningless quantity to me. But as you've no doubt seen in the other answers and comments, this is apparently a topic without a clear consensus. – mkt Oct 08 '19 at 12:42
  • @NickCox If you follow a rule that identical numbers have to be put in the same bin, then suppose you have `|1,2,3,4,4,4,4,4,4,4,4,4,4,4,14,15|` how are you going to put that into quartiles? your -identical values must be in the same bin- rule would mean that you'd have to have one large quantile group with a lot more than 25% of the data, that would mean you can't put the data into quartile groups / quarters. – barlop Oct 09 '19 at 20:58
  • @NickCox also, if talking about quantile as points, would you also say identical numbers get the same quantile point? that would be interesting as it means that all data below the X% percentile is below in value. Which would meet what many definitions eg " "the 20th percentile is the value below which 20% of the observations may be found" (that skip >= and <=) says, though I can't see how you can get quantiles corresponding correctly with rank / quantile groups being the right size, with that rule.. moreso the more identical numbers there are. – barlop Oct 09 '19 at 21:02
  • 1
    You can define a unicorn; that does not bring it into existence. Ideal quantile bins that contain equal numbers of distinct values can be frustrated by ties and awkward sample sizes. Why is this a surprise? – Nick Cox Oct 09 '19 at 22:32
  • @NickCox well i'm wondering what you do in that situation..(like can a 50th percentile be quite far from 50% of the way through the data), and i'd be interested to know what software you use(/software you know that puts or tries to put equal values in the same bin), as I might try it in that and see how that handles it – barlop Oct 10 '19 at 02:39
  • 1
    I use Stata. The `xtile` command and all variants on or alternatives to it I know follow the rule that equal values end up in the same bin. There isn't another solution to binning that is coherent, except avoiding it as a bad idea. You can always calculate a plotting position, which has a different result for different reasons. See https://www.stata.com/support/faqs/statistics/percentile-ranks-and-plotting-positions/ Sorry, but this is not a real problem in my view, as I keep trying to explain. – Nick Cox Oct 10 '19 at 06:30
2

There are other ways to calculate percentiles; what follows is not the only one. It is taken from this source.


The meaning of percentile can be captured by stating that the $p$th percentile of a distribution is a number such that approximately $p$ percent ($p\%$) of the values in the distribution are equal to or less than that number. So, if $28$ is the $80$th percentile of a larger batch of numbers, $80$% of those numbers are less than or equal to $28$.

To calculate percentiles, sort the data so that $x_1$ is the smallest value and $x_n$ is the largest, with $n$ the total number of observations. Then $x_i$ is the $p_i$th percentile of the data set, where:

$p_i = \dfrac{100(i - 0.5)}{n}$

Example from the same notes for illustration:

[Figure: worked example from the same notes]

To take a single example, $7$ is the $50$th percentile of the distribution, and about half of the values in the distribution are equal to or less than $7$.

"If you had 200 numbers, there'd be 100 percentiles, but would each refer to a group of two numbers."

No.

Assume the numbers are sorted in ascending order from $x_1$ to $x_{200}$. In this case the percentiles are:

$\dfrac{100(1-0.5)}{200}$, $\dfrac{100(2-0.5)}{200}$, $\dfrac{100(3-0.5)}{200}$, $\ldots$

resulting in

$0.25, 0.75, 1.25, \ldots$ percentiles corresponding to indices $1, 2, 3, \ldots$
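The formula $p_i = 100(i - 0.5)/n$ is easy to tabulate. The following is a minimal sketch of this one recipe only; the function name `percentile_of_rank` is hypothetical.

```python
def percentile_of_rank(i, n):
    """Percentile p_i = 100 * (i - 0.5) / n for the i-th smallest of n values."""
    return 100 * (i - 0.5) / n

n = 200
positions = [percentile_of_rank(i, n) for i in range(1, n + 1)]
print(positions[:3])  # [0.25, 0.75, 1.25]
print(positions[-1])  # 99.75
```

Under this recipe neither the smallest nor the largest observation is assigned the 0th or the 100th percentile.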

naive
  • 3
    The first sentence looks great, and one of the most important words is _approximately_. Thereafter this is a careful explanation of just one recipe. What's key is that there are several recipes and most if not all have some defensible logic about them (sometimes the logic is to keep things as simple as possible). See the Hyndman and Fan paper referred to in many threads here on CV. I doubt that many people would take your last paragraph as the way to report percentiles for your example. – Nick Cox Oct 08 '19 at 09:25
  • @Nick Cox Thank you for the insightful comment. About the last paragraph I believe the method should work fine when all the observations are different from each other. In case of repeated numbers there will not be unique percentile for the same number which doesn't sound good. Could you kindly suggest how to deal with the case. And could you also point out the potential pitfalls in the last paragraph. – naive Oct 08 '19 at 09:49
  • 1
    I don't think I want or need to add to what is already well explained in journal literature. First, you have some favourite software for this. See what it documents and what it does. Second, I've not calculated percentiles by hand for some decades, and none of us needs to. Third, my point about the last para: I guess no-one wants to be told that the observed data points are the 0.25, 0.75, 1.25, ... percentiles. What people do want varies, but in my experience it's most commonly wanting summaries such as 1, 5, 10, 25, 50, 75, 90, 95, 99% points as well as the sample extremes. – Nick Cox Oct 08 '19 at 09:57
  • 1
    I've just noticed that you assert that 0.5 is in EDA jargon often called the p-value for the median. Not in my reading, and even if you can find examples that is terrible terminology given an overwhelming majority sense for p-value as observed significance level. – Nick Cox Oct 08 '19 at 09:59
  • I will go through the paper that you suggested. Thank you – naive Oct 08 '19 at 09:59
  • I too am skeptical about the line about p-value in the image that i posted. It is from the source i linked to in the answer. I agree with you about the terminology. I do not want to make a statement about p-values anywhere in the answer. – naive Oct 08 '19 at 10:04
  • @NickCox what software do you use? – barlop Oct 09 '19 at 20:50
0

Note: I will accept somebody else's answer rather than mine. But I do see some useful comments, so I'm just writing an answer that mentions those.

Based on Nick's answer to "'-iles' terminology for the top half a percent", it seems that the terms are ambiguous, and I suppose (based on my understanding of that post) better terminology would be X% point and X%-Y% group: a quantile point (so for quartile points that could be anything from 0 to 4), and a quantile group ranging from quantile point X to quantile point Y.

Either way one wouldn't get 101 for percentiles, although one comment suggests that one could refer to 101 points (I suppose if you counted percentile points, and only integers). But even then, if one speaks of the 1st, 2nd, 3rd percentile or quantile, that's counting, and one can't count the first as 0, so you can't have e.g. more than 4 quartiles or more than 100 percentiles. So if talking 1st, 2nd, 3rd, that terminology can't really refer to point 0. If somebody said 0th point, then while it's clear they mean point 0, I think they should really say quantile point 0, or quantile group at point 0. Even computer scientists wouldn't say 0th; even they count the first item as 1, and if they call it item 0, that's indexing from 0, not a count.

A comment mentions "There can't be 100. Either 99 or 101, depending on whether you count maximum and minimum". I think there's a case for 99 or 101 when talking about quantile points rather than groups, though I wouldn't say 0th. For n items, an index may go from 0 to n-1, and one wouldn't write st/nd/th (1st, 2nd, etc.) on an index, unless perhaps the index happened to index the first item as 1. An index that starts the first item at 0 isn't a 1st, 2nd, 3rd count: the item with index 0 is the 1st item, and one wouldn't say 0th and label the second item 1st.

barlop
  • Any ambiguity was introduced by those who departed from clear historical precedent. It doesn't bite hard in practice. – Nick Cox Oct 08 '19 at 08:48
  • All mathematicians start counting at zero. The concept is simple and natural: saying the word "zero" out loud announces one's intention to count. Then one makes some (perhaps arbitrary) one-to-one assignment of the sequence of words "one," "two," "three," etc. to the objects being counted. The last of those words (if there is a last) is equated with the cardinality of the set. The beauty of this idea is that when there are no elements in the set, the last word said was "zero," which is the unique correct value. – whuber Oct 08 '19 at 11:32
  • @whuber you write "All mathematicians start counting at zero" – barlop Oct 08 '19 at 12:35
  • "it's counting and one can't count the first as 0". – whuber Oct 08 '19 at 12:48
  • @whuber you don't count the first item as 0 but you do count from 0. You count no items as 0, and one item as 1. For example there's in programming if you have an array of items indexed 0-5 that's 6 items. The 1st item has index 0. You wouldn't say 0th item, and you wouldn't say 1st item when you mean the second in the array. You say the item with index 1 when you mean the second item in the array – barlop Oct 08 '19 at 12:53
  • I believe many people would indeed say the "zeroth item," but that is beside the point. Thank you for clarifying what you meant. – whuber Oct 08 '19 at 12:55
  • 1
    @whuber possibly many might. I think many years ago I might have, as when studying computer science I sometimes heard that computer scientists count from 0, unlike mathematicians (that's not your claim or mine), but after some deep thought I got more clarity and realised that computer scientists and mathematicians both count from 0. The difference is that computer scientists often use an index, and the index indexes the first item as 0 (but the count would still be 1). – barlop Oct 08 '19 at 13:01
  • Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/99612/discussion-between-barlop-and-whuber). – barlop Oct 08 '19 at 13:01