21

How can I add a new variable to a data frame that holds the percentile rank of one of the existing variables? I can do this easily in Excel, but I really want to do it in R.

Thanks

ttnphns
user333

3 Answers

33

Given a vector of raw data values, a simple function might look like

perc.rank <- function(x, x0) length(x[x <= x0]) / length(x) * 100

where x0 is the value for which we want the percentile rank, given the vector x, as suggested on R-bloggers.

However, it might easily be vectorized as

perc.rank <- function(x) trunc(rank(x))/length(x)

which has the advantage of not having to pass each value. Note that this version returns a proportion between 0 and 1 rather than a percentage. Here is an example of use:

my.df <- data.frame(x=rnorm(200))
my.df <- within(my.df, xr <- perc.rank(x))
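
To make the difference between the two definitions concrete, here is a small worked example on a vector with ties. The names `perc.rank.at` and `perc.rank.all` are hypothetical, introduced only so that both versions can coexist in one session.

```r
# Scalar version: percentage of values less than or equal to x0
perc.rank.at <- function(x, x0) length(x[x <= x0]) / length(x) * 100

# Vectorized version: truncated (average) rank divided by n, as a proportion
perc.rank.all <- function(x) trunc(rank(x)) / length(x)

x <- c(10, 20, 20, 30, 40)

rank(x)              # 1.0 2.5 2.5 4.0 5.0  (ties get the average rank)
perc.rank.all(x)     # 0.2 0.4 0.4 0.8 1.0
perc.rank.at(x, 20)  # 60
```

The two versions disagree on tied values: the scalar one counts both 20s as "less than or equal" and gives 60%, whereas the truncated average rank gives 0.4.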
chl
  • 1. Your function does not mimic Excel's `percentrank`-function, which is good (+1) since the latter gives "strange" results (see my [comparison](https://gist.github.com/1026879)). 2. I wouldn't name the data frame `df`, because `df` is an R function (the density of the F distribution, see `?df`). – Bernd Weiss Jun 15 '11 at 11:04
  • @Bernd Thanks. (1) There are some built-in functions for computing PR in various psychometrics packages. I think I grabbed this one from the `CTT` package a while ago. I didn't check against Excel because I don't have/use it. About (2) I seem to always forget about this! Let's go with `my.*` (Perl way) :-) – chl Jun 15 '11 at 11:21
  • @chl why is the `trunc` required? It seems rank will always return an integer anyway. – Tyler Rinker May 10 '18 at 18:38
  • @Tyler Nope. In case there are ties, `rank()` defaults to taking the average of the tied values (cf. `ties.method = c("average",...)`). – chl May 11 '18 at 13:15
  • Beware that NA values should be removed! This can be done by adding `x = x[!is.na(x)]` – Antoine Mar 21 '21 at 16:55
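
Following up on the NA comment: one option that keeps the result the same length as the input, so it can still be assigned as a data frame column, is the `na.last = "keep"` argument of `rank()`, which leaves NAs in place. A minimal sketch; the name `perc.rank.na` is hypothetical.

```r
# NA-aware variant: NAs stay NA, the denominator counts only observed values
perc.rank.na <- function(x) trunc(rank(x, na.last = "keep")) / sum(!is.na(x))

my.df <- data.frame(x = c(3, NA, 1, 2))
my.df$xr <- perc.rank.na(my.df$x)
my.df$xr  # 1.0000000 NA 0.3333333 0.6666667
```

Dropping the NAs beforehand, as suggested in the comment, works too, but the result is then shorter than the original column.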
9

If your original data.frame is called dfr and the variable of interest is called myvar, you can use dfr$myrank <- rank(dfr$myvar) for normal ranks, or dfr$myrank <- rank(dfr$myvar)/length(dfr$myvar) for percentile ranks.

Oh well. If you really want it the Excel way (may not be the simplest solution, but I had some fun using new (to me) functions and avoiding loops):

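percentilerank <- function(x) {
  rx <- rle(sort(x))                        # distinct values and their counts
  # number of observations strictly smaller than each distinct value
  smaller <- cumsum(c(0, rx$lengths))[seq(length(rx$lengths))]
  # number of observations strictly larger than each distinct value
  larger <- rev(cumsum(c(0, rev(rx$lengths))))[-1]
  rxpr <- smaller / (smaller + larger)
  rxpr[match(x, rx$values)]                 # map back to the original order
}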

so now you can use dfr$myrank<-percentilerank(dfr$myvar)
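
As a cross-check, the quantity this function computes for each value is (number of strictly smaller values) / (number of strictly smaller + number of strictly larger values). A brute-force sketch of the same definition, under the hypothetical name `percentilerank.slow`, is easier to verify by hand:

```r
# For each value v: count of values below v, divided by the count of
# values below plus the count of values above (ties are left out).
percentilerank.slow <- function(x) {
  sapply(x, function(v) {
    smaller <- sum(x < v)
    larger  <- sum(x > v)
    smaller / (smaller + larger)
  })
}

percentilerank.slow(c(1, 2, 2, 3))  # 0.0 0.5 0.5 1.0
```

This version is quadratic in the length of x, and either version returns NaN when all values are identical (0/0).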

HTH.

Nick Sabbe
1

A problem with the presented answer is that it will not work properly when you have NAs.

In this case, another possibility (inspired by the function from chl♦) is:

perc.rank <- function(x) trunc(rank(x, na.last = NA)) / sum(!is.na(x))
quant <- function(x, p.ile) {
  x <- x[!is.na(x)]  # drop NAs so indices match the output of perc.rank
  x[which.min(abs(perc.rank(x) - p.ile / 100))]
}

Here, x is the vector of values, and p.ile is the percentile by rank. For example, the 2.5th percentile by rank of the third column of an (arbitrary) matrix coef.mat may be calculated by:

quant(coef.mat[,3], 2.5)  
[1] 0.00025  

or as a single function:

quant <- function(x, p.ile) {
  x <- x[!is.na(x)]                        # remove NAs before ranking
  perc.rank <- trunc(rank(x)) / length(x)  # percentile rank of each value
  x[which.min(abs(perc.rank - p.ile / 100))]
}
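
A small check of the idea on a vector containing an NA. This is a sketch under the assumption that the intent is to return the observed value whose percentile rank is closest to p.ile/100; the name `nearest.rank.value` is hypothetical.

```r
nearest.rank.value <- function(x, p.ile) {
  x  <- x[!is.na(x)]                # drop NAs first so indices line up
  pr <- trunc(rank(x)) / length(x)  # percentile rank of each remaining value
  x[which.min(abs(pr - p.ile / 100))]
}

nearest.rank.value(c(5, NA, 1, 3, 2, 4), 40)  # 2
```

After dropping the NA, the percentile ranks are 1.0, 0.2, 0.6, 0.4, 0.8, so the value with rank 0.4 (namely 2) is returned.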
Farshad