23

I need to calculate the cumulative distribution function of a data sample.

Is there something similar to hist() in R that computes the cumulative distribution function?

I have tried ecdf(), but I can't understand the logic.

Jeromy Anglim
emanuele

4 Answers

32

The ecdf function applied to a data sample returns a function representing the empirical cumulative distribution function. For example:

> X = rnorm(100) # X is a sample of 100 normally distributed random variables
> P = ecdf(X)    # P is a function giving the empirical CDF of X
> P(0.0)         # This returns the empirical CDF at zero (should be close to 0.5)
[1] 0.52
> plot(P)        # Draws a plot of the empirical CDF (see below)

[Figure: plot of the empirical CDF of X]

If you want to have an object representing the empirical CDF evaluated at specific values (rather than as a function object) then you can do

> z = seq(-3, 3, by=0.01) # The values at which we want to evaluate the empirical CDF
> p = P(z)                # p now stores the empirical CDF evaluated at the values in z

Note that p contains at most the same amount of information as P (and possibly it contains less) which in turn contains the same amount of information as X.
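To make the relationship between X, P and p concrete, here is a minimal sketch. It assumes a fixed seed (not in the original) so the run is reproducible, and the helper name `manual` is made up for illustration. It checks that the ECDF at t is simply the fraction of observations that are <= t, and that the jump points of P are the distinct sample values.

```r
set.seed(1)          # assumption: fixed seed, only for reproducibility
X <- rnorm(100)
P <- ecdf(X)

# Hypothetical helper: fraction of observations that are <= t
manual <- function(t) mean(X <= t)

z <- seq(-3, 3, by = 0.5)
p <- P(z)

# P(z) agrees with the hand-computed proportions
same_values <- isTRUE(all.equal(p, sapply(z, manual)))

# The knots (jump points) of P are the sorted distinct sample values,
# which is why P carries the same information as X (up to ordering)
same_knots <- isTRUE(all.equal(knots(P), sort(unique(X))))

print(c(same_values = same_values, same_knots = same_knots))
```

The vector p, by contrast, only records the ECDF on the grid z, so it cannot in general be used to recover X.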

Chris Taylor
1

What you appear to need is the accumulated distribution, i.e. the probability of getting a value <= x in a sample. ecdf returns a function, but it appears to be made for plotting, so the argument of that function, if the plot were a staircase, would be the index of the tread.

You can use this:

# Fraction of the values in `sample` that are <= x
acumulated.distrib <- function(sample, x) {
    minors <- 0                       # running count of values <= x
    for (n in sample) {
        if (n <= x) {
            minors <- minors + 1
        }
    }
    return(minors / length(sample))
}

mysample = rnorm(100)
acumulated.distrib(mysample,1.21) #1.21 or any other value you want.

Sadly, this function is not very fast. I don't know whether R has a built-in that does this and returns a function, which would be more efficient.
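As a hedged aside, the loop can likely be replaced by a vectorised comparison, and the built-in ecdf() seems to be exactly the "function that returns a function" asked about here. The name `acumulated.distrib.vec` and the seed are assumptions made for this sketch.

```r
# Hypothetical vectorised variant: the mean of a logical vector
# is the fraction of TRUE entries
acumulated.distrib.vec <- function(sample, x) mean(sample <= x)

set.seed(42)                 # assumption: fixed seed for reproducibility
mysample <- rnorm(100)

by_hand <- acumulated.distrib.vec(mysample, 1.21)

# ecdf() builds the step function once; it can then be evaluated at any x
Fn <- ecdf(mysample)
by_ecdf <- Fn(1.21)

print(c(by_hand = by_hand, by_ecdf = by_ecdf))
```

Both calls should give the same proportion; building Fn once is convenient when the CDF is needed at many points.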

Casas
    You seem to mix up the ECDF with its inverse. `R` does, indeed, compute the ECDF: its argument is a potential value of the random variable and it returns values in the interval $[0,1]$. This is readily checked. For instance, `ecdf(c(-1,0,3,9))(8)` returns `0.75`. A generalized inverse of the ECDF is the quantile function, implemented by `quantile` in `R`. – whuber Jun 01 '15 at 16:19
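The comment above can be checked directly. This sketch evaluates the ECDF of the four-point sample from the comment and then applies quantile() with `type = 1`, which R documents as the inverse of the empirical distribution function. The variable names are chosen here for illustration.

```r
x <- c(-1, 0, 3, 9)

# The ECDF takes a value of the variable and returns a probability
Fn <- ecdf(x)
prob_at_8 <- Fn(8)            # three of the four values are <= 8

# The quantile function goes the other way: probability in, value out.
# type = 1 selects the inverse of the empirical distribution function.
value_at_075 <- unname(quantile(x, probs = 0.75, type = 1))

print(prob_at_8)
print(value_at_075)
```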
1

Friend, you can read the code on this blog.

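library(ggplot2)  # provides ggplot(), aes() and stat_ecdf()
sample.data <- read.table('data.txt', header = TRUE, sep = "\t")
# Draw one ECDF curve of Delay per Type
cdf <- ggplot(data = sample.data, aes(x = Delay, group = Type, color = Type)) + stat_ecdf()
cdf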

More details can be found at the following link:

r cdf and histogram
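Since data.txt is not available here, the following is only a rough base-R sketch of the same idea (one ECDF curve per group). The simulated data frame is a made-up stand-in with the same column names, Delay and Type; the group labels "A" and "B" and the distributions are assumptions.

```r
# Hypothetical stand-in for data.txt: simulated delays for two types
set.seed(7)
sample.data <- data.frame(
  Delay = c(rexp(50, rate = 1), rexp(50, rate = 0.5)),
  Type  = rep(c("A", "B"), each = 50)
)

# Build one ECDF function per Type
ecdfs <- lapply(split(sample.data$Delay, sample.data$Type), ecdf)

# Overlay the two step functions on a shared x-range
plot(ecdfs[["A"]], xlim = range(sample.data$Delay),
     main = "ECDF of Delay by Type")
plot(ecdfs[["B"]], add = TRUE, col = "red")
legend("bottomright", legend = names(ecdfs),
       col = c("black", "red"), lty = 1)
```

The ggplot2 version does the grouping and colouring in one call; the base-R version makes the per-group ecdf objects available for further evaluation.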

Rudy Yuan
1

I always found ecdf() to be a little confusing. Plus I think it only works in the univariate case. Ended up rolling my own function for this instead.

First install data.table. Then install my package, mltools (or just copy the empirical_cdf() method into your R environment.)

Then it's as easy as

# load packages
library(data.table)
library(mltools)

# Make some data
dt <- data.table(x=c(0.3, 1.3, 1.4, 3.6), y=c(1.2, 1.2, 3.8, 3.9))
dt
     x   y
1: 0.3 1.2
2: 1.3 1.2
3: 1.4 3.8
4: 3.6 3.9

CDF of a vector

empirical_cdf(dt$x, ubounds=seq(1, 4, by=1.0))
   UpperBound N.cum  CDF
1:          1     1 0.25
2:          2     3 0.75
3:          3     3 0.75
4:          4     4 1.00

CDF of column 'x' of dt

empirical_cdf(dt, ubounds=list(x=seq(1, 4, by=1.0)))
   x N.cum  CDF
1: 1     1 0.25
2: 2     3 0.75
3: 3     3 0.75
4: 4     4 1.00

CDF of columns 'x' and 'y' of dt

empirical_cdf(dt, ubounds=list(x=seq(1, 4, by=1.0), y=seq(1, 4, by=1.0)))
    x y N.cum  CDF
 1: 1 1     0 0.00
 2: 1 2     1 0.25
 3: 1 3     1 0.25
 4: 1 4     1 0.25
 5: 2 1     0 0.00
 6: 2 2     2 0.50
 7: 2 3     2 0.50
 8: 2 4     3 0.75
 9: 3 1     0 0.00
10: 3 2     2 0.50
11: 3 3     2 0.50
12: 3 4     3 0.75
13: 4 1     0 0.00
14: 4 2     2 0.50
15: 4 3     2 0.50
16: 4 4     4 1.00
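For readers without mltools, the tables above can be cross-checked in base R. This is only a sketch of the underlying definition, not the package's implementation; `joint_ecdf` and `grid` are names invented for this example.

```r
x <- c(0.3, 1.3, 1.4, 3.6)
y <- c(1.2, 1.2, 3.8, 3.9)

# Univariate case: the built-in ecdf evaluated at the upper bounds
cdf_x <- ecdf(x)(1:4)

# Hypothetical helper: joint ECDF at (a, b) is the fraction of rows
# with x <= a AND y <= b
joint_ecdf <- function(a, b) mean(x <= a & y <= b)

# expand.grid varies its first column fastest, so listing y first
# reproduces the row order of the table above (x outer, y inner)
grid <- expand.grid(y = 1:4, x = 1:4)
grid$CDF <- mapply(joint_ecdf, grid$x, grid$y)

print(cdf_x)
print(grid[, c("x", "y", "CDF")])
```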
Ben