I have a sample of a variable $X$ whose distribution is unknown, and I would like to estimate the probability of $X$ taking particular values. How can I do that? I assume there is a nonparametric method, but I have been unable to find one so far. Could I achieve this with bootstrapping, maybe?

I have a vector of 7453 observations. The variable is discrete, takes only integer values, and is bounded below by 0 (inclusive), so it can take integer values in the interval $[0,+\infty)$. The values are counts (days until an event happens, with no censoring).

Here's a kernel density estimate produced with the density(x) function in R. It looks like a chi-squared distribution, but I've run ks.test() and rejected the null hypothesis.

[Image: kernel density estimate of the sample]

Valde
  • Can you provide more context? What are these data? Are they bounded by 0? How many data do you have? Are they continuous (in which case the *probability* of taking any given point value is 0)? – gung - Reinstate Monica Aug 19 '15 at 17:05
  • It looks log-normal to me – Uri Goren Aug 19 '15 at 18:05
  • What is x and what is y? Post the data in columnar format. – IrishStat Aug 19 '15 at 18:18
  • @gung It is a discrete variable which only takes integer values and yes, it is bounded by 0 (including it). It can take values in the interval $[0,+\infty)$. I have a vector of 7453 observations. – Valde Aug 20 '15 at 08:06
  • Are they counts? Where did they come from? Are they associated with a predictor (x) variable? – gung - Reinstate Monica Aug 20 '15 at 14:22
  • @gung Yes, they are counts (days until an event happens). They are not associated with other variables. Isn't there any nonparametric method to estimate its probabilities? – Valde Aug 20 '15 at 14:30
  • That's helpful information. Is there any censoring? – gung - Reinstate Monica Aug 20 '15 at 15:03
  • @gung No, there is no censoring, because all the observations fulfill the status "event happened". That is, every individual has a value; no individual in the sample is right-censored. – Valde Aug 20 '15 at 15:09

1 Answer

With more than $7,\!000$ observations, you are probably safe to use the proportion of observations at a given value as an estimate of the probability of drawing that value at random from the population. This will probably work fine up until the far right tail of your sample. If you wanted to smooth the estimates, you could use a moving window of, say, $\pm 1$ and re-scale. The downside here is that your last few values will probably not be well estimated, and of course you cannot get probabilities for values beyond your maximum observed value.
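As a rough illustration of the proportions and the $\pm 1$ moving window described above, here is a minimal sketch in Python (standard library only). The data are a simulated stand-in, not the asker's sample, and the helper names `empirical_pmf` and `smooth_pmf` are made up for this example. The window simply averages each value with its neighbours and is truncated at the two ends.

```python
import random
from collections import Counter


def empirical_pmf(x):
    """Sample proportions [P(X=0), ..., P(X=max(x))]."""
    counts = Counter(x)
    n = len(x)
    return [counts[k] / n for k in range(max(x) + 1)]


def smooth_pmf(pmf, half_width=1):
    """Average each value with its neighbours within +/- half_width,
    then re-scale so the smoothed estimates sum to 1."""
    m = len(pmf)
    smoothed = []
    for k in range(m):
        # The window is truncated at the ends of the observed range.
        window = pmf[max(0, k - half_width):min(m, k + half_width + 1)]
        smoothed.append(sum(window) / len(window))
    total = sum(smoothed)
    return [s / total for s in smoothed]


# Hypothetical stand-in for the 7453 observed counts (NOT the real data):
# the integer part of an exponential draw is a geometric-type count.
rng = random.Random(0)
x = [int(rng.expovariate(0.05)) for _ in range(7453)]

pmf = empirical_pmf(x)
pmf_smooth = smooth_pmf(pmf)
for k in range(5):
    print(k, round(pmf[k], 4), round(pmf_smooth[k], 4))
```

Both lists stop at the largest observed value, which is the limitation mentioned above: nothing beyond the sample maximum receives a probability.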

Another approach, which amounts to the same thing when there is no censoring, is to use the Kaplan-Meier estimator. This will give you the estimated survival function, which is one minus the CDF of your distribution. Subtracting the values from one gives the estimated CDF, and differencing the CDF gets you to the same place as above.
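To see why the two routes agree, here is a small hand-rolled sketch of the Kaplan-Meier product for the special case where every observation is an event. It is written in plain Python on simulated stand-in data; the function name `kaplan_meier_uncensored` is invented for this example, and in practice you would more likely use a survival-analysis package.

```python
import random
from collections import Counter


def kaplan_meier_uncensored(x):
    """Kaplan-Meier survival S(k) = P(X > k) for k = 0..max(x),
    assuming every observation is an event (no censoring)."""
    events = Counter(x)
    at_risk = len(x)          # observations with value >= current k
    s = 1.0
    survival = []
    for k in range(max(x) + 1):
        d = events[k]         # events occurring exactly at k
        s *= 1.0 - d / at_risk
        survival.append(s)
        at_risk -= d
    return survival


# Hypothetical stand-in data (NOT the asker's sample).
rng = random.Random(0)
x = [int(rng.expovariate(0.05)) for _ in range(7453)]

survival = kaplan_meier_uncensored(x)

# CDF = 1 - S; differencing the CDF gives the probability at each value.
cdf = [1.0 - s for s in survival]
pmf_km = [cdf[0]] + [cdf[k] - cdf[k - 1] for k in range(1, len(cdf))]

# Plain sample proportions, for comparison.
counts = Counter(x)
pmf_prop = [counts[k] / len(x) for k in range(max(x) + 1)]
print(max(abs(a - b) for a, b in zip(pmf_km, pmf_prop)))
```

The printed maximum difference should be at the level of floating-point rounding, since without censoring the product telescopes to the fraction of observations larger than each value.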

Bootstrapping is fine as an addition to the above, but it isn't really a nonparametric estimate of the population probability mass function. Instead, you are taking your sample as an estimate of the population PMF (see here). What bootstrapping will do is let you estimate the uncertainty of your estimated probability from the procedure above. This will probably work reasonably well, but will certainly work less well for those values where you have less data (i.e., the far right tail again).
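A minimal sketch of that use of the bootstrap, assuming a simple percentile interval is acceptable: resample the data with replacement, recompute the proportion at the value of interest each time, and read off the quantiles. The data are simulated stand-ins, and `bootstrap_ci`, the number of resamples, and the 95% level are all choices made for this illustration.

```python
import random


def bootstrap_ci(x, value, n_boot=1000, level=0.95, seed=1):
    """Percentile bootstrap interval for P(X = value).

    Returns (point_estimate, lower, upper)."""
    rng = random.Random(seed)
    n = len(x)
    # Proportion of `value` in each resample drawn with replacement.
    reps = sorted(rng.choices(x, k=n).count(value) / n for _ in range(n_boot))
    alpha = 1.0 - level
    lower = reps[int((alpha / 2) * (n_boot - 1))]
    upper = reps[int((1 - alpha / 2) * (n_boot - 1))]
    return x.count(value) / n, lower, upper


# Hypothetical stand-in data (NOT the asker's sample).
rng = random.Random(0)
x = [int(rng.expovariate(0.05)) for _ in range(7453)]

est, lo, hi = bootstrap_ci(x, value=5)
print(est, lo, hi)
```

For values that occur only a handful of times, many resamples will contain none of them, so the interval becomes coarse and less trustworthy, which is the far-right-tail caveat above.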

To extrapolate to probabilities for values that don't show up in your dataset (i.e., x values above your max), you will need to fit a parametric distribution. Even if you did get a good fit, this is still a somewhat sketchy endeavor, in that you can never know if you used the right distribution. To check the goodness of fit, you can compare the values from the fitted parametric distribution to the values calculated above. If you can live with that uncertainty, you want to look at distributions for count data. The default count distribution is Poisson, but your data are too spread out for that to be viable. The first thing I would look at is the negative binomial distribution, which can handle greater variance and has the advantage of being the distribution of the number of successes before a specified number of failures occurs. That is, it is a distribution of durations for count data, which sounds a lot like your situation. If you use R, ?fitdist in the fitdistrplus package can help you fit distributions like the negative binomial to your data.
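For a feel of what the fit-and-compare step involves, here is a rough Python sketch. It uses a method-of-moments fit rather than maximum likelihood, purely to stay short and dependency-free, so treat it as a simplification and not a substitute for a proper fit. The (size, prob) parameterization follows the one R uses for the negative binomial, the data are simulated stand-ins, and the function names are invented for this example.

```python
import math
import random
from collections import Counter


def fit_negbin_moments(x):
    """Method-of-moments estimates (size, prob) for a negative binomial
    with mean size*(1-prob)/prob and variance size*(1-prob)/prob**2."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / (n - 1)
    if var <= mean:
        raise ValueError("variance must exceed the mean for this fit")
    return mean ** 2 / (var - mean), mean / var


def negbin_pmf(k, size, prob):
    """P(X = k), computed on the log scale for numerical stability."""
    log_p = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
             + size * math.log(prob) + k * math.log(1.0 - prob))
    return math.exp(log_p)


# Hypothetical stand-in data (NOT the asker's sample).
rng = random.Random(0)
x = [int(rng.expovariate(0.05)) for _ in range(7453)]

size, prob = fit_negbin_moments(x)
counts = Counter(x)

# Compare fitted probabilities with the sample proportions.
for k in (0, 5, 20, 60):
    print(k, round(counts[k] / len(x), 4), round(negbin_pmf(k, size, prob), 4))

# Extrapolated probability of exceeding the largest observed value.
tail = 1.0 - sum(negbin_pmf(k, size, prob) for k in range(max(x) + 1))
print("P(X > max observed) under the fitted model:", tail)
```

The last number exists only because of the parametric assumption, so it inherits all the uncertainty about whether the negative binomial is the right family.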

gung - Reinstate Monica
  • Tomorrow I'll try that out and post the results. Thank you a lot for your great answer! – Valde Aug 20 '15 at 20:46