I have the following problem in case someone has an idea about how to solve this.

Assume three experiments that refer to the same population for a random variable $X$.

In the first experiment, I observe samples with values $x \in \{1,2,3\}$ (frequencies $n_i^1$, $i \in \{1,2,3\}$) but no higher values because of right censoring. In the second experiment, I observe samples with values $x \in \{1,2,3,4,5\}$ (frequencies $n_i^2$), and in the third I observe samples with values $x \in \{1,\ldots,10\}$ (frequencies $n_i^3$).

I would like to obtain the empirical distribution function of the population in $\{1,\ldots, 10\}$ by making use of all the information. What would be a good way of combining the frequencies from all the experiments?

Firebug
user90772
  • Can you tell e.g. between '3' & '3 or more' in the first experiment? Or are all values of '3' equivalent to 'more than 2'? – Scortchi - Reinstate Monica Oct 06 '15 at 11:01
  • Thank you. I do have a frequency which is '>3' as well in the first one, '>5' in the second one and '>10' in the third one and actually these frequencies are much larger than what I have in 1...10 so I would like to focus in the range [1,10] – user90772 Oct 06 '15 at 11:29
  • Maximum likelihood is easy to apply also in the censored case! – kjetil b halvorsen Oct 06 '15 at 13:20
  • Thank you @kjetilbhalvorsen. I am not a statistician and quite new in the field. Could you give me some references (maybe some R package to play with) for non-parametric density estimation under right-censoring using ML? Is the fact that most of my data is censored affecting the estimation? – user90772 Oct 06 '15 at 13:36
  • I will try to write an answer, wait a little bit ... – kjetil b halvorsen Oct 06 '15 at 13:37
  • Closely related questions whose answers can be applied to address this one include http://stats.stackexchange.com/questions/23860, http://stats.stackexchange.com/questions/60256, and http://stats.stackexchange.com/questions/34882. – whuber Oct 06 '15 at 13:45
  • Thank you @whuber , I am still trying to understand the connection, probably need to think harder – user90772 Oct 07 '15 at 08:40
  • I did not understand the connections either ... – kjetil b halvorsen Oct 07 '15 at 17:52
  • @Kjetil These are aggregated ("binned") data. The links I provide give detailed methods to estimate distributions (or equivalently, any of their properties or parameters) from such data. Your answer here is a specific application of the methods described in each of those other threads. – whuber Oct 07 '15 at 19:54

2 Answers

If I have understood this correctly, you are sampling on three occasions from the same population, defined by a count distribution on the positive integers. Let that distribution be defined by $P(X=i)=\pi_i, i=1,2,\dots$.

The counts on the first occasion are $N_{1j}, j=1,2,3$, where $N_{11}$ is the number of ones observed, and so on, while the last count $N_{13}$ counts values of three or larger. Define the counts on the second and third occasions in the same way, where the last count again includes "... or larger". Then we can write the log-likelihood function as $$ \ell = (N_{11}+N_{21}+N_{31}) \log \pi_1 + (N_{12}+N_{22}+N_{32}) \log \pi_2 + (N_{23}+N_{33}) \log \pi_3 + \dots + N_{13} \log(1-\pi_1-\pi_2) + N_{25} \log(1-\pi_1-\pi_2-\pi_3-\pi_4) + N_{3,10}\log(1-\pi_1-\dots -\pi_9) $$ This log-likelihood should then be maximized over the unknown parameters. That can be done non-parametrically, as written above, or with some parametric model for the $\pi_i$'s. Continuing with the non-parametric case, we first find the partial derivative with respect to $\pi_1$, which is $$ \frac{\partial \ell}{\partial \pi_1} =\frac{N_{11}+N_{21}+N_{31}}{\pi_1} - \frac{N_{13}}{1-\pi_1-\pi_2} - \frac{N_{25}}{1-\pi_1-\dots -\pi_4} - \frac{N_{3,10}}{1-\pi_1-\dots -\pi_9} $$ The other partial derivatives can be calculated in the same manner, all set to zero, and the resulting equations solved. I leave that step as an exercise.
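
For readers who want to check the exercise numerically, here is a minimal Python sketch of one way to solve the score equations. It is not part of the original answer. It assumes the coding used above (the last count on each occasion means "this value or larger") and relies on the fact that this likelihood factorizes in the discrete hazards $h_i = P(X=i \mid X \ge i)$, so each $h_i$ is estimated by (number observed equal to $i$) / (number known to be at least $i$). The function name `pooled_npmle` and all counts are hypothetical, chosen only for illustration.

```python
import numpy as np

def pooled_npmle(exact_counts, censored_counts, n_values=9):
    """Non-parametric MLE of pi_1..pi_{n_values} from pooled censored counts.

    exact_counts    : list of sequences, one per occasion; entry k-1 is the
                      number of observations exactly equal to k.
    censored_counts : list of (c, n) pairs; n observations known only to be >= c.
    """
    # d[i-1] = pooled number of observations exactly equal to i
    d = np.zeros(n_values)
    for counts in exact_counts:
        counts = np.asarray(counts, dtype=float)
        d[:counts.size] += counts

    # at_risk[i-1] = number of observations known to satisfy X >= i
    at_risk = np.cumsum(d[::-1])[::-1].copy()   # exact observations >= i
    for c, n in censored_counts:
        at_risk[:c - 1] += n                    # ">= c" is at risk for i = 1..c-1

    # hazard estimate, then pi_i = h_i * prod_{j<i} (1 - h_j)
    hazard = np.divide(d, at_risk, out=np.zeros_like(d), where=at_risk > 0)
    surv_before = np.concatenate(([1.0], np.cumprod(1.0 - hazard)[:-1]))
    return hazard * surv_before

# Hypothetical counts (NOT from the question), coded as in the answer:
# occasion 1: N_11, N_12 exact and N_13 = 40 observations ">= 3", and so on.
exact = [[5, 3], [8, 6, 4, 2], [10, 7, 5, 4, 3, 2, 2, 1, 1]]
censored = [(3, 40), (5, 60), (10, 90)]

pi_hat = pooled_npmle(exact, censored)
cdf_hat = np.cumsum(pi_hat)   # estimated P(X <= i) for i = 1..9
print(np.round(pi_hat, 4))
print(np.round(cdf_hat, 4))
```

The remaining mass `1 - cdf_hat[-1]` is the estimated probability of a value of 10 or larger. When most observations fall in the "or larger" cells, as in the question, the estimates for the larger values rest on few exact observations and will be correspondingly noisy.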

kjetil b halvorsen
This is exactly the situation that Kaplan-Meier curves are designed for: in fact, they are a generalization of the empirical distribution function that allows for right censoring.

If you are familiar with R, they can be fit using the survfit function in the survival package.
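
To illustrate what the product-limit (Kaplan-Meier) estimate computes on data like these, here is a small from-scratch Python sketch. It is not the `survfit` implementation. It assumes the coding described in the asker's comment: exact values up to a cutoff, plus a count of observations known only to exceed that cutoff, which are treated as censored at the cutoff. The helper names `expand` and `kaplan_meier_edf` and all counts are hypothetical.

```python
import numpy as np

def expand(exact_counts, n_beyond):
    """Turn one experiment's frequencies into (value, event) records.

    exact_counts[k-1] observations equal k (event = 1); n_beyond observations
    are only known to exceed the cutoff len(exact_counts) (event = 0).
    """
    cutoff = len(exact_counts)
    values = np.repeat(np.arange(1, cutoff + 1), exact_counts)
    events = np.ones(values.size, dtype=bool)
    values = np.concatenate([values, np.full(n_beyond, cutoff)])
    events = np.concatenate([events, np.zeros(n_beyond, dtype=bool)])
    return values, events

def kaplan_meier_edf(values, events):
    """Product-limit estimate of F(x) = P(X <= x) at each observed event value."""
    surv = 1.0
    cdf = {}
    for t in np.unique(values[events]):
        at_risk = np.sum(values >= t)            # still under observation at t
        observed = np.sum((values == t) & events)
        surv *= 1.0 - observed / at_risk
        cdf[int(t)] = 1.0 - surv
    return cdf

# Hypothetical frequencies (NOT from the question): two experiments
v1, e1 = expand([2, 1, 1], n_beyond=6)        # values 1..3 and ">3"
v2, e2 = expand([1, 1, 1, 1, 1], n_beyond=5)  # values 1..5 and ">5"

values = np.concatenate([v1, v2])
events = np.concatenate([e1, e2])
edf = kaplan_meier_edf(values, events)
print(edf)
```

Pooling the records from all experiments before estimating is what lets the experiments with a higher cutoff inform the upper part of the distribution, while every experiment contributes to the lower part.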

Cliff AB