In terms of normalizing the probability mass function, it should be a relatively simple fix.
Define your current probability mass function $P(X=x)$. Suppose the sum of all its probabilities, $\sum_{i=1}^n P(X=x_i)$, equals some value $c$, where in your case $c$ is greater than 1. Note that when $\sum_{i=1}^n P(X=x_i)=c\neq 1$, this is not a valid probability mass function.
Define a new probability function $P'(X=x_i)=\frac{P(X=x_i)}{c}$ for all values $x_i$. Because you are normalizing by the sum of probabilities $c$, the new probability function $P'(X=x_i)$ will now sum to 1 and, provided it satisfies all other Kolmogorov Axioms, will be a valid probability function.
(What I described here actually applies for any $c>0$, so $c$ need not be greater than 1 for this to work; a negative $c$ would produce negative probabilities, but that cannot happen when the unnormalized masses are nonnegative.)
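To make the normalization concrete, here is a minimal Python sketch. The unnormalized masses and their values are made up purely for illustration; substitute your own:

```python
# Hypothetical unnormalized "probabilities" P(X = x_i); here they sum to c = 1.25 rather than 1.
unnormalized = {1: 0.50, 2: 0.40, 3: 0.25, 4: 0.10}

c = sum(unnormalized.values())                      # c = 1.25 in this made-up example
pmf = {x: p / c for x, p in unnormalized.items()}   # P'(X = x_i) = P(X = x_i) / c

print(sum(pmf.values()))   # 1.0 (up to floating-point error)
print(pmf)                 # {1: 0.4, 2: 0.32, 3: 0.2, 4: 0.08}
```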
As for the missing/lost values, those will take some additional treatment. If you are comfortable proceeding without that lost information and assuming, for example, that the proportion of 1s you see across the three datasets accurately reflects the true proportion of 1s you would expect in the aggregated dataset, then this normalization technique is all you need. (The same should hold for the proportion of 2s, 3s, and so on.)
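If you are willing to make that assumption, the estimated p.m.f. is just the pooled sample proportions. A minimal sketch, using three small made-up datasets in place of yours:

```python
from collections import Counter

# Three hypothetical datasets of observed values; the numbers are purely illustrative.
dataset_a = [1, 1, 2, 3, 1, 2]
dataset_b = [2, 2, 1, 3, 3]
dataset_c = [1, 3, 3, 2, 1, 1]

# Pool the counts and divide by the total number of observations,
# i.e. use the pooled sample proportions as the estimated p.m.f.
counts = Counter(dataset_a) + Counter(dataset_b) + Counter(dataset_c)
n_total = sum(counts.values())
pmf_hat = {x: k / n_total for x, k in sorted(counts.items())}

print(pmf_hat)   # {1: 0.411..., 2: 0.294..., 3: 0.294...}
```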
If you do not believe these values to be missing completely at random (and I strongly doubt that they are MCAR), then you may need to look into survival analysis or missing data imputation to estimate or impute these values before estimating the p.m.f. If you proceed with a p.m.f. that ignores the missingness or the censoring mechanism, then your p.m.f. will likely be an inaccurate representation of the true distribution of the variable of interest. (And if your p.m.f. misrepresents the variable of interest, the c.d.f. built from it will too.)
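To see why the c.d.f. inherits the problem: it is just the running sum of the p.m.f., so any distortion carries straight through. Continuing with the illustrative pmf_hat from the sketch above:

```python
from itertools import accumulate

# F(x) = sum of estimated P(X = x_i) over x_i <= x; any bias in pmf_hat appears in cdf_hat too.
support = sorted(pmf_hat)
cdf_hat = dict(zip(support, accumulate(pmf_hat[x] for x in support)))

print(cdf_hat)   # {1: 0.411..., 2: 0.705..., 3: 1.0}
```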