
I am working with a population in which each individual has, among other variables, 6 observed binary variables: $X_i \sim \text{Bernoulli}(p_i),\ i=1,\ldots,6$. I know the "true" value of the probability of success for each of these variables, $p_1,\ldots,p_6$. However, I do NOT know the dependencies between them.

I have a sample from this population, but I cannot observe individuals for which $x_i=0$ for all $i=1,\ldots,6$, and the population size is unknown.
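To make the setup concrete, here is a minimal R sketch of this observation scheme (the dependence structure and the actual sampling procedure are unknown, so independence and the probability values below are purely hypothetical, and the sketch only reproduces the truncation aspect):

p.true=c(.2,.3,.4,.5,.6,.7) #hypothetical "known" success probabilities
pop=sapply(p.true,function(p) rbinom(1e5,1,p)) #simulated population, independence assumed for illustration
obs=pop[rowSums(pop)>0,] #individuals with all x_i=0 are never observed
apply(obs,2,mean) #observed frequencies, no longer equal to p.true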

The main problem is that the sample estimates of the $p_i$ are not consistent with the known true values. This is due to the sampling procedure, which is beyond my control.

I want to resample from the sample I have so as to obtain a subsample that is representative, i.e. one in which the estimates of $p_1,\ldots,p_6$ are close enough to the known values. I want this so that I can infer other population characteristics from other variables, not from the $p_i$.

Is there any way to achieve this?

• @Xi'an I don't want to infer the dependence, I just stated the characteristics of my problem. I want to have a representative subsample in order to infer population characteristics from other observed variables. Maybe I should have made this last point clearer. – ami232 Jul 27 '15 at 09:38

1 Answer


If you observe a sample of $(X_1,\ldots,X_6)$ from which the $(0,0,\ldots,0)$'s have been removed, the probability distribution of this sample is the original one divided by $1-p_{0,0,0,0,0,0}$, because of the truncation/censoring. This means that the probability of observing $(a,b,c,d,e,f)$ as a realisation of $(X_1,\ldots,X_6)$ becomes $$\dfrac{p_{a,b,c,d,e,f}}{1-p_{0,0,0,0,0,0}}$$ for all $(a,b,c,d,e,f)\ne (0,0,0,0,0,0)$.

Therefore, the probability of observing $X_1=1$ in this truncated sample is (with all probabilities referring to the unconstrained model) $$\eqalign{ \sum_{(a,b,c,d,e)\in\{0,1\}^5} \dfrac{p_{1,a,b,c,d,e}}{ 1-p_{0,0,0,0,0,0}} &=\dfrac{\mathbb{P}(X_1=1,X_2,\ldots,X_6\text{ unconstrained})}{1-p_{0,0,0,0,0,0}}\\ &= \dfrac{\mathbb{P}(X_1=1) }{ 1-p_{0,0,0,0,0,0}}\\ &= \dfrac{p_1 }{ 1-p_{0,0,0,0,0,0}}} $$ with similar identities for $X_2,\ldots,X_6$.

From those identities, you can derive an estimate of $1-p_{0,0,0,0,0,0}$ by looking at the observed frequencies of $X_1=1$, $\mathfrak{f}_1$ say, of $X_2=1$, $\mathfrak{f}_2$ say, etc., and estimating $1-p_{0,0,0,0,0,0}$ by $$1-\hat{p}_{0,0,0,0,0,0}=\frac{1}{6}\left\{ \frac{p_1}{\mathfrak{f}_1}+\cdots+\frac{p_6}{\mathfrak{f}_6}\right\}$$ (which is biased but convergent).

Hence, given this estimate of $p_{0,0,0,0,0,0}$ and an observed (truncated) sample of size $N$, you just have to add the proper proportion of $(0,0,0,0,0,0)$'s to that sample, namely $$\dfrac{N}{1-p_{0,0,0,0,0,0}}-N$$ of them, which can be estimated by $$N\cdot\frac{1}{6}\left\{ \frac{\mathfrak{f}_1}{p_1}+\cdots+\frac{\mathfrak{f}_6}{p_6}-6\right\}$$ (which is unbiased).

Here is a short R code illustrating the approach:

#generate a sample of N vectors with independent components (iid example)
N=1e5
zprobs=c(.1,.9) #P(X_i=0)=.1 and p_i=P(X_i=1)=.9 for all i
smpl=matrix(sample(0:1,6*N,rep=TRUE,prob=zprobs),ncol=6)
#remove the all-zero rows (the truncation)
pty=apply(smpl,1,sum)
smpl=smpl[pty>0,]
#estimate the original sample size
ps=apply(smpl,2,mean)         #observed frequencies f_1,...,f_6
cor=mean(ps/rep(zprobs[2],6)) #estimates 1/(1-p_{0,...,0})
length(smpl[,1])*cor          #estimated original size N

Here is the result for one run:

> length(smpl[,1])*cor
[1] 99995.37

and if I switch to zprobs=c(.9,.1) some runs are as follows:

[1] 99848.33
[1] 100063.3
[1] 100365
[1] 100118.3
[1] 99923.33
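(The runs with zprobs=c(.9,.1) fluctuate more, presumably because $p_{0,0,0,0,0,0}=0.9^6\approx 0.53$ in that case, so roughly half of the sample has been truncated away.)

To complete the reconstruction step described above, here is a possible continuation of the code (reusing smpl, zprobs and cor from a run; nobs, nzeros and full are purely illustrative names), which appends the estimated number of all-zero rows and checks that the marginal frequencies then agree with the known probabilities:

#possible continuation: append the estimated number of all-zero rows
nobs=length(smpl[,1])
nzeros=round(nobs*(cor-1))          #estimate of N/(1-p_0)-N
full=rbind(smpl,matrix(0,nzeros,6)) #reconstructed "complete" sample
apply(full,2,mean)                  #should be close to zprobs[2]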

Obviously, a slight modification of the R code allows for dependent components as well.
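For instance, a minimal sketch of one such modification (using a single shared latent uniform to create strongly positively dependent components; this particular dependence structure is only an assumption for illustration) could be:

#illustrative dependent example: comonotone components driven by one latent uniform
N=1e5
p=c(.9,.8,.7,.6,.5,.4) #known marginal success probabilities
u=runif(N)
smpl=sapply(p,function(pi) as.integer(u<pi))
#remove the all-zero rows (here P(all zero)=1-max(p)=.1)
smpl=smpl[apply(smpl,1,sum)>0,]
#same estimator of the original size as before
ps=apply(smpl,2,mean)
length(smpl[,1])*mean(ps/p) #should be close to N

The identities above only involve the marginal probabilities of the truncated distribution, so the same correction applies whatever the dependence between the components.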

  • Ok, this looks like a great starting point. I'm a bit concerned about the independence assumption of $x_i\stackrel{\text{iid}}{\sim}\mathcal{B}(p_i)\quad i=1,\ldots,6$, but in the absence of dependence information I think it's the best we can get, right? – ami232 Jul 27 '15 at 10:38
  • I think I can't use MLE: I would obtain estimates for the sample I have, which isn't representative. – ami232 Jul 29 '15 at 09:42
  • That's not the problem; the problem is with "the estimates for the $p_i$ don't match the ones I know". By this I mean the natural sample estimates are not consistent with the true values. This is due to the way the sample is drawn from the population, which is beyond my control. I should have stated this more clearly; I'll edit the question to fix this. – ami232 Jul 29 '15 at 12:48
  • I think you got the problem wrong from the beginning, maybe my fault because I might not have explained it clearly, but I think the subsequent edits have fixed this. Have you read and tried to understand what my actual problem is? If you do, you will realise that I shouldn't run MLE because my sample isn't representative. This isn't due to the fact that I can't observe $(0,\dots,0)$ vectors, but to my estimates of the $p_i$ being very different from their true values. This is due to an irregular sampling method. – ami232 Jul 30 '15 at 10:41
  • If you don't describe your sampling procedure and model (or at least the problem you are trying to solve), there will be no chance whatsoever of you getting a reasonable answer. For example, if you are working with a multiple mark-recapture model (with 0/1 being observation at each of 6 times), with, of course, vectors of all zeros being unobserved, there are numerous models for this situation that depend on assumptions about the population. So we won't be able to help without more details. – AlaskaRon Nov 01 '15 at 21:43