Someone poured marked balls in my urn! Simplistically, I think this is a capture-recapture problem where, after balls were drawn from the urn and marked, somebody added an unknown number of marked balls (approximately 25% of the original draw) before the second drawing was taken. (Feel free to suggest other approaches if appropriate.)

In the actual scenario, I have two counts of the number of events. One count is derived solely from news reports, while the other is derived independently from both news reports and government records. (The events are deaths meeting specific criteria. Both news and government records are incomplete, and each count captures only part of the total covered by its sources.)

Thus, the total set $N$ is unchanged, but capture $n$ is drawn from subset $A$ (news reports) and capture $m$ is drawn from $A\cup B$ (news reports plus government records). We can assume $A$ and $B$ each are about 80% of $N$ and $A\cup B$ is maybe 95% of $N$ (very rough estimates).

If both counts were derived from news reports, this would be a (somewhat) straightforward capture-recapture problem. However, adding a second data source throws off the tallies, because the bias in what is newsworthy differs from the bias in what is recordable (to the government).

The data is broken down by month. I know one month has been exhaustively searched (at great expense), and I need to estimate the rest of the year.

So, say for one month (exhaustively searched):

$n=100$ (news sources)

$m=150$ (news + government)

$k=75$ (overlap)

Then for another month (non-exhaustive), $n=150, m=100, k=90$.

How should I estimate $N$ for the second month?

djstat

1 Answer

I am not sure I have fully understood the setup, but here are some impressions:

From a Capture-Recapture (C-R) perspective, your first issue is that you have two different sources ($A$ and $B$, news and government) with slightly different target populations. That should not be a problem, because:

  • Many C-R models can legitimately be used when the capture probabilities differ between sampling occasions.
  • C-R models are often used when the target populations of the samples are slightly different. I have never found a mathematical justification for this, but I think the estimates are considered to be robust in those cases.

The second issue is that your second sample comes from a compound study(?) with both news and government captures ($A \cup B$). Here things are more complicated, because capture probabilities for the second sample are non-homogeneous among the units. (I am assuming that you cannot single out the source of each capture in the second sample; otherwise a three-list model would be ideal.) Models accounting for individual (within-source) non-homogeneity do exist, but they require many sampling occasions and/or unit-specific covariates for estimating individual capture probabilities. However, according to Chao et al., "The applications of capture-recapture models to epidemiological data", Statistics in Medicine (2001), in a two-sample setting, even if one sample is "highly selective or extremely heterogeneous", "the usual Petersen estimator is valid" as long as the other sample is random (i.e. has homogeneous capture probabilities).
That being said, chances are that the Lincoln-Petersen estimator works well even in your situation. So my suggestion would be to run some simulations of your scenario, under your hypothesis:

$A$ and $B$ each are about 80% of $N$ and $A \cup B$ is maybe 95% of $N$

so that, with known $N$, you can evaluate the performance of the Lincoln-Petersen estimator and check whether it meets your desired level of accuracy.
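A simulation along these lines could be sketched as follows. This is only a minimal illustration under assumptions of my own, not a worked-out analysis: the membership shares (65% in both sources, 15% news only, 15% government only, 5% in neither) are one way to get 80% / 80% / 95%, and the capture probabilities `p_first`, `p_news` and `p_gov` are hypothetical placeholders to be replaced with whatever seems plausible. It also uses Chapman's bias-corrected variant of the Lincoln-Petersen estimator $\hat N = nm/k$, purely to avoid dividing by zero when the overlap is empty.

```python
import random
import statistics


def chapman_estimate(n, m, k):
    """Chapman's bias-corrected form of the Lincoln-Petersen estimator n*m/k."""
    return (n + 1) * (m + 1) / (k + 1) - 1


def simulate_once(N, rng, p_first=0.5, p_news=0.4, p_gov=0.4):
    """Simulate one month with known N and return the estimate of N.

    p_first, p_news and p_gov are assumed (hypothetical) capture probabilities.
    """
    # Assumed membership shares: A and B are each 80% of N, A-union-B is 95%.
    groups = rng.choices(
        ["both", "news_only", "gov_only", "neither"],
        weights=[0.65, 0.15, 0.15, 0.05],
        k=N,
    )
    n = m = k = 0
    for g in groups:
        in_news = g in ("both", "news_only")
        in_gov = g in ("both", "gov_only")
        # First capture: searched in news reports only.
        first = in_news and rng.random() < p_first
        # Second capture: found through news reports or government records.
        second = (in_news and rng.random() < p_news) or (
            in_gov and rng.random() < p_gov
        )
        n += first
        m += second
        k += first and second
    return chapman_estimate(n, m, k)


def evaluate(N=1000, runs=200, seed=0):
    """Return the mean and standard deviation of the estimates over many runs."""
    rng = random.Random(seed)
    estimates = [simulate_once(N, rng) for _ in range(runs)]
    return statistics.mean(estimates), statistics.stdev(estimates)


if __name__ == "__main__":
    mean_est, sd_est = evaluate()
    print(f"true N = 1000, mean estimate = {mean_est:.1f}, sd = {sd_est:.1f}")
```

Comparing the mean estimate with the known $N$ for several choices of the capture probabilities gives a feel for the bias. Since events that appear in neither source can never be captured, one would expect some underestimation of $N$. For reference, the plain estimator $nm/k$ gives $100 \cdot 150 / 75 = 200$ for the exhaustive month and $150 \cdot 100 / 90 \approx 166.7$ for the second month.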

ruggero