
DNS resolution can sometimes return one of multiple IP addresses, for load balancing. I would like to enumerate the list of IPs for a service so I can whitelist traffic to a domain without performing an excessive number of reverse lookups. How many times should I receive a repeat record before I stop, to have a high probability of enumerating the whole collection?

More formally, I have a fixed set of unknown cardinality from which I can only select elements at random (assume each element is equally likely to be returned). How should I compute when to stop sampling?

There should be a formula with a tuneable confidence level, but I've not found it by searching yet. I appear to be searching for the wrong kinds of things ("unknown sample size, how many samples to saturate", "enumerate set unknown cardinality", etc). Set enumeration via random selection seems to me a fairly general problem.

Iiridayn
  • I wonder if the DNS problem is essentially the same as the waterbuck problem. Perhaps you could review these questions https://stats.stackexchange.com/search?q=waterbuck and explain whether DNS lookup is similar or different in terms of its probabilistic characterization. – Sycorax Sep 02 '20 at 03:35
  • Sounds like a sequential version of a mark and recapture problem. – Glen_b Sep 02 '20 at 06:01
  • @Glen_b not sure how I would generalize Lincoln–Petersen/Chapman/Bayesian – while they can be used to estimate the population size with two visits, if that estimate is high it doesn't provide a sufficient stopping criterion. I'm trying to collect a census of the population instead of estimating from a sample. – Iiridayn Sep 02 '20 at 19:52
  • That's where the "sequential" part comes in. – Glen_b Sep 04 '20 at 16:05

1 Answer

This is probably imprecise, but I believe it is roughly right: it establishes a loose upper bound.

We begin by assuming there is exactly one item we've not yet seen in the population: we've seen $n$ distinct items, so we assume there are $n+1$. Under that assumption we can compute the probability of what we've observed so far, i.e. of never having drawn the missing item. Once that probability drops below a threshold (say, 0.05), we can reject the hypothesis that there is a missing item at that confidence level.

If we've seen $n$ distinct items in $m$ samples, the probability of having missed the hypothesized $(n+1)$-th item in every one of those samples is $\alpha = (n/(n+1))^m$.
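
As a sketch of how this stopping rule could be applied (all naming here is my own; `sample` is a hypothetical zero-argument callable standing in for whatever returns one random record per query, e.g. a single DNS answer):

```
import random

def prob_missed(n, m):
    """Probability that a hypothesized (n+1)-th item was never drawn
    in m uniform samples, given n distinct items seen so far."""
    return (n / (n + 1)) ** m

def enumerate_until_confident(sample, alpha=0.05):
    """Sample until the chance that an unseen (n+1)-th item exists but
    was missed in every draw falls below alpha."""
    seen, m = set(), 0
    while True:
        seen.add(sample())
        m += 1
        if prob_missed(len(seen), m) < alpha:
            return seen

# Toy check against a known population of 4 "records".
population = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
found = enumerate_until_confident(lambda: random.choice(population))
print(f"{len(found)} distinct records found")
```

In the DNS case, `sample` would wrap a single resolver query; that wrapper is omitted here since resolver behaviour (caching, returning full record sets at once) varies.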

If we've seen 3 items in 5 samples, this gives $(3/4)^5 = 0.237$ as the probability of having missed a hypothetical 4th item. If we continue to see only 3 items in 11 samples, that probability drops to $(3/4)^{11} = 0.042$. At a commonly accepted false positive rate of 0.05, we can therefore stop after 11 samples that turn up only 3 items. Similarly, if we've found 50 items, we'd need 152 samples to be 95% confident there is not a 51st.
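
A quick arithmetic check of those figures (just verifying the numbers above, not part of the derivation):

```
import math

print((3/4)**5)    # 0.2373... : 3 items seen in 5 samples
print((3/4)**11)   # 0.0422... : 3 items seen in 11 samples
# Smallest m with (50/51)**m <= 0.05:
print(math.ceil(math.log(0.05) / math.log(50/51)))  # 152
```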

At the $\alpha = 0.05$ level, this reduces to a heuristic of roughly 3 samples per item on average. Solving $\alpha = (n/(n+1))^m$ for $m$ gives $m = -\log(1/\alpha)/\log(n/(n+1))$ (natural logarithms). $\log(1/0.05) \approx 3$, and $-1/\log(n/(n+1)) \approx n$ for large $n$; thus $m \approx 3n$ at $\alpha = 0.05$.
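
A small sketch comparing the exact required sample count with the $3n$ rule of thumb (the helper name is my own):

```
import math

def samples_needed(n, alpha=0.05):
    """Smallest m such that (n/(n+1))**m <= alpha."""
    return math.ceil(math.log(alpha) / math.log(n / (n + 1)))

for n in (3, 10, 50, 200):
    print(n, samples_needed(n), 3 * n)
# n=3   -> 11  vs   9
# n=10  -> 32  vs  30
# n=50  -> 152 vs 150
# n=200 -> 601 vs 600
```

The rule of thumb slightly undercounts, mostly because $-1/\log(n/(n+1))$ is closer to $n + 1/2$ than to $n$.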

An open issue: can we tighten this bound by also factoring in the possibility of $n+2$ through $n+\infty$ items, or would their contribution be insignificant?

Iiridayn
  • Quick note: alpha should really sum the miss probabilities from +1 through +inf. (3/4)^5 = 0.237, (3/4)^5 + (3/5)^5 = 0.315; seems right. Though, by induction, there will eventually be only one left. – Iiridayn May 05 '21 at 08:39