I am trying to answer the following question: how much (binary) data do I need before my learner has seen every variable of the dataset at least once? In my set-up I feed the algorithm binary vectors (i.e. with every element equal to either 1 or 0). These vectors have a known 'density' (average number of ones) which, for the purpose of answering this question, is either constant across variables (acceptable) or follows a long-tailed distribution (better). I have tried to approach this through combinatorics, but that turned out to be harder than expected. I suppose this question must have been asked before, but I have not been able to find any references so far.
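
For concreteness, here is a minimal simulation sketch of my set-up under the constant-density assumption (the names and the values `n_features=100`, `theta=0.05` are arbitrary illustrations): it counts how many vectors are drawn before every feature has appeared at least once.

```python
import numpy as np

def samples_until_full_coverage(n_features, theta, rng):
    """Draw binary vectors of constant density theta until every
    feature has been 1 at least once; return the number of draws."""
    seen = np.zeros(n_features, dtype=bool)
    n_draws = 0
    while not seen.all():
        x = rng.random(n_features) < theta  # one binary vector
        seen |= x
        n_draws += 1
    return n_draws

rng = np.random.default_rng(0)
draws = [samples_until_full_coverage(100, 0.05, rng) for _ in range(1000)]
print(np.mean(draws), np.percentile(draws, 95))
```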

In "A theory of the Learnable" by Valiant, I read that:

Let $L(h,S)$ be the smallest integer such that in $L(h,S)$ independent Bernoulli trials, each with probability at least $h^{-1}$ of success, the probability of having fewer than $S$ successes is less than $h^{-1}$. [...] PROPOSITION: For all integers $S > 1$ and all real $h > 1$, $$L(h,S) \leq 2 h (S + \ln (h)).$$

This translates into an upper bound for my question, given that each feature is assumed to be drawn from an independent Bernoulli trial, but not into a lower bound. Does anyone know of related work that could point me towards a lower bound?
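
For reference, evaluating the right-hand side of that proposition is straightforward; the sketch below does just that, with illustrative values $h = 20$ (i.e. a per-trial success probability of at least 0.05) and $S = 1$. How exactly $h$ and $S$ should be mapped onto the feature-coverage question is of course part of what I am asking.

```python
import math

def valiant_upper_bound(h, S):
    """Right-hand side of Valiant's proposition: L(h, S) <= 2*h*(S + ln h)."""
    return 2 * h * (S + math.log(h))

# e.g. per-trial success probability at least 1/h = 0.05, S = 1 success needed
print(valiant_upper_bound(h=20, S=1))  # ~159.8 trials
```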

1 Answer

To answer my own question: it is easy to get a lower bound when you assume that all variables are uniformly distributed. The probability of this event (call it $A$) then becomes:

$$\begin{aligned} P(A) &= 1 - P(X_1 = 0, X_2 = 0, \ldots, X_n = 0) \\ &= 1 - \prod_i P(X_i = 0) \\ &= 1 - \left[ \binom{n}{k}\,\theta^{k} (1-\theta)^{n-k} \right]^m \\ &= \cdots \end{aligned}$$
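
As a numerical cross-check of the expression above, here is a simulation sketch that estimates $P(A)$ directly for a fixed number of samples $m$, assuming each of the $n$ variables is independently 1 with probability $\theta$ in every sample (the parameter values are arbitrary illustrations).

```python
import numpy as np

def coverage_probability(n_features, theta, m, n_runs=10_000, seed=0):
    """Monte Carlo estimate of P(A): every feature equals 1 at least once
    among m independent samples of constant density theta."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_runs):
        X = rng.random((m, n_features)) < theta  # m binary vectors
        hits += X.any(axis=0).all()              # all features seen?
    return hits / n_runs

print(coverage_probability(n_features=100, theta=0.05, m=150))
```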

The solution for non-uniform densities can be found by compounding the Bernoulli distribution with a Beta prior.
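
One possible reading of that compounding step, as a simulation sketch: draw a per-feature density $\theta_i$ from a Beta prior and then proceed as before (the Beta parameters $a=0.5$, $b=9.5$ are an arbitrary choice giving a mean density of 0.05 with a heavily skewed spread of feature densities).

```python
import numpy as np

def coverage_probability_beta(n_features, a, b, m, n_runs=10_000, seed=0):
    """Same estimate as above, but each feature's density theta_i is drawn
    from a Beta(a, b) prior instead of being constant."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_runs):
        theta = rng.beta(a, b, size=n_features)  # one density per feature
        X = rng.random((m, n_features)) < theta  # m binary vectors
        hits += X.any(axis=0).all()
    return hits / n_runs

print(coverage_probability_beta(n_features=100, a=0.5, b=9.5, m=500))
```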
