Problem setting
I'm faced with a problem in which we have a large set of data points (100K), all of which are still unlabelled. These are to be used as input to a binary classifier at a later point in time. Since sampling is very costly, we need to select a small subset of this pool that is as informative/representative of the dataset as possible, and hopefully also useful for classification.
We could probably do a few (say 3) iterations of sampling (say 200 points in the first round, 1,000 in the second, and 1,000 in the third). It is my understanding that pool-based active learning is an ideal solution for such scenarios once we have some labels, but from what I've read, it doesn't tackle the first iteration, when no labels are known yet.
So my question is: how would you select the first 200 samples to label?
What I've looked at:
- Full factorial design of experiments, i.e., sampling uniformly across all dimensions (alas, this is not feasible here, since the 100+ dimensions make the number of combinations explode)
- Picking a random starting point and then repeatedly selecting whichever point is most informative next according to some predefined criterion (entropy/information gain/...); a rough sketch of what I mean follows this list
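For that second idea, since no labels exist yet, the only "informativeness" I can actually compute is geometric. A minimal sketch of the kind of greedy selection I mean, assuming plain Euclidean distance on the raw feature matrix `X` (both of which are just my assumptions), would be farthest-point selection:

```python
import numpy as np

def greedy_farthest_point(X, k, seed=0):
    """Start from a random point, then repeatedly add the point that is
    farthest (in Euclidean distance) from everything selected so far."""
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(X.shape[0]))]
    # distance of every point to its nearest already-selected point
    min_dist = np.linalg.norm(X - X[selected[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(selected)

# e.g. first_batch = greedy_farthest_point(X, 200) on the (100000, d) pool X
```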
Then again, it is probably smarter to pick those 200 points at once. Does anyone know the name of a technique that selects "the best representation we can have of this dataset" using only a subset of size k?
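To make the intent concrete, the following is roughly the kind of batch selection I have in mind; the choice of k-means (MiniBatchKMeans with k=200) and "nearest point to each centroid" is only my stand-in, since the proper technique/name is exactly what I'm asking about:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import pairwise_distances_argmin

def representative_subset(X, k=200, seed=0):
    """Cluster the pool into k groups and return, for each cluster centre,
    the index of the closest actual data point (duplicates are possible but
    rare; good enough for a sketch)."""
    km = MiniBatchKMeans(n_clusters=k, random_state=seed).fit(X)
    return pairwise_distances_argmin(km.cluster_centers_, X)

# e.g. first_batch = representative_subset(X, k=200)
```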