Problem setting
I'm faced with a problem in which we have a large set of data points (100K), all of which are still unlabelled. These are to be used as input to a binary classifier at a later point in time. Since sampling is very costly, we need to select a small subset of this pool that is as informative/representative of the dataset as possible, and hopefully also useful for classification.
We could probably do a few (say 3) iterations of sampling (say 200 points in the first round, 1,000 in the second, and 1,000 in the third). It is my understanding that pool-based active learning is an ideal solution for such scenarios once we have some labels, but from what I've read, it doesn't tackle the first iteration, when no labels are known yet.
So my question is: how would you select the first 200 samples to label?
What I've looked at:
- Full factorial design of experiments, i.e., sampling uniformly across all dimensions (alas, this is not feasible here, since the 100+ dimensions make the number of combinations explode)
- Picking a random starting point and then repeatedly selecting whichever point is most informative next according to some predefined criterion (entropy/information gain/...); a rough sketch of what I mean follows this list
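For that second idea, since no labels exist yet, the only "informativeness" I can actually compute is geometric. A minimal sketch of the kind of greedy selection I mean, assuming plain Euclidean distance on the raw feature matrix `X` (both of which are just my assumptions), would be farthest-point selection:

```python
import numpy as np

def greedy_farthest_point(X, k, seed=0):
    """Start from a random point, then repeatedly add the point that is
    farthest (in Euclidean distance) from everything selected so far."""
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(X.shape[0]))]
    # distance of every point to its nearest already-selected point
    min_dist = np.linalg.norm(X - X[selected[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(selected)

# e.g. first_batch = greedy_farthest_point(X, 200) on the (100000, d) pool X
```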
Then again, it is probably smarter to pick those 200 points at once. Does anyone know the name of a technique that selects "the best representation we can have of this dataset" using only a subset of size k?
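To make the intent concrete, the following is roughly the kind of batch selection I have in mind; the choice of k-means (MiniBatchKMeans with k=200) and "nearest point to each centroid" is only my stand-in, since the proper technique/name is exactly what I'm asking about:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import pairwise_distances_argmin

def representative_subset(X, k=200, seed=0):
    """Cluster the pool into k groups and return, for each cluster centre,
    the index of the closest actual data point (duplicates are possible but
    rare; good enough for a sketch)."""
    km = MiniBatchKMeans(n_clusters=k, random_state=seed).fit(X)
    return pairwise_distances_argmin(km.cluster_centers_, X)

# e.g. first_batch = representative_subset(X, k=200)
```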