2

I’m having some difficulty understanding how abstention works in active learning. A teacher asked me to implement the active learning algorithm Query-by-Committee (QBC), in which a committee of models selects the most informative points to query from the oracle. I understand how the algorithm works, but he also asked me to implement a variant of QBC where the oracle can abstain from labelling some inputs.

I have read about abstention in active learning, but I’m unable to put things together. Moreover, I seem to find a contradiction with my teacher’s hints (that’s probably because I don’t understand how abstention works).

As I understand it, QBC finds the point the committee is most uncertain about (using vote entropy or average KL divergence, for instance) and asks the oracle for its label. But my teacher tells me that, in QBC with abstention, the oracle abstains on the most uncertain points… yet those are precisely the labels we want to obtain…
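For reference, here is a minimal sketch of the vote-entropy measure mentioned above, assuming a committee that casts hard votes (function name and example labels are my own, not from any particular library):

```python
import numpy as np

def vote_entropy(votes, n_labels):
    """Vote entropy of a committee's hard votes on a single point.

    votes: 1-D array of predicted labels, one entry per committee member.
    n_labels: total number of possible classes.
    Higher entropy means stronger disagreement among the members.
    """
    counts = np.bincount(votes, minlength=n_labels)
    probs = counts / len(votes)
    probs = probs[probs > 0]          # drop zero entries to avoid log(0)
    return -np.sum(probs * np.log(probs))

# Three members disagree completely -> maximal entropy, log(3) ~= 1.099
print(vote_entropy(np.array([0, 1, 2]), n_labels=3))
# All members agree -> zero entropy, the point is "certain"
print(vote_entropy(np.array([1, 1, 1]), n_labels=3))
```

Classic QBC then queries the oracle for the pool point whose committee votes maximise this quantity.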

Does somebody know the high-level idea behind QBC with abstention?

Shayan Shafiq
  • There are good reasons not to label data points with high uncertainty. But this is not referred to as "QBC with abstention"; I have not heard of this terminology. Can you share some reference on this? – Saleh Jan 10 '22 at 08:52
  • 1
    Actually, there isn’t… I don’t think there is anything on the internet, anyway. Here are the instructions my teacher gave me: https://www.icloud.com/iclouddrive/0eagSBgcZbSre6ZEOsNtk6fYg#Active_Learning_with_abstention He asks us to implement classic QBC and then QBC with abstention (last paragraph). This is the only thing I have about it. – Valentin Dusollier Jan 10 '22 at 21:26

1 Answer

1

There are good arguments to sample points corresponding to high uncertainties and good arguments not to.

You could think of these points as lying in sparsely populated regions of the feature space; since such points are rare in your training data, each member of the committee (i.e., the set of classifiers/regressors) will predict their labels rather arbitrarily. Hence, the members will disagree on the labels of these points. Adding these points to your training dataset will improve the performance of your final machine learning model in poorly populated regions of the feature space. This is an argument in favour of sampling points with high uncertainty.

However, data points lying in sparsely populated regions of the feature space can also be outliers. And in application, you don't care about the behaviour of your machine learning (ML) model on outliers; you care about its behaviour in densely populated regions of the feature space (assuming that, in production, your ML model will make predictions on data points drawn from the same distribution as the points in your pool. More on this can be found here). This is an argument against sampling points with high uncertainty.

Your question touches on the heart of active learning research: how to build an optimal active learning algorithm/query strategy. Recent research indicates that QBC-like algorithms are not the best we can do. Better strategies sample points with high uncertainty while not deviating too much from the true distribution of the data points in the pool, i.e., while not sampling too many outliers. The following paper provides a more formal discussion.
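One common way to combine the two criteria is information-density weighting (in the spirit of Settles and Craven's density-weighted query strategies): multiply each point's uncertainty by a measure of how "representative" it is of the pool, so that uncertain outliers are down-weighted. A rough sketch, with function name, similarity kernel, and `beta` parameter chosen for illustration:

```python
import numpy as np

def density_weighted_scores(pool, uncertainty, beta=1.0):
    """Combine per-point uncertainty with an information-density term.

    pool: (n, d) feature matrix of the unlabelled pool.
    uncertainty: (n,) array, e.g. the committee's vote entropy per point.
    beta: controls how strongly density dominates uncertainty.
    Returns one score per point; the strategy queries the argmax.
    """
    # Pairwise Euclidean distances within the pool.
    dists = np.linalg.norm(pool[:, None, :] - pool[None, :, :], axis=-1)
    # Average similarity of each point to the rest of the pool;
    # isolated points (likely outliers) get a small value.
    density = np.exp(-dists).mean(axis=1)
    return uncertainty * density ** beta

# A tight cluster near 0 plus one distant outlier: even if the outlier
# is slightly more uncertain, the dense points win the query.
pool = np.array([[0.0], [0.1], [0.2], [10.0]])
uncertainty = np.array([1.0, 1.0, 1.0, 1.1])
print(np.argmax(density_weighted_scores(pool, uncertainty)))
```

With `beta = 0` this reduces to plain uncertainty sampling, so the parameter interpolates between the two arguments above.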

Saleh