Semi supervised classification with unseen classes

Question

Consider the following problem. You have a large dataset, a small subset of which has labels from the classes A, B and C. I would like to classify the unlabelled items, each of which can come from classes A, B and C or (crucially) from other classes for which I have not yet seen any labels.

The ideal result would be a full labelling of the unlabelled subset with classes A, B, C, D, E, ...

Is this an example of semi-supervised classification and what are good approaches one can take to this kind of problem?

The possibility of "new" classes brings to mind Bayesian non-parametrics and specifically the [Chinese Restaurant process](https://en.wikipedia.org/wiki/Chinese_restaurant_process). I don't know enough about it to write an actual answer, though. — Aniko, Oct 30 '15 at 16:37

score 5 · Answer 1 · answered Sep 30 '15 at 17:30

Have a look at one-class classifiers. These are classifiers that can tell you that a new object does not agree with your training data.

I'd train a regular multi-class classifier and a one-class classifier on all your labelled data. If a new instance is rejected by the one-class classifier, have the user either assign it to one of the existing classes or assign it to a new class. Then update your classifiers.
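A minimal sketch of that setup, assuming scikit-learn and toy 2-D data. The blob centres, the choice of `LogisticRegression` and `OneClassSVM`, the `gamma` and `nu` values, and the helper name `classify_or_flag` are all illustrative assumptions, not something the answer prescribes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Toy labelled data: three known classes as well-separated 2-D blobs (assumed).
centers = {"A": (0.0, 0.0), "B": (6.0, 0.0), "C": (0.0, 6.0)}
X_train = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in centers.values()])
y_train = np.repeat(list(centers), 50)

# Regular multi-class classifier for the known classes.
multi_clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# One-class classifier fitted on all labelled data; it only models
# "looks like the training data" versus "does not".
novelty = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.05).fit(X_train)

def classify_or_flag(x):
    """Return a known class label, or "unknown" if the one-class model rejects x."""
    x = np.asarray(x, dtype=float).reshape(1, -1)
    if novelty.predict(x)[0] == -1:  # -1 means rejected as an outlier
        return "unknown"
    return str(multi_clf.predict(x)[0])

print(classify_or_flag([20.0, 20.0]))
```

Instances that come back as "unknown" are the ones to hand to the user for labelling before refitting both models.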

score 4 · Accepted Answer · answered Oct 28 '15 at 10:17

This is a very interesting framework.

Building one-vs-all classifiers will help you to identify A, B, C and "others". However, they won't be able to distinguish between D, E and the rest within "others".

I think that you should cluster your data in order to identify the clusters of the unknown classes. If you have a distance function at hand, you can evaluate how well it separates the known classes. Better still, you can learn a suitable distance function.

Let L be your labelled dataset. Build a pairs dataset from all pairs x, y in L. Let the target of the pairs dataset be the desired distance. If class(x) = class(y), the distance should be zero. If the classes differ, the required distance is a domain question (e.g., the distance between A and B might be smaller than the distance between B and C).

Now train a regressor on the pairs dataset.

Use the regressor as the distance function for your clustering algorithm. Hierarchical clustering algorithms seem to fit your needs well.

Run the clustering algorithm on the unlabelled data to get clusters of samples. If you also have one-vs-all classifiers for the known classes, run them on the samples. Clusters whose samples tend not to belong to the known classes are the candidates for the new classes.
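The pipeline above could be sketched as follows, assuming scikit-learn and SciPy. The toy blobs, the absolute-difference pair features, the 0/1 target distances, the random forest regressor, and the cut height of 0.5 are all assumptions made for illustration.

```python
import numpy as np
from itertools import combinations
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

def blob(center, n):
    return rng.normal(center, 0.5, size=(n, 2))

# Labelled set L: known classes A, B, C only (toy data, assumed).
X_lab = np.vstack([blob((0, 0), 20), blob((6, 0), 20), blob((0, 6), 20)])
y_lab = np.repeat(["A", "B", "C"], 20)

def pair_features(a, b):
    # Symmetric description of a pair, so that d(a, b) == d(b, a).
    return np.abs(a - b)

# Pairs dataset: target distance 0 for same class, 1 otherwise (assumed targets).
pairs = list(combinations(range(len(X_lab)), 2))
P = np.array([pair_features(X_lab[i], X_lab[j]) for i, j in pairs])
d = np.array([0.0 if y_lab[i] == y_lab[j] else 1.0 for i, j in pairs])
dist_model = RandomForestRegressor(n_estimators=100, random_state=0).fit(P, d)

# Unlabelled data: classes A and B plus an unseen class around (6, 6).
X_unl = np.vstack([blob((0, 0), 15), blob((6, 0), 15), blob((6, 6), 15)])

# combinations() yields pairs in the same order as SciPy's condensed
# distance vector, so the predictions can be passed to linkage() directly.
upairs = list(combinations(range(len(X_unl)), 2))
U = np.array([pair_features(X_unl[i], X_unl[j]) for i, j in upairs])
condensed = dist_model.predict(U)

Z = linkage(condensed, method="average")
clusters = fcluster(Z, t=0.5, criterion="distance")  # cut halfway between 0 and 1
print(len(set(clusters)))
```

Cutting the dendrogram by distance rather than by a fixed cluster count means the number of new classes does not have to be known in advance, although the cut height itself remains a judgment call.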

score 3 · Answer 3 · answered Sep 30 '15 at 16:17

I would treat this as a set of semi-supervised one-vs-all classification problems. That is, build a binary semi-supervised classifier for each known target class by treating its known instances as positives, instances with different known labels as negatives (if classes are mutually exclusive) and the remainder as unlabeled. A common and effective way of incorporating unlabeled instances in the learning process is to treat them as negatives with a very low misclassification penalty (far lower than that of known negatives).

Unlabeled instances that get rejected by all of the resulting classifiers are then likely part of some class you have no labels for. A subsequent step could then be to cluster all of these unclassified instances in an attempt to determine the number of classes you have no labels for, though this is far from easy.
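As a rough sketch of this scheme, assuming scikit-learn: the low misclassification penalty is approximated here with per-sample weights in `LogisticRegression`. The toy data, the weight of 0.05, the 0.5 acceptance cut-off, and the helper name `fit_one_vs_all` are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def blob(center, n):
    return rng.normal(center, 0.5, size=(n, 2))

# Labelled data for the known classes (toy data, assumed).
X_lab = np.vstack([blob((0, 0), 30), blob((6, 0), 30), blob((0, 6), 30)])
y_lab = np.repeat(["A", "B", "C"], 30)

# Unlabelled data: classes A and B plus an unseen class around (6, 6).
X_unl = np.vstack([blob((0, 0), 20), blob((6, 0), 20), blob((6, 6), 20)])

def fit_one_vs_all(target, unlabeled_weight=0.05):
    """Known instances of `target` are positives, other labelled instances are
    negatives, and unlabelled instances are negatives with a small weight."""
    X = np.vstack([X_lab, X_unl])
    y = np.concatenate([(y_lab == target).astype(int),
                        np.zeros(len(X_unl), dtype=int)])
    w = np.concatenate([np.ones(len(X_lab)),
                        np.full(len(X_unl), unlabeled_weight)])
    return LogisticRegression(max_iter=1000).fit(X, y, sample_weight=w)

classifiers = {c: fit_one_vs_all(c) for c in "ABC"}

# Probability of the positive class under each one-vs-all model.
scores = np.column_stack(
    [classifiers[c].predict_proba(X_unl)[:, 1] for c in "ABC"]
)
rejected_by_all = (scores < 0.5).all(axis=1)  # candidates for unseen classes
print(rejected_by_all.sum())
```

The rows flagged in `rejected_by_all` are the ones to pass on to a clustering step.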

Thank you, this looks like a good idea. One vague hope I had, however, is that you could use the labelled data, or even the unlabelled data you label using the labelled data, to derive a sensible distance function to use with the rest of the data. I am assuming Euclidean distance isn't enough. Is this possible? — graffe, Sep 30 '15 at 17:53

score 2 · Answer 4 · answered Nov 01 '15 at 00:58

Without example prototypes for the additional groups D, E, etc., identifying them as independent clusters may or may not be needed. If the data can be modeled well by something like a Gaussian mixture model, a well-fit model may indeed involve identification of these extra groups. However, such methods are largely unsupervised in nature; there is little need or use for the prototypes you have for A, B, C beyond seeding the collection.
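To make the "seeding" idea concrete, here is a sketch assuming scikit-learn's `GaussianMixture`. The labelled prototypes are used only to initialise three component means. The single extra component, its initialisation at the unlabelled point farthest from the known class means, and the unit initial precisions are assumptions of this example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

def blob(center, n):
    return rng.normal(center, 0.5, size=(n, 2))

# Labelled prototypes for the known classes (toy data, assumed).
known = ["A", "B", "C"]
X_lab = np.vstack([blob((0, 0), 20), blob((6, 0), 20), blob((0, 6), 20)])
y_lab = np.repeat(known, 20)

# Unlabelled data: classes A and B plus an unseen class around (6, 6).
X_unl = np.vstack([blob((0, 0), 20), blob((6, 0), 20), blob((6, 6), 20)])
X_all = np.vstack([X_lab, X_unl])

# Seed one component per known class with that class's labelled mean.
seed_means = np.array([X_lab[y_lab == c].mean(axis=0) for c in known])

# Assumed heuristic: seed one extra component at the unlabelled point
# that is farthest from every known class mean.
dists = np.linalg.norm(X_unl[:, None, :] - seed_means[None, :, :], axis=2)
extra_mean = X_unl[dists.min(axis=1).argmax()]

gmm = GaussianMixture(
    n_components=4,
    means_init=np.vstack([seed_means, extra_mean]),
    weights_init=np.full(4, 0.25),
    precisions_init=np.tile(np.eye(2), (4, 1, 1)),
    random_state=0,
).fit(X_all)

# Components 0..2 were seeded from A, B, C; component 3 is the extra group.
names = np.array(known + ["new"])
pred = names[gmm.predict(X_unl)]
print(sorted(set(pred)))
```

In practice the number of extra components is unknown; one option is to fit several values and compare them with a criterion such as `gmm.bic(X_all)`.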

Another method would be to fit a classification model on your labelled dataset and then apply it to the unlabelled set to obtain predicted classes. Define an uncertainty threshold, and use the items whose uncertainty exceeds it to seed a new collection of items, which becomes your unknown new category. Use these new populations to retrain your classifiers. This iterate-and-refit logic is essentially that of the Expectation-Maximization algorithm used for K-Means and Gaussian mixture models, but it can equally be implemented with neural networks or random forests as the classification engines. If you need to identify category structure within that newly identified category, you will need an unsupervised technique such as clustering.

The other way to identify the new category is to use "single-class" classifiers, whose purpose is to identify population outliers. Single-class SVMs are one example. I have also experimented with single-class decision trees. However, such methods would not use most of the data you have, and I would not expect superior results.
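One round of that threshold-and-retrain loop might look like the sketch below, assuming scikit-learn. The toy data, the logistic regression model, the definition of uncertainty as one minus the top predicted probability, and the threshold of 0.2 are all assumptions. In this toy data the unseen class sits between the known classes, which is where a linear classifier is least sure; novel points lying far from every decision boundary can still be classified confidently and would be missed by this approach.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def blob(center, n):
    return rng.normal(center, 0.5, size=(n, 2))

# Labelled data for the known classes (toy data, assumed).
X_lab = np.vstack([blob((0, 0), 30), blob((6, 0), 30), blob((0, 6), 30)])
y_lab = np.repeat(["A", "B", "C"], 30)

# Unlabelled data: classes A and B plus an unseen class around (3, 3).
X_unl = np.vstack([blob((0, 0), 20), blob((6, 0), 20), blob((3, 3), 20)])

# Step 1: fit on the labelled data and predict the unlabelled set.
clf = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
proba = clf.predict_proba(X_unl)

# Step 2: flag items whose uncertainty exceeds the threshold.
uncertainty = 1.0 - proba.max(axis=1)
flagged = uncertainty > 0.2

# Step 3: the flagged items seed a new category; retrain with it included.
X_new = np.vstack([X_lab, X_unl[flagged]])
y_new = np.concatenate([y_lab, np.full(flagged.sum(), "new")])
clf_retrained = LogisticRegression(max_iter=1000).fit(X_new, y_new)

print(flagged.sum(), list(clf_retrained.classes_))
```

Repeating steps 1 to 3 with `clf_retrained` gives the iterative version; the flagged pool can then be clustered if it appears to contain more than one new class.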
