Classifier for only one class

Question

In simple classification we have two classes: class-0 and class-1. In my data I only have samples for class-1 and none for class-0. I am thinking about building a model of the class-1 data. When new data come in, the model is applied to them and yields a probability of how well the new data fit the model. By comparing that probability with a threshold, I can filter out inappropriate data.

My questions are:

Is this a good way to work with such problems?
Can a RandomForest classifier be used for this case? Do I need to add artificial data for class-0, which I hope the classifier would treat as noise?
Are there any other ideas that may help with this problem?

Marc Claesen · Accepted Answer · 2013-11-13T07:13:41.880

This is certainly a valid approach, and several techniques support it. I am not sure whether random forests can do this, though.

Generating artificial data means making extra assumptions; don't do that if you don't have to.

One technique you may want to look into is so-called one-class SVM. It does exactly what you are looking for: it tries to build a model which accepts the training points and would reject points from other distributions.

Some references regarding one-class SVM:

Schölkopf, Bernhard, et al. "Estimating the support of a high-dimensional distribution." Neural computation 13.7 (2001): 1443-1471. This paper introduced the approach.
Tax, David MJ, and Robert PW Duin. "Support vector data description." Machine learning 54.1 (2004): 45-66. A different way to do the same thing, probably more intuitive.

Both of these approaches have been shown to be equivalent, at least for kernels such as the Gaussian (RBF) kernel where k(x, x) is constant. The first estimates a hyperplane that separates the training data from the origin in feature space with maximal margin. The second estimates a hypersphere of minimal radius in feature space that contains the training instances.
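For reference, the two optimization problems can be written side by side. This is a sketch in my own notation, which is an assumption and may differ from the papers: $\Phi$ is the feature map, $n$ the number of training points, $\xi_i$ slack variables, and $\nu$ and $C$ the respective regularization parameters.

```latex
% One-class SVM (hyperplane formulation)
\min_{w,\,\xi,\,\rho}\; \tfrac{1}{2}\lVert w\rVert^2 + \frac{1}{\nu n}\sum_{i=1}^{n}\xi_i - \rho
\quad\text{s.t.}\quad \langle w, \Phi(x_i)\rangle \ge \rho - \xi_i,\;\; \xi_i \ge 0

% Support vector data description (hypersphere formulation)
\min_{R,\,a,\,\xi}\; R^2 + C\sum_{i=1}^{n}\xi_i
\quad\text{s.t.}\quad \lVert \Phi(x_i) - a\rVert^2 \le R^2 + \xi_i,\;\; \xi_i \ge 0
```

In both cases the slack variables let a small fraction of training points fall outside the accepted region, so the boundary is not forced to enclose every training instance.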

One-class SVM is available in many SVM packages, including libsvm, scikit-learn (Python) and kernlab (R).
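As a minimal sketch of how this looks with scikit-learn's `OneClassSVM`: the synthetic data, the RBF kernel and the values of `nu` and `gamma` below are illustrative assumptions, not recommendations, and should be tuned for real data.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Synthetic stand-in for the class-1 training data (assumption for illustration).
rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))

# nu is an upper bound on the fraction of training points treated as outliers.
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
model.fit(X_train)

# One point near the training cloud, one far away from it.
X_new = np.array([[0.1, -0.2], [6.0, 6.0]])

labels = model.predict(X_new)            # +1 = accepted, -1 = rejected
scores = model.decision_function(X_new)  # signed distance to the boundary
print(labels, scores)
```

Note that `decision_function` returns a signed score rather than a probability. If you want your own threshold instead of the built-in boundary at zero, you can threshold `scores` directly.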

Tax's PhD thesis "One-class classification -- Concept-learning in the absence of counter-examples" is also available: http://homepage.tudelft.nl/n9d04/thesis.pdf — cbeleites unhappy with SX, Nov 13 '13 at 08:21
Short and precise! (+1) "Both of these approaches have been shown to be equivalent." - can you specify a reference / citation for that ? Is it https://scholar.google.de/scholar?q=Machine+Learning+For+Application-Layer+Intrusion+Detection&btnG=&hl=de&as_sdt=0%2C5 — Boern, Apr 24 '17 at 08:14

score 6 · Answer 2 · answered Nov 13 '13 at 08:41

Let me add some more possibilities:

The general idea is that thresholding the distance from the class lets you decide whether a sample belongs to that class, regardless of whether other classes exist.

Mahalanobis-Distance => QDA
SIMCA (Soft Independent Modeling of Class Analogies) uses distances in PCA score space.
SIMCA is common in the chemometric literature (though seldom really set up in a one-class way).
(SVMs are already treated in @Marc Claesen's answer)
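To illustrate the distance-threshold idea with the Mahalanobis distance, here is a minimal sketch using only NumPy. The synthetic data and the choice of the 95% quantile of the training distances as the threshold are assumptions for illustration; `mahalanobis_sq` is a hypothetical helper name.

```python
import numpy as np

# Synthetic stand-in for the single known class (assumption for illustration).
rng = np.random.default_rng(0)
X_train = rng.multivariate_normal([0.0, 0.0], [[2.0, 0.8], [0.8, 1.0]], size=500)

# Model the class by its mean and covariance.
mean = X_train.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X_train, rowvar=False))

def mahalanobis_sq(X):
    """Squared Mahalanobis distance of each row of X from the class model."""
    diff = np.atleast_2d(X) - mean
    return np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

# Accept anything closer than 95% of the training points (illustrative choice).
threshold = np.quantile(mahalanobis_sq(X_train), 0.95)

X_new = np.array([[0.5, 0.2], [8.0, -6.0]])
accepted = mahalanobis_sq(X_new) <= threshold
print(accepted)
```

The threshold sets the trade-off between rejecting genuine class members and accepting foreign samples, so it should be chosen with the application in mind.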

Richard G. Brereton: Chemometrics for Pattern Recognition (Wiley, 2009) has a whole chapter about one-class classification.
