21

For the sake of simplicity, let's say I'm working on the classic example of spam/not-spam emails.

I have a set of 20000 emails. Of these, I know that 2000 are spam, but I don't have any examples of not-spam emails. I'd like to predict whether the remaining 18000 are spam or not. Ideally, the outcome I'm looking for is a probability (or a p-value) that the email is spam.

What algorithm(s) can I use to make a sensible prediction in this situation?

At the moment, I'm thinking of a distance-based method that would tell me how similar my email is to a known spam email. What options do I have?
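To illustrate the kind of distance-based scoring I have in mind, here is a toy sketch (the corpora below are made up, nothing like my real emails): vectorize everything with TF-IDF and score each unknown email by its cosine distance to the nearest known spam email.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Hypothetical corpora standing in for the real emails.
spam = ["win a free prize now", "cheap meds online", "claim your reward"]
unknown = ["meeting moved to 3pm", "free prize waiting for you"]

vec = TfidfVectorizer().fit(spam + unknown)
S, U = vec.transform(spam), vec.transform(unknown)

# Cosine distance to the nearest known spam email: lower = more spam-like.
nn = NearestNeighbors(n_neighbors=1, metric="cosine").fit(S)
dist, _ = nn.kneighbors(U)
```

Here the second unknown email, which shares words with the spam set, should come out with a smaller distance than the first.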

More generally, can I use a supervised learning method, or do I necessarily need to have negative cases in my training set to do that? Am I limited to unsupervised learning approaches? What about semi-supervised methods?

enricoferrero
    Any learning algorithm you will use will predict all mails to be spam. You must have examples from the two categories for any sensible learning. – JohnRos Sep 27 '15 at 13:43
  • OK, that would rule out a classic supervised learning approach. But is there an algorithm that returns some sort of similarity metrics? E.g.: this email is very similar to a spam email, this other one is not. – enricoferrero Sep 27 '15 at 13:48
    @JohnRos not true, learning from positive and unlabeled data is a big topic in semi-supervised learning and it is nothing like you describe. – Marc Claesen Sep 27 '15 at 14:44
    @MarcClaesen: I was unfamiliar with this (very cool) line of research. I see that the magic is in the assumption that the unlabeled data is a mixture of spam and non-spam, which renders the problem solvable. – JohnRos Sep 27 '15 at 14:51
    @JohnRos exactly, and I agree with the coolness factor. What I find really cool is the fact we have recently been able to show [how to compute traditional performance metrics based on contingency tables (e.g. accuracy, precision, recall, ...) *without* known negatives](http://arxiv.org/abs/1504.06837)! – Marc Claesen Sep 27 '15 at 14:54

3 Answers

20

This is called learning from positive and unlabeled data, or PU learning for short, and is an active niche of semi-supervised learning.

Briefly, it is important to use the unlabeled data in the learning process as it yields significantly improved models over so-called single-class classifiers that are trained exclusively on known positives. Unlabeled data can be incorporated in several ways, the predominant approaches being the following:

  • somehow infer a set of likely negatives from the unlabeled data and then train a supervised model to distinguish known positives from these inferred negatives.
  • treat the unlabeled set as negative and somehow account for the label noise that is known to be present.
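As a rough illustration of the second approach, here is a minimal bagging-style sketch in Python/scikit-learn (not the actual bagging SVM or RESVM implementations; the data, subsample sizes, and number of rounds are toy stand-ins): repeatedly treat a random subsample of the unlabeled set as negative, train against the known positives, and average the resulting spam scores.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification

# Toy stand-in for the email problem: 200 known positives ("spam")
# and the rest unlabeled (a hidden mix of spam and non-spam).
X, y_true = make_classification(n_samples=2000, n_features=20,
                                weights=[0.7, 0.3], random_state=0)
pos_idx = np.where(y_true == 1)[0][:200]                 # known spam
unl_idx = np.setdiff1d(np.arange(len(y_true)), pos_idx)  # unlabeled

rng = np.random.default_rng(0)
scores = np.zeros(len(unl_idx))
n_rounds = 10
for _ in range(n_rounds):
    # Treat a random unlabeled subsample as "negative" and train a
    # classifier against the known positives.
    neg_sample = rng.choice(unl_idx, size=len(pos_idx), replace=False)
    X_train = np.vstack([X[pos_idx], X[neg_sample]])
    y_train = np.concatenate([np.ones(len(pos_idx)),
                              np.zeros(len(neg_sample))])
    clf = SVC(probability=True).fit(X_train, y_train)
    # Accumulate spam scores over the whole unlabeled set.
    scores += clf.predict_proba(X[unl_idx])[:, 1]

scores /= n_rounds  # averaged score in [0, 1]; higher = more spam-like
```

The averaged score can be thresholded or used directly as a ranking of how spam-like each unlabeled email is.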

I am active in this field, and rather than summarizing it here for you, I recommend reading two of my papers and the references therein to get an overview of the domain:

Marc Claesen
    Excellent! Many thanks for the references. The RESVM and bagged SVM seem to perform similarly in a PU learning setting. Can you recommend implementations (preferably in R) of either algorithm? Neither seems to be included in caret, unfortunately. – enricoferrero Sep 27 '15 at 20:18
    @enricoferrero Yes, they perform similarly unless there are false (known) positives, in which case RESVM significantly outperforms bagging SVM (I designed RESVM for that purpose, as the application I work on has false positives). I don't think there are R implementations readily available, but you can implement both of them quite easily by wrapping an SVM implementation like `kernlab` or `e1071`. Note that both bagging SVM and RESVM have a number of hyperparameters that you have to optimize, for which I recommend the [Optunity](http://docs.optunity.net/) library (has an R interface). – Marc Claesen Sep 28 '15 at 02:48
    @enricoferrero I do have a command-line implementation of RESVM available at https://github.com/claesenm/resvm, though that code isn't polished very well. That specific repo is written in Python and is used as a driver for the [EnsembleSVM](http://jmlr.org/papers/volume15/claesen14a/claesen14a.pdf) package. – Marc Claesen Sep 28 '15 at 02:56
  • It looks like another good option for a bagged SVM algorithm could be to use the [mlr](https://github.com/mlr-org/mlr) package in R with a [bagging wrapper](http://mlr-org.github.io/mlr-tutorial/release/html/bagging/index.html) around an SVM [learner](http://mlr-org.github.io/mlr-tutorial/release/html/learner/index.html). – enricoferrero Sep 29 '15 at 07:39
7

I am assuming there aren't many spam cases among your 18000 unknowns. To use a supervised learning approach, you need more than one category/class in your data. Since you know 2000 cases are spam, you can label the remaining 18000 as an 'unknown' category and train any supervised model to predict whether a case is spam or unknown. Then check the out-of-sample accuracy to see how well the model distinguishes between the two categories.

If it performs well, my assumption of few spam cases in the 'unknown' category is warranted. If it doesn't perform well, you'll have to use an unsupervised learner (like k-means) to cluster your data and identify separate homogeneous groups. Then identify which clusters contain most of the 2000 spam emails and which do not, label them spam and non-spam respectively, and proceed with a supervised learner as described earlier.
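A minimal sketch of this two-stage idea in Python/scikit-learn (the data, classifier choice, accuracy threshold, and cluster count are all arbitrary illustrations, not recommendations):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.cluster import KMeans

# Toy data standing in for the 2000 spam / 18000 unknown split.
X, hidden = make_classification(n_samples=2000, n_features=20, random_state=1)
y = np.zeros(len(X))                   # 0 = "unknown" category
y[np.where(hidden == 1)[0][:200]] = 1  # 1 = known spam

# Step 1: spam-vs-unknown classifier, checked out of sample.
clf = RandomForestClassifier(random_state=1)
acc = cross_val_score(clf, X, y, cv=5).mean()

if acc < 0.8:  # threshold is an arbitrary illustration
    # Step 2: cluster, then label clusters by their share of known spam.
    clusters = KMeans(n_clusters=5, n_init=10, random_state=1).fit_predict(X)
    spam_rate = [y[clusters == k].mean() for k in range(5)]
```

If step 1 separates the two labels poorly, the per-cluster spam rates in step 2 suggest which clusters to relabel as spam before retraining a supervised model.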

FelixNNelson
2

What the OP is talking about is a one-class classification task, which is a very challenging one.

There are many papers on this task across different research fields. I also wrote one, *An Efficient Intrinsic Authorship Verification Scheme Based on Ensemble Learning*, which is easy to adapt to classify spam/not-spam rather than authors. Give it a try and let me know if you need further details.

NeuroMorphing