I'm trying to apply binary classification with OpenNLP. I have already successfully classified movies into different genres. The training data has the form:

Genre1 Description 1
Genre1 Description 2
Genre2 Description 3
etc.

When evaluating my model, I get a correct classification for another movie description:

Thriller : 2.1301979251908337E-14
Romantic : 0.9999999999999787
---------------------------------

Romantic : is the predicted category for the given sentence.

But how can I apply binary classification, where the result should be either yes or no? In my case, I would like to know whether a movie belongs to the romantic genre or not; I'm not interested in the other genres.

When I simply delete all the other genres from my training data:

Genre1 Description 1
Genre1 Description 2

I get the following error:

Training data must contain more than one outcome

This is my code (source):

public static void main(String[] args) {

    try {
        // read the training data
        Path path = Paths.get("C:", "Users");

        InputStreamFactory dataIn = new MarkableFileInputStreamFactory(new File(path + File.separator + "training" + File.separator + "en-movie-categoryNew.train"));
        ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
        ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);

        // define the training parameters
        TrainingParameters params = new TrainingParameters();
        params.put(TrainingParameters.ITERATIONS_PARAM, 10 + "");
        params.put(TrainingParameters.CUTOFF_PARAM, 0 + "");
        params.put(AbstractTrainer.ALGORITHM_PARAM, NaiveBayesTrainer.NAIVE_BAYES_VALUE);

        // create a model from traning data
        DoccatModel model = DocumentCategorizerME.train("en", sampleStream, params, new DoccatFactory());
        System.out.println("\nModel is successfully trained.");

        // save the model to local
        BufferedOutputStream modelOut = new BufferedOutputStream(new FileOutputStream(path + File.separator + "model" + File.separator + "en-movie-classifier-maxent.bin"));
        model.serialize(modelOut);
        modelOut.close();
        System.out.println("\nTrained Model is saved locally at : " + "model" + File.separator + "en-movie-classifier-maxent.bin");

        // test the model file by subjecting it to prediction
        DocumentCategorizer doccat = new DocumentCategorizerME(model);
        String[] docWords = "Afterwards Stuart and Charlie notice Kate in the photos Stuart took at Leopolds ball and realise that her destiny must be to go back and be with Leopold That night while Kate is accepting her promotion at a company banquet he and Charlie race to meet her and show her the pictures Kate initially rejects their overtures and goes on to give her acceptance speech but it is there that she sees Stuarts picture and realises that she truly wants to be with Leopold".replaceAll("[^A-Za-z]", " ").split(" ");
        double[] aProbs = doccat.categorize(docWords);

        // print the probabilities of the categories
        System.out.println("\n---------------------------------\nCategory : Probability\n---------------------------------");
        for (int i = 0; i < doccat.getNumberOfCategories(); i++) {
            System.out.println(doccat.getCategory(i) + " : " + aProbs[i]);
        }
        System.out.println("---------------------------------");

        System.out.println("\n" + doccat.getBestCategory(aProbs) + " : is the predicted category for the given sentence.");
    } catch (IOException e) {
        System.out.println("An exception in reading the training file. Please check.");
        e.printStackTrace();
    }
}
– gaout5

1 Answer

Most machine learning algorithms are not "hard" classifiers but "soft" classifiers. This means that instead of returning the predicted class (or a yes/no answer in the binary case), they return a score. The score is often the probability of the sample belonging to the class, as in your example. To convert the score into a hard classification, you need a decision function. The simplest thing you can do for binary classification is to use a 0.5 cutoff, i.e. probability > 0.5 means that the model predicted the positive class. Please notice, however, that the 0.5 threshold is quite arbitrary, as you can learn from the "Why P>0.5 cutoff is not 'optimal' for logistic regression?" thread.
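
Applied to the code in the question, a minimal sketch of such a decision function could look like the snippet below. It reuses the doccat and aProbs variables from the question and assumes the positive class is labelled "Romantic" in the training data; the 0.5 cutoff is the arbitrary one mentioned above.

    // Decision function on top of the probabilities returned by categorize():
    // look up the score of the positive class and compare it against a cutoff.
    double threshold = 0.5; // arbitrary cutoff, see the linked thread
    double romanticProb = 0.0;
    for (int i = 0; i < doccat.getNumberOfCategories(); i++) {
        if ("Romantic".equals(doccat.getCategory(i))) { // assumes "Romantic" is a label in the training data
            romanticProb = aProbs[i];
        }
    }
    boolean isRomantic = romanticProb > threshold;
    System.out.println(isRomantic ? "yes (romantic)" : "no (not romantic)");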

– Tim
  • Ok, I understand the classification part. But how can I modify my training data so that it accepts only one category (i.e. romantic)? I'm not interested in the other categories. So my question is a little OpenNLP-specific. – gaout5 Jun 30 '21 at 08:56
  • @gaout5 if you modify the training data so that it contains *only* the samples of a single class, then the only thing your model can learn from such data is that the training set *always* contained romantic movies, hence it will *always* predict the "romantic" class. Technically, it would not even get that far: training fails with the error you saw, because the classifier *always* predicts the probability of one class versus the other classes. If you just want to detect things that aren't romantic movies, you could use anomaly-detection algorithms instead. – Tim Jun 30 '21 at 09:01
  • Notice, however, that treating a classification problem as anomaly detection is usually [not the best idea](https://stats.stackexchange.com/questions/530508/multiclass-classification-vs-binary-classification-with-class-merging-predictio/530609#530609). – Tim Jun 30 '21 at 09:02 (see the class-merging sketch below)
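
The class merging discussed in the linked thread would mean giving the trainer exactly two outcomes instead of one. Below is a minimal sketch, with hypothetical file names and a hypothetical "NotRomantic" label, that relabels every non-romantic genre in the original training file, assuming one "Genre description..." sample per line as in the question:

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.List;

    public class RelabelTrainingData {

        public static void main(String[] args) throws IOException {
            // Hypothetical input/output paths; adjust them to your setup.
            Path in = Paths.get("en-movie-categoryNew.train");
            Path out = Paths.get("en-movie-romantic-binary.train");

            List<String> relabelled = new ArrayList<>();
            for (String line : Files.readAllLines(in, StandardCharsets.UTF_8)) {
                int firstSpace = line.indexOf(' ');
                if (firstSpace < 0) {
                    continue; // skip empty or malformed lines
                }
                String genre = line.substring(0, firstSpace);
                String description = line.substring(firstSpace + 1);
                // Merge all non-romantic genres into a single negative outcome.
                String label = genre.equals("Romantic") ? "Romantic" : "NotRomantic";
                relabelled.add(label + " " + description);
            }
            Files.write(out, relabelled, StandardCharsets.UTF_8);
        }
    }

Training on the relabelled file with the same code as in the question then yields exactly two categories, and the 0.5-cutoff decision from the answer can be applied to the "Romantic" probability.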