
I am trying to do sentiment analysis on a corpus of product reviews. My corpus contains 50,000 samples, of which I take 70% for training and 30% for testing. I discretized the 5-star rating to 3 categories as follows:

  • [0, 2] = negative
  • [3] = neutral
  • [4, 5] = positive

As a consequence of this discretization, I see that 42,702 reviews are positive, 3,230 are negative and 4,068 are neutral.

As features for the classifier, I am using unigrams together with the top 1000 bigram collocations, built using nltk. As the classifier, I am using MultinomialNB from sklearn.
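Roughly, my pipeline looks like this (a simplified sketch, not my exact code; `texts` and `labels` stand in for how I actually load the reviews and their discretized labels):

```python
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# `texts` is the list of 50,000 review strings, `labels` the pos/ntl/neg labels.
tokenized = [nltk.word_tokenize(text.lower()) for text in texts]

# Top 1000 bigram collocations over the whole corpus, scored with chi-squared.
finder = BigramCollocationFinder.from_documents(tokenized)
top_bigrams = set(finder.nbest(BigramAssocMeasures.chi_sq, 1000))

def to_document(tokens):
    # Unigrams plus any of the selected bigram collocations found in the review.
    features = list(tokens)
    features += ["_".join(bg) for bg in nltk.bigrams(tokens) if bg in top_bigrams]
    return " ".join(features)

docs = [to_document(tokens) for tokens in tokenized]
X_train, X_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.3, random_state=0)

vectorizer = CountVectorizer()
clf = MultinomialNB()
clf.fit(vectorizer.fit_transform(X_train), y_train)
y_pred = clf.predict(vectorizer.transform(X_test))
```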

If I evaluate this classifier on my test set, I get the following metrics:

accuracy: 0.8524331858675231
pos precision: 0.8534660677431104
pos recall: 0.9985795454545454
neg precision: 0.19230769230769232
neg recall: 0.0046641791044776115
ntl precision: 1.0
ntl recall: 0.0007412898443291327

As you can see, precision and recall are high, but only for the pos class. If I look at the actual predictions, the classifier always predicts pos.

Is this because of the uneven distribution of positive/negative/neutral reviews? How could I further improve upon this?

JNevens
  • It's really because you are not predicting probabilities. The correct thing to do is to evaluate your model on the basis of its ability to predict the *probability* of positive class membership. Then, if your problem requires a hard decision, to threshold these probabilities based on your understanding of the costs involved. – Matthew Drury Jun 04 '17 at 21:00
  • @MatthewDrury I disagree that probabilistic classification is more correct than ordinary classification. They're different problems. Notably, naive Bayes nominally produces probabilities, but it's much better as an ordinary classifier than a probabilistic classifier, because the probabilities tend to be inaccurate although they're on the right sides of 0.5. – Kodiologist Jun 04 '17 at 21:03
  • Then that is a point where we will have to disagree. IMO the correct first step is always probabilistic understanding, then one can use that information for decision making. One should separate concerns. – Matthew Drury Jun 04 '17 at 21:05
  • @MatthewDrury Do you mean to say that, in a case where all you care about is predicting a label, you should do probabilistic classification and then threshold the probabilities, even if this is slower, more complex, or less accurate in the given case than predicting classes directly? – Kodiologist Jun 04 '17 at 21:07
  • I have never been in a situation where all I care about is predicting a label. How would I assess my confidence in my decisions? How would I change my decisions when the context changes, when my resources change, or when my costs change? I'd rather do the extra work fitting a probabilistic model that adapts flexibly to my situation, and actually gives me knowledge of the process at play, than have to re-do a hard classifier when my situation inevitably changes and be left with nothing but "I think this is a thing of class A but I can't tell you how sure I am". – Matthew Drury Jun 04 '17 at 21:10

2 Answers


As the classes in a classification problem become more imbalanced, the problem of predicting the most likely class becomes easier. When 85% of the sample is one class, you can achieve 85% accuracy by guessing the modal class every time. So, to get much of an improvement in accuracy over such a trivial model, you need really informative features. If you don't happen to have such features, you're out of luck. Or rather, you're very lucky, since you can get high accuracy with a trivial model that requires no features at all.
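To make that baseline concrete (a sketch, assuming the usual sklearn-style train/test arrays rather than your exact variables):

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Always predicts the modal class ("pos" here) and ignores the features entirely.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
print(accuracy_score(y_test, baseline.predict(X_test)))  # roughly 0.85 for your split
```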

Often, models in an unbalanced classification problem are evaluated with a criterion other than simple accuracy, such as a weighted cost function. This often makes sense, but be sure to realize that this isn't a solution to the problem just described. Rather, it's a different problem, possibly a less trivial one where statistical models will be more useful.
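If you do want a single number that isn't dominated by the majority class, macro-averaged or balanced metrics are one common choice (illustrative only; `y_pred` stands in for your model's test-set predictions):

```python
from sklearn.metrics import balanced_accuracy_score, classification_report

# Both weight each class equally, so an always-"pos" model scores poorly.
print(balanced_accuracy_score(y_test, y_pred))  # mean of the per-class recalls
print(classification_report(y_test, y_pred))    # per-class precision, recall, F1
```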

Kodiologist

This is typical when you're predicting unbalanced classes where the signal is weak and hard to learn. Essentially, the "cost of being wrong" is causing the model to just choose pos every time, because that is correct 85% of the time and the model cannot do any better.

  1. Customize your cost matrix

The best solution is to change the cost matrix so that the penalty for guessing wrong is weighted less for the pos class and more for the others. I don't know how to do that in Python, but that is one of the things you should research.
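One rough approximation in scikit-learn is per-sample weighting: MultinomialNB has no cost-matrix argument, but its fit() accepts sample weights, so the rare classes can be up-weighted (a sketch only; `X_train_counts` and `y_train` stand in for your vectorized training data):

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.utils.class_weight import compute_sample_weight

# Up-weight the rare neg/ntl reviews so mistakes on them cost more during fitting.
weights = compute_sample_weight(class_weight="balanced", y=y_train)
clf = MultinomialNB()
clf.fit(X_train_counts, y_train, sample_weight=weights)
```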

  2. Use a classifier that specializes in ordered outcomes

Another option is to try ordered or cumulative logit models. There is an inherent order in the dependent variable (negative < neutral < positive) that the multinomial sklearn models ignore: they treat predicting low when the truth is high as no worse than predicting low when the truth is medium, and that poor assumption in the cost calculation is helping the model settle on the 85% solution.
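For example, statsmodels provides a cumulative (proportional-odds) logit via OrderedModel. A rough sketch, assuming a small dense feature matrix `X_train_dense` rather than your full sparse bag-of-words matrix, which it will not handle well:

```python
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

# Encode the labels as an ordered categorical: neg < ntl < pos.
y_ordered = pd.Categorical(y_train, categories=["neg", "ntl", "pos"], ordered=True)

model = OrderedModel(y_ordered, X_train_dense, distr="logit")
result = model.fit(method="bfgs", disp=False)
print(result.summary())
```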

  3. Undersample pos cases

If you only label 5-star reviews as positive instead of 4 and 5, you'll have fewer observations and also be training on the most extreme cases. This might yield better results, but remember that to properly evaluate the model you have to apply it to the full test set (with the 4s put back in) before judging its performance.
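A rough sketch of that relabeling, assuming pandas frames `train_df`/`test_df` with a `stars` column (your actual data structures will differ):

```python
def to_label(stars):
    # Same discretization as before; 4-star reviews are simply dropped from training.
    if stars <= 2:
        return "neg"
    if stars == 3:
        return "ntl"
    return "pos"

# Train only on the most extreme positives (5-star), plus all neg/ntl reviews.
train_sub = train_df[train_df["stars"] != 4].copy()
train_sub["label"] = train_sub["stars"].map(to_label)

# Evaluate on the untouched test set, with the 4-star reviews kept and labeled pos.
test_df["label"] = test_df["stars"].map(to_label)
```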

Josh