I am trying to do sentiment analysis on a corpus of product reviews. The corpus contains 50,000 samples, of which I use 70% for training and 30% for testing. I discretized the 5-star ratings into three categories as follows:
- [0, 2] = negative
- [3] = neutral
- [4, 5] = positive
As a consequence of this discretization, 42,702 reviews are positive, 3,230 are negative, and 4,068 are neutral.
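For clarity, the discretization step looks roughly like this (a simplified sketch; the variable names are illustrative, and the real corpus has 50,000 reviews):

```python
def discretize(stars):
    """Map a star rating on the [0, 5] scale to a coarse sentiment label."""
    if stars <= 2:
        return "neg"
    if stars == 3:
        return "ntl"
    return "pos"

# Toy examples standing in for the real corpus.
reviews = [("great product", 5), ("does the job", 3), ("broke after a day", 1)]
labeled = [(text, discretize(stars)) for text, stars in reviews]
```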
As features for the classifier, I am using unigrams together with the top 1000 bigram collocations, built using `nltk`. As a classifier, I am using `MultinomialNB` from `sklearn`.
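Roughly, the feature extraction looks like this (a simplified sketch of my setup; the `chi_sq` scoring measure and the helper name `extract_features` are illustrative, not necessarily exactly what I used):

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

def extract_features(words, top_bigrams):
    # Mark each unigram, plus any bigram from the review that made the
    # top-collocations list, as a present feature.
    features = {word: True for word in words}
    for bigram in zip(words, words[1:]):
        if bigram in top_bigrams:
            features[bigram] = True
    return features

# Toy word stream; in reality this is built from all training reviews.
all_words = "this product is great this product works".split()
finder = BigramCollocationFinder.from_words(all_words)
top_bigrams = set(finder.nbest(BigramAssocMeasures.chi_sq, 1000))

feats = extract_features("this product is great".split(), top_bigrams)
```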
If I evaluate this classifier on my test set, I get the following metrics:
- accuracy: 0.8524331858675231
- pos precision: 0.8534660677431104
- pos recall: 0.9985795454545454
- neg precision: 0.19230769230769232
- neg recall: 0.0046641791044776115
- ntl precision: 1.0
- ntl recall: 0.0007412898443291327
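These per-class numbers come from something equivalent to sklearn's `classification_report`; a confusion matrix makes the collapse onto one class immediately visible. A toy illustration with a degenerate all-`pos` predictor (the labels here are made up to mimic the imbalance):

```python
from sklearn.metrics import classification_report, confusion_matrix

# Toy labels mimicking the skew; the real test set has ~15,000 reviews.
y_true = ["pos"] * 8 + ["neg"] + ["ntl"]
y_pred = ["pos"] * len(y_true)  # degenerate classifier: always predicts "pos"

print(confusion_matrix(y_true, y_pred, labels=["pos", "neg", "ntl"]))
print(classification_report(y_true, y_pred,
                            labels=["pos", "neg", "ntl"], zero_division=0))
```

With this kind of predictor, `pos` recall is 1.0 and accuracy equals the `pos` class frequency, which is essentially the pattern in my metrics above.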
As you can see, precision and recall are high, but only for the `pos` class. If I look at the actual predictions, the classifier almost always predicts `pos`.
Is this because of the uneven distribution of positive, negative, and neutral reviews? How could I improve on this?