Suppose I have the following code for an NLTK Naive Bayes classifier.
It is a toy sentiment analysis example.
import nltk
from nltk import NaiveBayesClassifier as nbc
from nltk.tokenize import word_tokenize
from itertools import chain
training_data = [('I love this sandwich.', 'pos'),
                 ('This is an amazing place!', 'pos'),
                 ('I feel very good about these beers.', 'pos'),
                 ('This is my best work.', 'pos'),
                 ("What an awesome view", 'pos'),
                 ('I do not like this restaurant', 'neg'),
                 ('I am tired of this stuff.', 'neg'),
                 ("I can't deal with this", 'neg'),
                 ('He is my sworn enemy!', 'neg'),
                 ('My boss is horrible.', 'neg')]
vocabulary = set(chain(*[word_tokenize(i[0].lower()) for i in training_data]))
feature_set = [({i:(i in word_tokenize(sentence.lower())) for i in vocabulary},tag) for sentence, tag in training_data]
classifier = nbc.train(feature_set)
test_sentence = "This is the best band I've ever heard!"
featurized_test_sentence = {i:(i in word_tokenize(test_sentence.lower())) for i in vocabulary}
test_sentence1 = "Sun rises in the east"
featurized_test_sentence1 = {i:(i in word_tokenize(test_sentence1.lower())) for i in vocabulary}
tag = classifier.classify(featurized_test_sentence)
print("TP:", tag)
tag1 = classifier.classify(featurized_test_sentence1)
print("FP:", tag1)
The first test sentence is tagged “pos”, which is a true positive (TP). The second test sentence is also tagged “pos”, which is a false positive (FP).
My objective is this: given an unseen sentence that may be nowhere near anything in the training set, and whose predicted tag may therefore be an FP, how can I detect that automatically?
The confusion matrix, show_most_informative_features(), and prob_classify() are not helping me.
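To make the objective concrete, here is a rough sketch of the kind of check I have in mind: measure how much of a sentence is covered by the training vocabulary and flag it when the coverage is low. This is only one possible heuristic, not something NLTK provides. The names tokenize, vocabulary_coverage, and is_out_of_domain are hypothetical, the regex tokenizer is a stand-in for word_tokenize so that the snippet runs without NLTK data, and the 0.3 threshold is an arbitrary assumption.

```python
import re

def tokenize(text):
    # Hypothetical stand-in for nltk.word_tokenize: lowercase words and apostrophes only.
    return re.findall(r"[a-z']+", text.lower())

def vocabulary_coverage(sentence, vocabulary):
    """Return the fraction of the sentence's tokens found in the training vocabulary."""
    tokens = tokenize(sentence)
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in vocabulary)
    return float(known) / len(tokens)

def is_out_of_domain(sentence, vocabulary, min_coverage=0.3):
    # min_coverage is an arbitrary assumption; it would need tuning on held-out data.
    return vocabulary_coverage(sentence, vocabulary) < min_coverage

# A small subset of the training sentences, just to build a vocabulary.
training_sentences = ["I love this sandwich.",
                      "This is my best work.",
                      "My boss is horrible."]
vocabulary = set(token for s in training_sentences for token in tokenize(s))

for sentence in ["This is the best band I've ever heard!", "Sun rises in the east"]:
    print(sentence, vocabulary_coverage(sentence, vocabulary),
          is_out_of_domain(sentence, vocabulary))
```

With this small vocabulary the first sentence has a coverage of 0.375 and passes, while the second shares no tokens with the training data and is flagged. A sentence flagged this way could be reported as "unknown" instead of being passed to classifier.classify().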