I am working on a semantic analysis problem and would like to know whether anyone has managed to set a default value, say a probability of 0 or 0.5, for words and phrases that the machine learning algorithm has never seen. Using scikit-learn's classifiers and nltk's word_vectorizer, I have gotten predicted probabilities of 1.0 for words and phrases that do not appear in the training data. That output is misleading because it signals absolute confidence about inputs the model knows nothing about.

Would it help to add a dictionary of English words that are not in the training data, each with a target of zero?

What about non-words or misspellings? How do you penalize unknown or unseen inputs without explicitly penalizing every permutation of words or spellings that is not in the corpus?
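To make the question concrete, here is a minimal sketch of the kind of fallback I have in mind. It does not retrain anything. It checks how many tokens of the input are in the training vocabulary and returns a default probability when too few are known. Everything named here is hypothetical: `tokenize`, `predict_with_fallback`, the `min_known` threshold, the toy `vocabulary`, and the stand-in `overconfident_model`. With scikit-learn, I assume the vocabulary could be taken from a fitted `CountVectorizer` via its `vocabulary_` attribute, and the predictor would wrap the classifier's `predict_proba`.

```python
import re


def tokenize(text):
    # Lowercased word tokens; a simplification of what a real vectorizer does.
    return re.findall(r"[a-z0-9']+", text.lower())


def predict_with_fallback(text, vocabulary, predict_fn, default=0.5, min_known=0.5):
    """Return predict_fn(text), or `default` when too few tokens are known.

    vocabulary: set of tokens seen during training (assumed to be available).
    predict_fn: callable mapping text to a probability (stand-in for a model).
    min_known:  minimum fraction of tokens that must be in the vocabulary.
    """
    tokens = tokenize(text)
    if not tokens:
        # Nothing to base a prediction on.
        return default
    known = sum(1 for token in tokens if token in vocabulary)
    coverage = known / len(tokens)
    if coverage < min_known:
        # Mostly unseen input: do not trust the model's confidence.
        return default
    return predict_fn(text)


# Hypothetical stand-ins for a trained vocabulary and classifier.
vocabulary = {"good", "bad", "movie", "great"}


def overconfident_model(text):
    # Mimics the behavior described above: always fully confident.
    return 1.0


print(predict_with_fallback("great movie", vocabulary, overconfident_model))
print(predict_with_fallback("zxqv blorp", vocabulary, overconfident_model))
```

This handles misspellings and non-words without enumerating them, since anything outside the vocabulary counts as unknown. It is a guard around the model rather than a change to the model itself, so I am not sure it is the right approach.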

MyopicVisage

0 Answers