In their paper, Martineau & Finin describe a new approach called Delta TF-IDF. Instead of weighting a feature only by how rare it is across the corpus, they weight it by how biased it is toward one of two corpora.
They do this by taking the difference between the word's TF-IDF scores in the positive and the negative training corpora.
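To make sure I understand the weighting, here is a minimal sketch of how I read it: the term count in the document, multiplied by the difference of the term's IDF in the positive and in the negative corpus. The function name `delta_tfidf`, the toy corpora, and the add-one smoothing are my own assumptions for illustration; this is not the code of the library I use below.

```python
import math
from collections import Counter


def delta_tfidf(doc_tokens, pos_docs, neg_docs):
    """Return {term: Delta TF-IDF score} for one tokenized document.

    pos_docs and neg_docs are lists of tokenized training documents.
    """
    n_pos, n_neg = len(pos_docs), len(neg_docs)
    # Document frequency: number of documents in each corpus containing the term.
    pos_df = Counter(t for doc in pos_docs for t in set(doc))
    neg_df = Counter(t for doc in neg_docs for t in set(doc))

    scores = {}
    for term, count in Counter(doc_tokens).items():
        # Add-one smoothing is an assumption here; it avoids division by zero
        # for terms that never occur in one of the corpora.
        idf_pos = math.log2((n_pos + 1) / (pos_df[term] + 1))
        idf_neg = math.log2((n_neg + 1) / (neg_df[term] + 1))
        scores[term] = count * (idf_pos - idf_neg)
    return scores


# Hypothetical toy corpora, only to show the sign of the scores.
pos_docs = [["great", "movie"], ["great", "plot"]]
neg_docs = [["bad", "movie"], ["bad", "plot"]]
scores = delta_tfidf(["great", "great", "movie"], pos_docs, neg_docs)
print(scores)
```

With this sign convention, a term that occurs equally often in both corpora ("movie") gets a score of 0, and a term that occurs mostly in the positive corpus ("great") gets a negative score. Either way, the definition needs exactly two corpora.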
This leads to my question: can I use this vectorizer for my multi-class classification problem? In my case I am trying to predict a star rating from 1 to 5, so there are no negative labels, yet it still outperforms the normal TF-IDF. I'm not sure why, or how this is even possible, because as I understand it Delta TF-IDF is meant only for binary classification.
star = imdb['rating'].tolist()  # star ratings (float)
review = imdb['review'].tolist()  # review texts (string)
review = [str(i) for i in review]  # make sure every review is a string
from sklearn_deltatfidf import DeltaTfidfVectorizer
vectorizer = DeltaTfidfVectorizer(lowercase=True, analyzer='word', stop_words='english', ngram_range=(1, 2), max_features=1000)
train_data_features = vectorizer.fit_transform(review, star)
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(train_data_features, star, test_size=0.33, random_state=42)
classifier = LinearSVC()
classifier.fit(X_train, y_train)
preds = classifier.predict(X_test)
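# `report` is not defined in this snippet; assuming scikit-learn's classification_report was meant
from sklearn.metrics import classification_report
print(classification_report(y_test, preds))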