
In their paper, Martineau & Finin describe a new approach called Delta TF-IDF. Instead of weighting features only by how rare they are across documents, they weight them by how biased they are toward one corpus.

They do this by calculating the difference between a word's TF-IDF scores in the positive and negative training corpora.
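To make the difference of scores concrete, here is a minimal pure-Python sketch. It is not the paper's or the library's code: the function name `delta_tfidf`, the toy corpora, and the add-one smoothing are all assumptions made for illustration. Note that under this sign convention (positive-corpus score minus negative-corpus score), a term concentrated in the positive corpus gets a negative value, because it is less rare there.

```python
import math
from collections import Counter

def delta_tfidf(doc_tokens, pos_docs, neg_docs):
    """Return {term: delta score} for one tokenized document.

    pos_docs / neg_docs are lists of tokenized training documents.
    Add-one smoothing is an assumption made here to avoid division by zero.
    """
    n_pos, n_neg = len(pos_docs), len(neg_docs)
    # Document frequency of each term in each corpus
    df_pos = Counter(t for d in pos_docs for t in set(d))
    df_neg = Counter(t for d in neg_docs for t in set(d))
    scores = {}
    for term, tf in Counter(doc_tokens).items():
        idf_pos = math.log2((n_pos + 1) / (df_pos[term] + 1))
        idf_neg = math.log2((n_neg + 1) / (df_neg[term] + 1))
        # TF-IDF in the positive corpus minus TF-IDF in the negative corpus
        scores[term] = tf * (idf_pos - idf_neg)
    return scores

# Toy corpora (hypothetical data)
pos = [["great", "movie"], ["great", "acting"]]
neg = [["boring", "movie"], ["bad", "acting"]]
scores = delta_tfidf(["great", "boring", "movie"], pos, neg)
print(scores)
```

In this toy example "movie" appears equally often in both corpora and scores zero, while "great" and "boring" get scores of opposite sign, which is the bias signal that a plain TF-IDF weighting would not provide.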

This leads to my question: can I use this vectorizer for my multi-class classification problem? In my case I am trying to predict a star rating from 1 to 5, so there are no negative labels, yet it still outperforms the normal TF-IDF. I'm not sure why or how this is even possible, because as I understand it Delta TF-IDF is meant only for binary classification.

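from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn_deltatfidf import DeltaTfidfVectorizer

# imdb is assumed to be a pandas DataFrame with 'rating' (float) and 'review' (string) columns
star = imdb['rating'].tolist()
review = [str(i) for i in imdb['review'].tolist()]

vectorizer = DeltaTfidfVectorizer(lowercase=True, analyzer='word', stop_words='english', ngram_range=(1, 2), max_features=1000)

# Note: the vectorizer is fitted on all reviews and labels before the train/test split
train_data_features = vectorizer.fit_transform(review, star)

X_train, X_test, y_train, y_test = train_test_split(train_data_features, star, test_size=0.33, random_state=42)

classifier = LinearSVC()
classifier.fit(X_train, y_train)
preds = classifier.predict(X_test)

# report() was not defined in the original snippet; classification_report is assumed here
print(classification_report(y_test, preds))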

sklearn-deltatfidf

jonas00
  • The article you link to uses movie reviews to illustrate how Delta TFIDF works well on sentiment problems. Since your problem is also a movie problem, it's hard to see how the answer to your question is not obvious. Can you edit your question to explain how your movie review problem is different from the movie review problem in the article that you link? – Sycorax Jul 27 '18 at 13:46
  • I edited my question. I hope that my problem is more understandable now. – jonas00 Jul 31 '18 at 09:22
  • (+1) This looks like a good edit. I think that the question is sufficiently clear for a person to answer, so I have voted to re-open. A certain number of other reviewers will have to vote to do the same, or a moderator will have to vote re-open, for it to be re-opened. – Sycorax Jul 31 '18 at 21:31
  • "So there are no negative labels but it still outperforms the normal TFIDF." This leads me to think you have performed some kind of test - can you provide the code? It is not unreasonable that some words will be biased towards a specific star rating providing a boost – tRosenflanz Aug 01 '18 at 20:45
  • It looks like LinearSVC will perform a 1-vs-rest classification by default, unless asked to do something else via the 'multi_class' parameter. (see [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html)). So there are actually 5 binary classifications in your code. In that case the Delta TFIDF seems applicable. – Jelle Schühmacher Aug 02 '18 at 11:59
  • Can you elaborate on your answer? I know how one-vs-all classification works and that LinearSVC does it by default. Why does Delta TFIDF seem applicable because of one-vs-all? – jonas00 Aug 02 '18 at 13:45
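
The one-vs-rest decomposition discussed in the comments can be sketched in plain Python. This only illustrates how five star ratings turn into five binary label vectors; the helper `one_vs_rest_labels` and the sample ratings are hypothetical, and the sketch says nothing about how `DeltaTfidfVectorizer` itself treats multi-valued labels, since the question's code fits it once on the original ratings.

```python
def one_vs_rest_labels(labels):
    """Map each class to a binary label vector: 1 for that class, 0 for the rest."""
    return {c: [1 if y == c else 0 for y in labels] for c in sorted(set(labels))}

# Hypothetical star ratings for seven reviews
stars = [1, 5, 3, 5, 2, 4, 1]
binary_problems = one_vs_rest_labels(stars)

for rating, y_bin in binary_problems.items():
    # Each vector defines one "this rating vs. everything else" problem
    print(rating, y_bin)
```

Each of these binary problems has a positive and a negative side, which is the setting Delta TF-IDF was described for. Whether a single delta weighting fitted on the 1-5 labels matches any of those five splits is a separate question.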
