12

When training a model, you can fit the Tf-idf either on the corpus of the training set only, or on the training and test sets combined.

At first it seems wrong to include the test corpus when fitting the model, but since Tf-idf is unsupervised, it is also possible to fit it on the whole corpus.

Which is the better approach?

PascalVKooten
  • If you calculate tf-idf on the entire data set, how would you check whether your model generalizes? – Alexey Grigorev May 30 '15 at 07:48
  • Well, the bigger point is that with "real" new unseen data, you could still feed the words into the Tf-idf, altering it. You can then use the training data to make a train/test split and validate a model. But basically you can still make use of the "unsupervised" new data. – PascalVKooten May 30 '15 at 18:11

4 Answers

7

Using TF-IDF vectors that have been calculated on the entire corpus (training and test subsets combined) while training the model can introduce data leakage and hence yield overly optimistic performance measures. This is because the IDF part of the training set's TF-IDF features would then already include information from the test set.

Calculating them completely separately for the training and test sets is not a good idea either: besides the quality of your model, you would then also be testing the quality of your IDF estimation. And because the test set is usually small, this estimate will be poor and will worsen your performance measures.

Therefore I would suggest (analogously to the common mean imputation of missing values) performing the TF-IDF normalization on the training set separately and then using the IDF vector from the training set to calculate the TF-IDF vectors of the test set.
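A minimal sketch of this suggestion with scikit-learn, whose TfidfVectorizer learns both the vocabulary and the IDF weights at fit time; the example documents are illustrative only:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["the cat sat on the mat", "the dog ate my homework"]
test_docs = ["the cat ate the dog"]

vectorizer = TfidfVectorizer()
# Fit vocabulary and IDF on the training set only.
X_train = vectorizer.fit_transform(train_docs)
# Reuse the training-set IDF (no refitting) for the test set.
X_test = vectorizer.transform(test_docs)
```

Because `transform` reuses the fitted IDF, no information from the test documents leaks into the training features, and both matrices share the same feature space.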

krltrl
2

Usually, as this site's name suggests, you'd want to separate your train, cross-validation and test datasets. As @Alexey Grigorev mentioned, the main concern is having some certainty that your model can generalize to some unseen dataset.

In more intuitive terms, you'd want your model to grasp the relations between each row's features and its prediction, and to apply them later to different, unseen rows.

These relations are at the row level, but they are learned by looking at the entire training data. The challenge of generalizing is, then, making sure the model grasps a formula rather than over-fitting to the specific set of training values.

I'd thus distinguish between two TF-IDF scenarios, depending on how you consider your corpus:

1. The corpus is at the row level

We have one or more text features that we'd like to apply TF-IDF to in order to derive term frequencies for that row. Usually it'd be a large text field that is important by itself, like an additional document describing a house-buying contract in a house-sale dataset. In this case the text features should be processed at the row level, like all the other features.

2. The corpus is at the dataset level

In addition to having a row context, the text feature of each row is meaningful in the context of the entire dataset. Usually this is a smaller text field (like a sentence). The TF-IDF idea here might be to calculate the "rareness" of words in a larger context. That larger context might be the entire text column from the train and even the test datasets, since the more corpus knowledge we have, the better we can estimate rareness. I'd even say you could use the text from the unseen dataset, or even an outer corpus. TF-IDF here helps you feature-engineer at the row level from outside (larger, lookup-table-like) knowledge.

Take a look at HashingVectorizer, a "stateless" vectorizer suitable for a mutable corpus.
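A short sketch of why HashingVectorizer fits a mutable corpus: it keeps no fitted vocabulary, so any document, seen or unseen, hashes into the same fixed-size feature space (the documents below are illustrative):

```python
from sklearn.feature_extraction.text import HashingVectorizer

# No fit step: the hash function defines the feature space up front.
vectorizer = HashingVectorizer(n_features=2**10)
X_old = vectorizer.transform(["an old document"])
X_new = vectorizer.transform(["a brand new document"])
# Dimensions always match, even for words never seen before.
assert X_old.shape[1] == X_new.shape[1]
```

The trade-off is that hashing drops the IDF part entirely and you can no longer map features back to words, but the vectorizer never needs refitting as the corpus grows.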

mork
1

An incremental approach is robust to leakage.

Use case: train a document classification model on a large corpus and test it on a new set of documents.

At training time: calculate TF-IDF on the training data and use the result as features for the classification model.

At test time: add the new documents to the corpus and recalculate TF-IDF on the whole corpus. Use the TF-IDF values for the new documents as inputs to the model for scoring.

If the number of documents being tested/scored is small, you may wish to speed the process up by recalculating only the TF and reusing the existing IDF figures, as they won't be affected much by a small number of docs.

Live use: same as at test time, i.e. the approach is robust to live use and doesn't leak.
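One way to sketch this incremental recalculation in scikit-learn while keeping the model's input dimension fixed (which also addresses the dimensionality concern raised in the comment below): freeze the vocabulary learned at training time, then refresh the IDF over the enlarged corpus. The document lists and variable names are illustrative assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["old document one", "old document two"]
new_docs = ["a new unseen document"]

# Training time: learn vocabulary and IDF from the training corpus.
trained = TfidfVectorizer().fit(train_docs)

# Test time: recompute IDF over train + new docs, but restricted to the
# training vocabulary so the feature dimension stays compatible.
refreshed = TfidfVectorizer(vocabulary=trained.vocabulary_)
refreshed.fit(train_docs + new_docs)
X_new = refreshed.transform(new_docs)
assert X_new.shape[1] == len(trained.vocabulary_)
```

Words that appear only in the new documents are still dropped (they are outside the frozen vocabulary), so this refreshes the IDF figures without changing what the model sees as features.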

John Curry
  • Calculating a new TFIDF vector including test corpus would mean the new vector has more dimensions than the model parameters trained only on train time. How would you account for that? – Maverick Meerkat Apr 13 '21 at 05:33
0

Ideally it should be fit on the entire corpus, so as to learn the full vocabulary and assign a score to each term.

Corpus = grouped data around a certain entity, or the entire corpus.
Train = training data split from a certain entity, or the entire corpus.
Test = test data split from a certain entity, or the entire corpus.

    from sklearn.feature_extraction.text import TfidfVectorizer

    vec = TfidfVectorizer()
    vec.fit(corpus)
    trainx = vec.transform(train)
    testx = vec.transform(test)

  • This is not how I understand test sets to work, could you provide some references and maybe more detail to support your argument? The point of test sets is precisely that, to test your model independent of the training set. It is a first line estimation of the transfer-ability of the model to new data, so using the test set to build the model would eliminate this insight. – ReneBt Dec 19 '18 at 08:25