It's common to see people use a tf-idf representation of words for text classification, but I don't understand why we wouldn't just use tf.
Say $tf(t,d)$ is the term frequency of term $t$ in document $d$ and $idf(t)$ is the inverse document frequency of term $t$. Since $idf(t)$ is the same for every document, it is just a constant by which we multiply every $tf(t,d)$ in that term's column. If we fit a linear model, for example, won't multiplying $tf(t,d)$ by $idf(t)$ just rescale the learned weight for $t$ by a factor of $1/idf(t)$, leaving the predictions unchanged? And if we further normalize or standardize the columns, won't the two representations end up being exactly the same?
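To make the standardization part of my question concrete, here is a minimal sketch (assuming scikit-learn's `CountVectorizer`, `TfidfTransformer`, and `StandardScaler`, on a made-up toy corpus). With per-document (row) normalization turned off, the idf factor is a constant within each column, so standardizing the columns cancels it:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.preprocessing import StandardScaler

# Toy corpus, purely for illustration
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the dog sat on the floor",
    "a cat and a dog played on the floor",
]

counts = CountVectorizer().fit_transform(docs)              # raw tf counts (sparse)
tfidf = TfidfTransformer(norm=None).fit_transform(counts)   # tf * idf, row normalization off

# Standardize each column (zero mean, unit variance)
tf_std = StandardScaler().fit_transform(counts.toarray())
tfidf_std = StandardScaler().fit_transform(tfidf.toarray())

# The constant per-column idf factor cancels in the standardization
print(np.allclose(tf_std, tfidf_std))  # True
```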
I saw the same question here, but I'm not sure I agree with the accepted answer. It argues that better features help learning, but it doesn't explain how the idf factor actually produces better features when the model is invariant to this kind of per-feature scaling.
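For the scale-invariance part, here is a small synthetic check (again just a sketch, with random made-up data standing in for the tf counts, idf factors, and targets): an unregularized linear model fitted on tf and on tf-idf gives identical predictions, with the tf-idf weights simply rescaled by $1/idf(t)$.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
tf = rng.poisson(1.0, size=(50, 5)).astype(float)    # fake tf matrix (50 docs, 5 terms)
idf = rng.uniform(1.0, 3.0, size=5)                  # fake per-term idf factors
y = rng.normal(size=50)                              # fake target

model_tf = LinearRegression().fit(tf, y)
model_tfidf = LinearRegression().fit(tf * idf, y)

# Same predictions; the weights just absorb the idf factor
print(np.allclose(model_tf.predict(tf), model_tfidf.predict(tf * idf)))  # True
print(np.allclose(model_tf.coef_, model_tfidf.coef_ * idf))              # True
```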