
I am using tf-idf to find words that are particularly important to individual documents. This works pretty well for my purposes. However, one area where I feel it falls short is how harshly words are penalised for appearing in all other documents, regardless of how large the frequency differences are: any word that appears in every document scores exactly 0.

As an example, if the word "dog" appeared once or twice in every document, but 100 times in one specific document, it would still get a tf-idf of 0, whereas the word "cat", appearing only once in a single document, would get a higher score. I can see that part of what tf-idf rewards is novelty (not appearing elsewhere), but in this instance it seems more helpful to know that "dog" is an important word in that document than that "cat" is. See the obviously silly example below:

   document  word freq    tf      idf      tf_idf
1:        A   dog    2 0.002 0.000000 0.000000000
2:        B   dog    1 0.001 0.000000 0.000000000
3:        C   dog    2 0.002 0.000000 0.000000000
4:        D   dog  100 0.100 0.000000 0.000000000
5:        D   cat    1 0.001 1.386294 0.001386294
6:        A other  998 0.998 0.000000 0.000000000
7:        B other  999 0.999 0.000000 0.000000000
8:        C other  998 0.998 0.000000 0.000000000
9:        D other  899 0.899 0.000000 0.000000000
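For anyone wanting to reproduce the numbers above, here is a minimal sketch in Python (the document names, words, and counts are taken from the table; the idf formula assumed is the unsmoothed natural-log form, idf = ln(N / df), which is what matches the table's values):

```python
import math

# Toy corpus matching the table above: word counts per document
docs = {
    "A": {"dog": 2, "other": 998},
    "B": {"dog": 1, "other": 999},
    "C": {"dog": 2, "other": 998},
    "D": {"dog": 100, "cat": 1, "other": 899},
}

n_docs = len(docs)

# Document frequency: in how many documents does each word appear?
df = {}
for counts in docs.values():
    for word in counts:
        df[word] = df.get(word, 0) + 1

def tf_idf(doc, word):
    counts = docs[doc]
    tf = counts[word] / sum(counts.values())       # relative frequency
    idf = math.log(n_docs / df[word])              # ln(N / df), no smoothing
    return tf * idf

print(tf_idf("D", "dog"))  # 0.0 -- idf is ln(4/4) = 0, despite 100 occurrences
print(tf_idf("D", "cat"))  # ~0.001386 -- idf is ln(4/1)
```

Because "dog" occurs in all four documents, its idf (and therefore its tf-idf) is exactly zero in every document, no matter how lopsided the within-document counts are.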

I'm therefore a bit concerned that I'm missing words that are more important than their scores suggest. Apart from this, though, the approach is doing quite well.

Are there approaches that would give me something similar (an ordered list of the most important / distinctive words in each document) without attaching such a big penalty to words merely for having appeared elsewhere, while still valuing novelty?
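To make the question concrete, here is one illustrative tweak (a sketch, not a recommendation): smoothing the idf term so it never reaches zero, e.g. idf = ln(1 + N / df). The corpus below reuses the toy counts from the table; the function name `smoothed_tf_idf` is made up for this example:

```python
import math

# Same toy counts as in the table above
docs = {
    "A": {"dog": 2, "other": 998},
    "B": {"dog": 1, "other": 999},
    "C": {"dog": 2, "other": 998},
    "D": {"dog": 100, "cat": 1, "other": 899},
}
n_docs = len(docs)

df = {}
for counts in docs.values():
    for word in counts:
        df[word] = df.get(word, 0) + 1

def smoothed_tf_idf(doc, word):
    """tf * ln(1 + N/df): idf stays strictly positive, so the
    within-document frequency of a ubiquitous word still matters."""
    counts = docs[doc]
    tf = counts[word] / sum(counts.values())
    return tf * math.log(1 + n_docs / df[word])

print(smoothed_tf_idf("D", "dog"))  # 0.1   * ln(2) ~ 0.0693
print(smoothed_tf_idf("D", "cat"))  # 0.001 * ln(5) ~ 0.0016
```

Under this scoring, "dog" outranks "cat" in document D, which is the kind of behaviour being asked about, though it of course weakens the novelty signal that pure tf-idf provides.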

Jaccar