I am working on a text corpus. Each line contains between 10 and 50 words. The whole text has about 1,000,000 lines and a vocabulary of around 25,000 distinct words. I turned this corpus into its tf-idf representation.

I was wondering whether "high" and "low" tf-idf values have any absolute meaning. Can I ignore, say, words whose tf-idf is lower than 1.5?

Or are there real-world pathological text corpora where tf-idf can take arbitrary values?
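To make the question concrete, here is a minimal sketch of one common unnormalized variant, weight = tf × ln(N / df). The question does not say which variant was used, so the formula, the function name `tf_idf`, and the toy documents are all assumptions for illustration. Under this variant a term that occurs once in a single line of a 1,000,000-line corpus would get ln(1,000,000) ≈ 13.8, and repeated occurrences scale that further, whereas L2-normalized variants keep every value between 0 and 1. A cutoff such as 1.5 therefore only makes sense relative to the variant in use.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * ln(N / df).

    Assumed variant: raw term counts and an unsmoothed natural-log idf.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        weights.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return weights

# Toy corpus (hypothetical), one tokenized line per document.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
    "rare rare rare word".split(),
]
w = tf_idf(docs)

# A frequent word scores low, a repeated rare word scores high.
print(w[0]["the"], w[0]["mat"], w[3]["rare"])
```

In this sketch the weight of a term is bounded above by its count times ln(N), so the range of values grows with corpus size and with repetition inside a line.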

RUser4512
  • It is quite common to ignore terms/words that appear very few times (e.g. less than 3 times) in the whole corpus. These can sometimes produce extreme values otherwise. – dcorney Aug 07 '15 at 15:21
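The filtering suggested in the comment above can be sketched as follows. The helper name `drop_rare_terms`, the default threshold of 3, and the toy data are hypothetical; the idea is only to remove terms whose total count over the whole corpus is below the threshold before computing tf-idf.

```python
from collections import Counter

def drop_rare_terms(docs, min_count=3):
    """Remove terms that occur fewer than min_count times in the whole corpus."""
    # Total number of occurrences of each term across all documents.
    total = Counter(term for doc in docs for term in doc)
    return [[t for t in doc if total[t] >= min_count] for doc in docs]

# Toy corpus (hypothetical): "a" and "b" occur 3 times each, "c" only once.
docs = [["a", "b", "a"], ["a", "c"], ["b", "b"]]
filtered = drop_rare_terms(docs)
print(filtered)
```

Filtering on the total count is one reading of the comment; filtering on document frequency (the number of lines containing the term) is a close alternative and may suit short lines better.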
