
From the Lucene docs:

$\text{IDF} = 1 + \log\left(\frac{\text{numDocs}}{\text{docFreq}+1}\right)$

In other references (e.g. Wikipedia), IDF is typically calculated as $\log\left(\frac{\text{numDocs}}{\text{docFreq}}\right)$ or $\log\left(\frac{\text{numDocs}}{\text{docFreq}+1}\right)$ to avoid dividing by 0.

I also realize Lucene uses $\sqrt{x}$ rather than $\log(x)$ for calculating TF, but my understanding is that this is just a preferred transformation, probably to avoid $\log(0)$.
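To make the comparison concrete, here is a small sketch of the two variants as I understand them (toy corpus size and document frequencies, not Lucene's actual code):

```python
# A rough sketch of the two IDF variants (toy numbers, not Lucene's
# actual implementation).
import math

def idf_standard(num_docs, doc_freq):
    # The "textbook" form: log(numDocs / docFreq); undefined if doc_freq == 0.
    return math.log(num_docs / doc_freq)

def idf_lucene(num_docs, doc_freq):
    # The form quoted from the Lucene docs: 1 + log(numDocs / (docFreq + 1)).
    return 1 + math.log(num_docs / (doc_freq + 1))

num_docs = 1000
for doc_freq in (1, 10, 100, 999):
    print(doc_freq,
          round(idf_standard(num_docs, doc_freq), 3),
          round(idf_lucene(num_docs, doc_freq), 3))
```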

Can anyone explain that additional +1 in the IDF term?

Greg Dean
1 Answer


All TF-IDF weighting schemes are just heuristic methods to give more weight to unusual terms. I'm not sure that TF-IDF schemes generally have a solid statistical basis behind them (see reference 1), beyond the observation that TF-IDF tends to produce better results than simple word counts. Since the quality of the results is the primary (sole?) justification for TF-IDF in the first place, one could argue that simply trying your model with and without the +1 and picking whichever works better would be fine.

If I'm reading this scikit-learn thread correctly, it appears that you are not the first person to raise a similar question about adding 1 to IDF scores. The consensus on that thread is that the +1 is nonstandard behavior as well. I only skimmed it, but the thread does not appear to contain a resounding endorsement or justification of the +1.

So the choice of +1 has the effect of placing the lower bound on all IDF values at 1 rather than at 0. Since $1+\log(x)=\log(ex)$, this is equivalent to multiplying numDocs by a factor of $e$. I'm not sure why that might be helpful, but perhaps it is in specific contexts. One might even treat some constant $c$ in $c+\log\left(\frac{\text{numDocs}}{\text{docFreq}+1}\right)$ as a tuning parameter, to give you a more flexible family of IDF schemes with $c$ as their lower bound.
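A minimal sketch of that tuning-parameter idea, using a hypothetical helper `idf_family` and made-up numbers, just to show that $c$ sets the lower bound:

```python
# Sketch of the tuning-parameter idea: c sets the lower bound of the
# IDF values, with c = 1 recovering the Lucene formula above.
# idf_family is a hypothetical helper, not part of any library.
import math

def idf_family(num_docs, doc_freq, c):
    return c + math.log(num_docs / (doc_freq + 1))

num_docs = 1000
doc_freq = 999  # a term occurring in essentially every document
for c in (0.0, 0.5, 1.0):
    print(c, round(idf_family(num_docs, doc_freq, c), 3))  # value is (approximately) c
```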

When the lower bound of IDF is zero, the product $\text{term frequency}\times\text{IDF}$ may be 0 for some terms, so those terms are given no weight at all in the learning procedure; qualitatively, the terms are so common that they provide no information relevant to the NLP task. When the lower bound is nonzero, these terms retain some influence.
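As a toy illustration of this point (assumed numbers, not any particular library's implementation), a term appearing in essentially every document gets no weight under the zero-lower-bound scheme but keeps a small weight under Lucene's:

```python
# Toy illustration: with a zero lower bound, a ubiquitous term contributes
# nothing to tf * idf; with the +1, it keeps a small, nonzero weight.
import math

num_docs, doc_freq, tf = 1000, 999, 5

tfidf_zero_bound = tf * math.log(num_docs / (doc_freq + 1))        # 0.0: term is ignored
tfidf_unit_bound = tf * (1 + math.log(num_docs / (doc_freq + 1)))  # 5.0: term still counts
print(tfidf_zero_bound, tfidf_unit_bound)
```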

  1. John Lafferty and Guy Lebanon. "Diffusion Kernels on Statistical Manifolds." Journal of Machine Learning Research, 2005.
Sycorax
  • Thanks for the well thought out answer. I was hoping to get a better idea of why the lower bound of 1 for IDF is useful. Interesting that other people have the same question, with no real answer. – Greg Dean May 25 '15 at 06:02
  • @GregDean I'm afraid that this explanation is the best that I can manage. I did some more research to try and find something more definitive, but didn't have much luck. – Sycorax May 26 '15 at 04:45