
From the Lucene docs:

$\text{IDF} = 1 + \log\left(\frac{\text{numDocs}}{\text{docFreq}+1}\right)$

In other references (e.g. Wikipedia), IDF is typically calculated as $\log\left(\frac{\text{numDocs}}{\text{docFreq}}\right)$ or $\log\left(\frac{\text{numDocs}}{\text{docFreq}+1}\right)$ to avoid dividing by 0.

I also realize Lucene uses $\sqrt{x}$ rather than $\log(x)$ for calculating TF, but my understanding is that this is just a preferred transformation, probably to avoid $\log(0)$.
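To make the comparison concrete, here is a small sketch of the two variants as I understand them (toy corpus size and document frequencies, not Lucene's actual code):

```python
# A rough sketch of the two IDF variants (toy numbers, not Lucene's
# actual implementation).
import math

def idf_standard(num_docs, doc_freq):
    # The "textbook" form: log(numDocs / docFreq); undefined if doc_freq == 0.
    return math.log(num_docs / doc_freq)

def idf_lucene(num_docs, doc_freq):
    # The form quoted from the Lucene docs: 1 + log(numDocs / (docFreq + 1)).
    return 1 + math.log(num_docs / (doc_freq + 1))

num_docs = 1000
for doc_freq in (1, 10, 100, 999):
    print(doc_freq,
          round(idf_standard(num_docs, doc_freq), 3),
          round(idf_lucene(num_docs, doc_freq), 3))
```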

Can anyone explain that additional +1 in the IDF term?

Greg Dean
1 Answer


All TF-IDF weighting schemes are just heuristic methods to give more weight to unusual terms. I'm not sure that TF-IDF schemes generally have a solid statistical basis behind them (see reference 1), beyond the observation that TF-IDF tends to produce better results than simple word counts. Since the quality of the results is the primary (sole?) justification for TF-IDF in the first place, one could argue that simply trying your model with and without the +1 and picking whichever works better would be fine.

If I'm reading this scikit-learn thread correctly, it appears that you are not the first person to raise a similar question about adding 1 to IDF scores. The consensus on that thread is that the +1 is nonstandard behavior as well. I only skimmed it, but the thread does not appear to contain a resounding endorsement or justification of the +1.

So the choice of +1 has the effect of placing the lower bound on all IDF values at 1 rather than at 0. Since $1+\log(x)=\log(ex)$, this is equivalent to multiplying numDocs by a factor of $e$. I'm not sure why that might be helpful, but perhaps it is in specific contexts. One might even treat some constant $c$ in $c+\log\left(\frac{\text{numDocs}}{\text{docFreq}+1}\right)$ as a tuning parameter, to give you a more flexible family of IDF schemes with $c$ as their lower bound.
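A minimal sketch of that tuning-parameter idea, using a hypothetical helper `idf_family` and made-up numbers, just to show that $c$ sets the lower bound:

```python
# Sketch of the tuning-parameter idea: c sets the lower bound of the
# IDF values, with c = 1 recovering the Lucene formula above.
# idf_family is a hypothetical helper, not part of any library.
import math

def idf_family(num_docs, doc_freq, c):
    return c + math.log(num_docs / (doc_freq + 1))

num_docs = 1000
doc_freq = 999  # a term occurring in essentially every document
for c in (0.0, 0.5, 1.0):
    print(c, round(idf_family(num_docs, doc_freq, c), 3))  # value is (approximately) c
```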

When the lower bound of IDF is zero, the product $\text{term frequency}\times\text{IDF}$ may be 0 for some terms, so those terms are given no weight at all in the learning procedure; qualitatively, the terms are so common that they provide no information relevant to the NLP task. When the lower bound is nonzero, these terms retain some influence.
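As a toy illustration of this point (assumed numbers, not any particular library's implementation), a term appearing in essentially every document gets no weight under the zero-lower-bound scheme but keeps a small weight under Lucene's:

```python
# Toy illustration: with a zero lower bound, a ubiquitous term contributes
# nothing to tf * idf; with the +1, it keeps a small, nonzero weight.
import math

num_docs, doc_freq, tf = 1000, 999, 5

tfidf_zero_bound = tf * math.log(num_docs / (doc_freq + 1))        # 0.0: term is ignored
tfidf_unit_bound = tf * (1 + math.log(num_docs / (doc_freq + 1)))  # 5.0: term still counts
print(tfidf_zero_bound, tfidf_unit_bound)
```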

  1. John Lafferty and Guy Lebanon. "Diffusion Kernels on Statistical Manifolds." Journal of Machine Learning Research, 2005.
Sycorax
  • Thanks for the well thought out answer. I was hoping to get a better idea of why the lower bound of 1 for IDF is useful. Interesting that other people have the same question, with no real answer. – Greg Dean May 25 '15 at 06:02
  • @GregDean I'm afraid that this explanation is the best that I can manage. I did some more research to try and find something more definitive, but didn't have much luck. – Sycorax May 26 '15 at 04:45