Similarity measures and document length

Asked Jun 20 '16 at 09:34


I have an application where I need to measure the similarity between the (TF-IDF?) representations of two documents, $\mathbf{a}$ and $\mathbf{b}$, while still taking document length into account. More specifically, if document $\mathbf{a}$ is contained within a much larger document $\mathbf{b}$, I do not want the similarity to decrease significantly; ideally, $\texttt{sim}(\mathbf{a}, \mathbf{a}) \approx \texttt{sim}(\mathbf{a}, \mathbf{b})$.

I was thinking of using cosine similarity without the length normalization, i.e.

$\texttt{sim}(\mathbf{a}, \mathbf{b}) = K \, \mathbf{a}^T \mathbf{b}$

where $K$ is a normalizing constant independent of $\mathbf{a}$ and $\mathbf{b}$.
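To make the difference concrete, here is a minimal sketch comparing ordinary cosine similarity with the unnormalized dot product. It assumes toy term-weight vectors where the larger document contains every term of the smaller one with the same weight, plus extra terms; the vectors and the function names (`cosine_sim`, `scaled_dot_sim`) are made up for illustration, and real TF-IDF weights that are themselves length-normalized would behave differently.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length weight vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_sim(a, b):
    """Standard cosine similarity: dot product divided by both vector norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def scaled_dot_sim(a, b, k=1.0):
    """Dot product times a constant k that does not depend on a or b."""
    return k * dot(a, b)

# Hypothetical toy vectors: b contains all of a's terms (same weights)
# plus two extra terms that a does not have.
a = [1.0, 2.0, 0.0, 0.0]
b = [1.0, 2.0, 5.0, 7.0]

cos_self = cosine_sim(a, a)            # similarity of a with itself
cos_contained = cosine_sim(a, b)       # drops because b's norm is much larger
dot_self = scaled_dot_sim(a, a)        # a . a
dot_contained = scaled_dot_sim(a, b)   # a . b, unaffected by b's extra terms

print(cos_self, cos_contained)
print(dot_self, dot_contained)
```

In this toy case the dot product is identical for both pairs, because the extra terms in the larger vector are multiplied by zeros in the smaller one, whereas the cosine score falls sharply. Note that the dot product is unbounded, so if the larger document repeats the shared terms with higher weights the score grows rather than staying fixed.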

Is there a better way to achieve this?

asked Jun 20 '16 at 09:34

kyrre

