TF-IDF versus Cosine Similarity in Document Search

Question

I'm wondering if anyone can help me out or point out some resources to learn more about TF-IDF and document search.

I'm trying to implement a basic document search and want to better understand the differences and trade-offs of my approach.

My current approach is to parse/stem all words in a set of documents and compute a normalized TF-IDF value for each document-word pair. When I query with keywords, I simply look up each word in the query, sum the TF-IDF values of the matching document-word pairs for each document, and rank the documents by that sum.

Are there any trade-offs/differences/mistakes in using this approach? How does it compare to creating a vector for each document, creating a vector for the search query, and taking the cosine similarity to find the closest matches?

score 6 · Answer 1 · answered Mar 08 '15 at 11:02

Xeon is right that TF-IDF and cosine similarity are two different things. TF-IDF gives you a representation (a weight) for a given term in a document. Cosine similarity gives you a score for two documents that share the same representation. However, "one of the simplest ranking functions is computed by summing the tf–idf for each query term". This approach is biased towards long documents, in which more of your terms are likely to appear (e.g., Encyclopedia Britannica). There are also much more advanced approaches based on a similar idea (most notably Okapi BM25).
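To make the long-document bias concrete, here is a minimal sketch of the "sum the tf-idf of each query term" ranking. It is only an illustration under my own assumptions: raw term counts for TF, `log(N / df)` for IDF, whitespace tokenization, and a made-up three-document corpus. The names `build_index` and `sum_score` are hypothetical, not from any library.

```python
import math
from collections import Counter

def build_index(docs):
    """Return one {term: tf-idf weight} dict per document (raw-count TF, assumed)."""
    n = len(docs)
    counts = [Counter(doc.split()) for doc in docs]
    # Document frequency: iterating a Counter yields each distinct term once.
    df = Counter(term for c in counts for term in c)
    idf = {t: math.log(n / df[t]) for t in df}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]

def sum_score(weights, query):
    """Sum the tf-idf weights of the query terms that occur in the document."""
    return sum(weights.get(t, 0.0) for t in query.split())

# Toy corpus (made up for illustration).
docs = [
    "cat sat mat",                                  # short document
    "cat dog bird fish cat dog bird fish cat dog",  # long document, terms repeated
    "tree rock river",                              # unrelated document
]
index = build_index(docs)
scores = [sum_score(w, "cat dog") for w in index]
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)  # the long document (index 1) comes first
```

With unnormalized counts the long document wins simply because it repeats the query terms more often. Dividing TF by document length reduces this effect, and BM25 handles it with an explicit length-normalization parameter.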

In general, you should use cosine similarity if you are comparing elements of the same nature (e.g., documents vs. documents) or when you need the score itself to have some meaningful value. With cosine similarity, a score of 1.0 means that the two vectors point in the same direction, i.e., the two elements are the same up to scaling in their representation. I would recommend these resources to learn more about the topic:

Modern Information Retrieval, by Ricardo Baeza-Yates et al.,
Introduction to Information Retrieval, by Christopher Manning et al.

score 3 · Answer 2 · answered Mar 05 '15 at 22:10

TF-IDF is about features and their normalization. The cosine metric is the metric you will use to score.

If I remember correctly, TF normalizes the word counts in a vector. You can then compare TF-normalized vectors using the cosine metric. Adding the IDF weight down-weights terms that are too frequent (e.g., stop words) so they won't dominate other, less frequent (and often more informative) features.
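As a rough sketch of that pipeline (length-normalized TF, IDF weighting, then cosine scoring of the query against each document), something like the following should work. The corpus, the `log(N / df)` form of IDF, and the helper names `fit_idf`, `vectorize`, and `cosine` are my own assumptions for illustration, not a reference implementation.

```python
import math
from collections import Counter

def fit_idf(docs):
    """IDF per term, computed from the document collection only."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d.split()))
    return {t: math.log(n / df[t]) for t in df}

def vectorize(text, idf):
    """Sparse TF-IDF vector; TF is the count divided by the text length."""
    counts = Counter(text.split())
    total = sum(counts.values())
    # Terms never seen in the corpus have no IDF and are skipped (an assumption).
    return {t: (c / total) * idf[t] for t, c in counts.items() if t in idf}

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy corpus (made up for illustration).
docs = ["the cat sat on the mat", "the dog chased the cat", "the bird sang"]
idf = fit_idf(docs)
doc_vecs = [vectorize(d, idf) for d in docs]
query_vec = vectorize("the dog", idf)
sims = [cosine(query_vec, dv) for dv in doc_vecs]
best = max(range(len(docs)), key=lambda i: sims[i])
print(best, sims)  # the second document (index 1) is the only one with a nonzero score
```

Note that "the" occurs in every document, so its IDF is `log(1) = 0` and it contributes nothing to any score, which is the down-weighting of overly frequent terms described above. Because cosine divides by both vector norms, a document is not rewarded merely for being long.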

Clean your corpus before creating TF-IDF vectors. For example, apply stemming (e.g., use the Porter stemmer). This will reduce the vocabulary and make the word vectors less orthogonal.

Thanks. I have already cleaned the corpus. But is summing the TF-IDF weights the wrong approach? Should I be creating a vector for each document and a vector for the query, and taking the cosine of those? — Tim S, Mar 05 '15 at 22:29
