In information retrieval, tf–idf, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.
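As a rough illustration of the definition above, here is a minimal sketch of one common tf-idf variant. It assumes term frequency is the within-document relative count and IDF is the natural log of (number of documents / document frequency); other variants smooth or rescale these terms. The function name `tf_idf` and the toy corpus are illustrative only.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Toy corpus (made up for illustration).
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
weights = tf_idf(docs)
```

Under this variant a term that occurs in every document (here "the") gets weight zero, while a term confined to one document (here "mat") outweighs one shared by two documents ("cat").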
Questions tagged [tf-idf]
83 questions
19
votes
2 answers
How is the .similarity method in SpaCy computed?
Not sure if this is the right Stack site, but here goes.
How does the .similarity method work?
Wow spaCy is great! Its tfidf model could be easier, but w2v with only one line of code?!
In his 10 line tutorial on spaCy andrazhribernik shows us the…

whs2k
8
votes
1 answer
Why does Lucene IDF have a seemingly additional +1?
From the Lucene docs
$\text{IDF} = 1 + \log\left(\frac{\text{numDocs}}{\text{docFreq}+1}\right)$
In other references (e.g. Wikipedia), IDF is typically calculated as $\log\left(\frac{\text{numDocs}}{\text{docFreq}}\right)$ or…

Greg Dean
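To make the two formulas quoted in this question concrete, the sketch below evaluates both side by side. It assumes the natural logarithm; the function names `textbook_idf` and `lucene_style_idf` are illustrative and are not Lucene API calls.

```python
import math

def textbook_idf(num_docs, doc_freq):
    # log(numDocs / docFreq); undefined when doc_freq is 0.
    return math.log(num_docs / doc_freq)

def lucene_style_idf(num_docs, doc_freq):
    # 1 + log(numDocs / (docFreq + 1)), as quoted in the question.
    return 1 + math.log(num_docs / (doc_freq + 1))

num_docs = 1000
for doc_freq in (1, 10, 1000):
    print(doc_freq,
          round(textbook_idf(num_docs, doc_freq), 3),
          round(lucene_style_idf(num_docs, doc_freq), 3))
```

Two differences are visible from the arithmetic alone: the `+1` in the denominator keeps the quoted formula defined when `doc_freq` is 0, and the leading `1 +` keeps the weight positive for a term that appears in every document, where the textbook form gives exactly 0.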
7
votes
2 answers
Which weighting factor to use for text categorization
I am working on a text categorization task, with 21,000 documents for training and (for the time being) 7,000 documents for testing. I construct the document-term matrix for both the training corpus and the testing corpus, with two different weighting…

Ensom Hodder
6
votes
1 answer
Difference between Log Entropy Model and TF-IDF Model?
I would like to understand the differences and advantages of using TF-IDF versus the Log Entropy model for representing documents and queries in an information retrieval system using different weights.
I've tested both of them and computed the recall…

yolanda_dlh
5
votes
1 answer
What does word embedding weighted by tf-idf mean?
The paper that I am reading explains how it implemented the feature vector used for a Twitter sentiment classification task.
The first is a simple combination, where each tweet is represented by
the average of the word embedding vectors of…

dawn
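One plausible reading of "word embedding weighted by tf-idf" is a weighted average of word vectors, with each word's tf-idf score as its weight. The sketch below shows that reading only; the excerpt is truncated, so this is not necessarily the paper's exact method. The embeddings, the weights, and the choice to normalize by the sum of weights are all assumptions made for illustration.

```python
def weighted_average(tokens, embeddings, weights):
    """Average the vectors of `tokens`, weighting each by its tf-idf score.

    Tokens without an embedding are skipped. Normalizing by the sum of
    weights (rather than the token count) is an assumption of this sketch.
    """
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for token in tokens:
        if token not in embeddings:
            continue
        w = weights.get(token, 0.0)
        total = [t + w * x for t, x in zip(total, embeddings[token])]
        weight_sum += w
    if weight_sum == 0.0:
        return total  # all-zero vector when nothing could be weighted
    return [t / weight_sum for t in total]

# Hypothetical 2-dimensional embeddings and tf-idf weights.
embeddings = {"good": [1.0, 0.0], "movie": [0.0, 1.0]}
weights = {"good": 3.0, "movie": 1.0}

tweet_vector = weighted_average(["good", "movie"], embeddings, weights)
print(tweet_vector)  # [0.75, 0.25]
```

With equal weights this reduces to the plain average described in the excerpt; the tf-idf weights pull the result toward the more distinctive word.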
4
votes
0 answers
Google gender-pay gap vs
Background:
I read this:
Google schools US government about gender pay gap.
It derives from this Google blog post by Eileen Naughton, VP of People Operations.
She asserts that Google is somehow "sharing" its top-level analysis publicly. Top-level…

EngrStudent
4
votes
2 answers
How to use TFIDF-vectors with Multinomial Naive-Bayes?
Say we have used the TFIDF transform to encode documents into continuous-valued features.
How would we now use this as input to a Naive Bayes classifier?
Bernoulli naive-bayes is out, because our features aren't binary anymore.
Seems like we can't…

dhrumeel
3
votes
1 answer
Tf-idf for text classification: On what should IDF be calculated?
The TF-IDF value of a word specifies how important the word is to each document. My setting is any text classification task with multiple documents of different classes:
Let's take a lot of movie reviews with a feature 'sentiment' which is 0…

Nickkon
3
votes
0 answers
Delta TF-IDF right choice for multi classification problem
In their paper, Martineau & Finin describe a new approach called Delta TF-IDF. Instead of measuring how rare features are in the document, they weight these values by how biased they are to one corpus.
They do this by calculating…

jonas00
3
votes
2 answers
Understanding and interpreting the output of Spark's TF-IDF implementation
I am currently trying to understand what the example code provided as part of Spark's TF-IDF implementation is doing.
Given the example code block (taken from Spark's Github repository)
val sentenceData = sess.createDataFrame(Seq(
(0.0, "Hi I…

Jesús Zazueta
3
votes
0 answers
typical TFIDF range?
Forgive me if this is the wrong place for my question. My question is, I hope, fairly straightforward. Although I am using RStudio and the tm package, I think my question is more about the math behind the TFIDF score and is similar to an older…

Yossarianlives
3
votes
1 answer
Finding most statistically distinct text across classes
I want to find the most statistically distinct phrases in texts across different classes, where the texts have already been classified.
Suppose I have ~100,000 short text documents, where each document has a label. Let's say I have 100,000 text…

cjrieds
3
votes
1 answer
TFIDF for feature selection
Input to tfidf is a list of documents, output is a matrix which contains a numerical value for each (word, document) pair. How can I use that matrix to perform feature selection, i.e. reduce the size of the dictionary?

Baron Yugovich
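One simple heuristic for the setup this question describes is to score each term by its largest tf-idf value in any document and keep the top `k` columns. The sketch below assumes a dense matrix with documents as rows and vocabulary terms as columns; the helper name `select_top_terms` and the toy values are hypothetical. It is an unsupervised shortcut that ignores class labels, not the only or necessarily the best way to reduce the dictionary.

```python
def select_top_terms(matrix, vocabulary, k):
    """Keep the k terms whose maximum tf-idf over all documents is largest.

    `matrix` is a list of rows (one per document), with columns aligned
    to `vocabulary`. Returns the reduced vocabulary and reduced matrix.
    """
    # Score each column (term) by its maximum value across documents.
    scores = [max(column) for column in zip(*matrix)]
    ranked = sorted(range(len(vocabulary)), key=lambda j: scores[j], reverse=True)
    keep = sorted(ranked[:k])  # restore original column order
    reduced = [[row[j] for j in keep] for row in matrix]
    return [vocabulary[j] for j in keep], reduced

# Toy tf-idf matrix (values made up for illustration).
vocabulary = ["the", "cat", "dog", "bird"]
matrix = [
    [0.0, 0.2, 0.0, 0.0],
    [0.0, 0.1, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.7],
]

kept_terms, reduced_matrix = select_top_terms(matrix, vocabulary, k=2)
print(kept_terms)  # ['dog', 'bird']
```

Swapping `max` for a mean or a sum changes which terms survive, so the scoring function is worth checking against the downstream task.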
2
votes
1 answer
How to calculate tf-idf for a single term
I am following the tf-idf method described in this paper: Measuring, Predicting and Visualizing Short-Term Change in Word Representation and Usage in VKontakte Social Network.
In the paper I have linked above (see equation 2 in the paper), they have…

EmJ
2
votes
0 answers
Alternative to tf-idf with smaller penalty on previous usage
I am using tf-idf to find words that are particularly important to individual documents. This works pretty well for my purposes. However, one area where I feel like it isn't great is how harshly words are penalised for being in all other documents,…

Jaccar