Highest Voted 'tf-idf' Questions - Statistical Analysis Stack Exchange

19

votes

2 answers

How is the .similarity method in SpaCy computed?

Not sure if this is the right stack site, but here goes. How does the .similarity method work? Wow, spaCy is great! Its tfidf model could be easier, but w2v with only one line of code?! In his 10 line tutorial on spaCy, andrazhribernik shows us the…

asked Sep 21 '17 at 02:40

whs2k

451
1
3
10

8

votes

1 answer

Why does Lucene IDF have a seemingly additional +1?

From the Lucene docs: $\text{IDF} = 1 + \log\left(\frac{\text{numDocs}}{\text{docFreq}+1}\right)$. In other references (e.g. Wikipedia), IDF is typically calculated as $\log\left(\frac{\text{numDocs}}{\text{docFreq}}\right)$ or…

information-retrieval tf-idf

asked May 13 '15 at 18:10

Greg Dean

83
4
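
To make the two formulas quoted in the Lucene question above easier to compare, here is a minimal sketch in Python. The function names are hypothetical, and the second function implements only the formula as quoted in the question, not Lucene's actual code. One possible reading is that the +1 in the denominator keeps the value defined when docFreq is 0, and the leading 1 keeps the weight positive for terms that occur in every document.

```python
import math


def idf_textbook(num_docs: int, doc_freq: int) -> float:
    """log(numDocs / docFreq); raises ZeroDivisionError when doc_freq is 0."""
    return math.log(num_docs / doc_freq)


def idf_as_quoted(num_docs: int, doc_freq: int) -> float:
    """1 + log(numDocs / (docFreq + 1)), the formula quoted in the question."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


if __name__ == "__main__":
    # Compare both variants for a rare, a common, and a ubiquitous term.
    for df in (1, 100, 1000):
        print(df, idf_textbook(1000, df), idf_as_quoted(1000, df))
```

With the textbook form, a term that appears in every document gets weight 0; with the quoted form it keeps a weight just below 1.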

7

votes

2 answers

Which weighting factor to use for text categorization

I am working on a text categorization task, and I have 21,000 documents for training and (for the time being) 7,000 documents for testing. I construct the doc-term matrix for both training corpus and testing corpus, with two different weighting…

machine-learning data-mining text-mining tf-idf

asked Jun 06 '12 at 20:51

Ensom Hodder

455
5
14

6

votes

1 answer

Difference between Log Entropy Model and TF-IDF Model?

I would like to understand the differences and advantages of using TF-IDF versus the Log Entropy model for representing documents and queries in an information retrieval system using different weights. I've tested both of them and computed the recall…

natural-language information-retrieval nltk tf-idf

asked May 30 '16 at 17:18

yolanda_dlh

63
1
5

5

votes

1 answer

What does word embedding weighted by tf-idf mean?

The paper that I am reading explains about how it implemented the feature vector used for a twitter sentiment classification task. The first is a simple combination, where each tweet is represented by the average of the word embedding vectors of…

machine-learning natural-language word2vec word-embeddings tf-idf

asked Dec 14 '17 at 20:54

dawn

51
1
2
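
The excerpt of the word-embedding question above is truncated, so the paper's exact scheme is not known here. As a sketch of one common interpretation, each word vector can be scaled by the word's tf-idf weight before averaging. The function name, the toy vectors, and the choice to normalise by the sum of the weights are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List


def tfidf_weighted_embedding(
    tokens: List[str],
    embeddings: Dict[str, List[float]],
    idf: Dict[str, float],
) -> List[float]:
    """Average the word vectors of `tokens`, weighting each by tf * idf.

    Tokens without an embedding are skipped; tokens without an idf entry
    get weight 0. Normalising by the weight sum is an assumption.
    """
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    counts = Counter(t for t in tokens if t in embeddings)
    for token, tf in counts.items():
        weight = tf * idf.get(token, 0.0)
        weight_sum += weight
        for i, value in enumerate(embeddings[token]):
            total[i] += weight * value
    if weight_sum == 0.0:
        return total  # all-zero vector when nothing could be weighted
    return [value / weight_sum for value in total]


if __name__ == "__main__":
    toy_embeddings = {"good": [1.0, 0.0], "the": [0.0, 1.0]}
    toy_idf = {"good": 3.0, "the": 1.0}
    print(tfidf_weighted_embedding(["the", "good"], toy_embeddings, toy_idf))
```

Compared with a plain average, the rarer word ("good" in the toy data) pulls the tweet vector further towards its own embedding.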

4

votes

0 answers

Google gender-pay gap vs

Background: I read this: Google schools US government about gender pay gap. It derives from this Google blog post by Eileen Naughton, VP of People Operations. She asserts that Google is somehow "sharing" its top-level analysis publicly. Top-level…

machine-learning classification natural-language tf-idf bag-of-words

asked Apr 11 '17 at 15:33

EngrStudent

8,232
2
29
82

4

votes

2 answers

How to use TFIDF-vectors with Multinomial Naive-Bayes?

Say we have used the TFIDF transform to encode documents into continuous-valued features. How would we now use this as input to a Naive Bayes classifier? Bernoulli naive-bayes is out, because our features aren't binary anymore. Seems like we can't…

scikit-learn naive-bayes tf-idf

asked Apr 05 '17 at 01:19

dhrumeel

281
1
2
8
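
For the Naive Bayes question above: the multinomial likelihood is strictly defined for integer counts, but the same estimation formulas can be evaluated on fractional weights as a heuristic (the scikit-learn documentation for MultinomialNB notes that fractional counts such as tf-idf may also work in practice). The following is a minimal pure-Python sketch of that idea, not scikit-learn's implementation; the function names and the toy data are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple

Row = Dict[str, float]  # term -> non-negative weight, e.g. a tf-idf value


def train_multinomial_nb(
    rows: List[Row], labels: List[Hashable], alpha: float = 1.0
) -> Dict[Hashable, Tuple[float, Dict[str, float]]]:
    """Fit class log-priors and smoothed per-term log-likelihoods."""
    vocab = {term for row in rows for term in row}
    docs_per_class: Dict[Hashable, int] = defaultdict(int)
    weight_per_class: Dict[Hashable, Dict[str, float]] = defaultdict(dict)
    for row, label in zip(rows, labels):
        docs_per_class[label] += 1
        for term, weight in row.items():
            current = weight_per_class[label].get(term, 0.0)
            weight_per_class[label][term] = current + weight
    model = {}
    for label, n_docs in docs_per_class.items():
        weights = weight_per_class[label]
        denom = sum(weights.values()) + alpha * len(vocab)
        log_lik = {t: math.log((weights.get(t, 0.0) + alpha) / denom) for t in vocab}
        model[label] = (math.log(n_docs / len(labels)), log_lik)
    return model


def predict(model, row: Row) -> Hashable:
    """Return the class with the highest (unnormalised) log-posterior."""
    def score(label):
        log_prior, log_lik = model[label]
        # Terms unseen during training are ignored.
        return log_prior + sum(w * log_lik[t] for t, w in row.items() if t in log_lik)
    return max(model, key=score)


if __name__ == "__main__":
    X = [{"great": 0.9, "film": 0.3}, {"awful": 0.8, "film": 0.4}]
    y = ["pos", "neg"]
    print(predict(train_multinomial_nb(X, y), {"great": 0.7}))
```

The only change from the count-based classifier is that the per-class term totals are sums of weights rather than sums of counts; whether this helps on a given task is an empirical question.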

3

votes

1 answer

Tf-idf for text classification: On what should IDF be calculated?

The TF-IDF value of a word specifies how important the word is for each document. My setting is any text classification where one has multiple documents with different classes: Let's take a lot of movie reviews with a feature 'sentiment' which is 0…

r classification feature-selection text-mining tf-idf

asked Aug 31 '18 at 12:38

Nickkon

71
7

3

votes

0 answers

Delta TF-IDF right choice for multi classification problem

In their paper, Martineau & Finin describe a new approach called Delta TF-IDF. Instead of measuring how rare features are in the document, they weight these values by how biased they are to one corpus. They do this by calculating…

text-mining sentiment-analysis tf-idf

asked Jul 26 '18 at 17:56

jonas00

81
4

3

votes

2 answers

Understanding and interpreting the output of Spark's TF-IDF implementation

I am currently trying to understand what the example code provided as part of Spark's TF-IDF implementation is doing. Given the example code block (taken from Spark's Github repository) val sentenceData = sess.createDataFrame(Seq( (0.0, "Hi I…

machine-learning text-mining spark-mllib tf-idf

asked Nov 01 '17 at 16:24

Jesús Zazueta

201
2
7

3

votes

0 answers

typical TFIDF range?

Forgive me if this is the wrong place for my question. My question is, I hope, fairly straightforward. Although I am using RStudio and the tm package, I think my question is more about the math behind the TFIDF score and is similar to an older…

text-mining tf-idf

asked Mar 20 '17 at 14:32

Yossarianlives

31
1
3

3

votes

1 answer

Finding most statistically distinct text across classes

I want to find the most statistically distinct phrases in texts across different classes, where the texts have already been classified. Suppose I have ~100,000 short text documents, where each document has a label. Let's say I have 100,000 text…

logistic text-mining natural-language tf-idf

asked Mar 03 '17 at 04:28

cjrieds

133
4

3

votes

1 answer

TFIDF for feature selection

Input to tfidf is a list of documents, output is a matrix which contains a numerical value for each (word, document) pair. How can I use that matrix to perform feature selection, i.e. reduce the size of the dictionary?

feature-selection natural-language tf-idf

asked Aug 21 '16 at 18:55

Baron Yugovich

515
1
6
18
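
As a sketch of one possible heuristic for the feature-selection question above (not a recommended or canonical answer), each term can be scored by its maximum tf-idf value across documents and only the top k terms kept. The function name, the sparse dict representation of the matrix, and the choice of the maximum as the aggregate are assumptions; a mean or sum would be equally easy to substitute.

```python
from typing import Dict, List


def top_k_terms_by_max_tfidf(tfidf_rows: List[Dict[str, float]], k: int) -> List[str]:
    """Rank terms by their largest tf-idf value in any document; keep k.

    `tfidf_rows` is a sparse matrix given as one {term: weight} dict per
    document. Ties are broken alphabetically so the result is deterministic.
    """
    best: Dict[str, float] = {}
    for row in tfidf_rows:
        for term, weight in row.items():
            best[term] = max(best.get(term, 0.0), weight)
    ranked = sorted(best, key=lambda term: (-best[term], term))
    return ranked[:k]


if __name__ == "__main__":
    rows = [{"a": 0.1, "b": 0.9}, {"a": 0.2, "c": 0.5}]
    print(top_k_terms_by_max_tfidf(rows, 2))
```

Because this score ignores class labels, it is an unsupervised filter; a supervised criterion may suit classification tasks better.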

2

votes

1 answer

How to calculate tf-idf for a single term

I am following the tf-idf method described in this paper: Measuring, Predicting and Visualizing Short-Term Change in Word Representation and Usage in VKontakte Social Network. In the paper I have linked above (see equation 2 in the paper), they have…

mathematical-statistics python scikit-learn data-mining tf-idf

asked Aug 19 '19 at 08:44

EmJ

592
3
15

2

votes

0 answers

Alternative to tf-idf with smaller penalty on previous usage

I am using tf-idf to find words that are particularly important to individual documents. This works pretty well for my purposes. However, one area where I feel like it isn't great is how harshly words are penalised for being in all other documents,…

tf-idf

asked Jun 28 '19 at 16:11

Jaccar

141
2
