How to calculate tf-idf for a single term

Question

I am following the tf-idf method described in this paper: Measuring, Predicting and Visualizing Short-Term Change in Word Representation and Usage in VKontakte Social Network.

In the paper linked above (see equation 2), they obtain only a single tf-idf value for each word (w) in each week (t).

For example, consider the below graph that I took from the above paper.

It shows how the tf-idf value of the word putin changed over the weeks, i.e. there is one tf-idf value for the word putin in each week.

I would like to implement the tf-idf approach that they have suggested. In other words, I would like to calculate a single tf-idf value for the word in each time period. However, I am struggling to find a way to implement this in Python.

Currently I am using the sklearn library to implement this. However, in the tutorials that I follow, a word can have multiple tf-idf values in a time period t. For example, consider the documents below in time period t.

The tf-idf values we get are as follows.

For example, the word "method" has 3 tf-idf scores according to my sklearn implementation. Hence, I am not sure if I am following the paper correctly.

My preferred language is python.

I am happy to provide more details if needed.

yoav_aaa · Accepted Answer · 2019-08-19T14:29:52.897

The modeling strategy suggested in the paper refers to the temporal representation (both frequency and context) of words.
From what I understand, they attempt to learn the changes in these representations across time.
One such representation is based on the tf-idf method.
In the mentioned equation, the parameter $t$ indicates the week's corpus.
This means that each word will have $n$ tf-idf representations, one for each of the $n$ weeks relevant to the modeling.

One way to implement this is to fit a new tf-idf transformer for each week and keep each (word, week) representation in a dictionary.

It is then possible to view the changes in each word's representation across time.
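As a rough illustration of the (word, week) dictionary idea, here is a minimal pure-Python sketch. The weekly posts, the week labels, and the helper name `weekly_tfidf` are all made up for the example. It also assumes the score is log(total count of the word in the week) times log(number of posts / number of posts containing the word), which may not match equation 2 of the paper exactly.

```python
import math
import re
from collections import Counter


def weekly_tfidf(posts):
    """Return {word: score}, one score per word, for one week's posts.

    Assumed formula: log(total count) * log(n_posts / document frequency).
    """
    n_posts = len(posts)
    total_counts = Counter()  # occurrences of each word across all posts
    doc_freq = Counter()      # number of posts that contain each word
    for post in posts:
        tokens = re.findall(r"\w+", post.lower())
        total_counts.update(tokens)
        doc_freq.update(set(tokens))
    return {
        word: math.log(total_counts[word]) * math.log(n_posts / doc_freq[word])
        for word in total_counts
    }


# Hypothetical data: week label -> posts published in that week.
posts_by_week = {
    "2019-W01": ["putin speaks today", "weather is nice", "putin putin news"],
    "2019-W02": ["weather is cold", "weather news", "putin visits"],
}

# One entry per (word, week) pair.
scores = {
    (word, week): value
    for week, posts in posts_by_week.items()
    for word, value in weekly_tfidf(posts).items()
}

# Time series for a single word; weeks where the word is absent get 0.0.
putin_series = [scores.get(("putin", week), 0.0) for week in sorted(posts_by_week)]
```

Here `putin_series` holds exactly one value per week, which is the shape needed to plot a word's score over time. Under this assumed formula a word that occurs only once in a week scores 0, because log(1) = 0.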

EDIT:
The Word Usage Dynamics statistic being used is computed over all posts (documents), not per document, meaning each word should have only one value per week. From what I gathered, there is no straightforward implementation for this in sklearn, but there possibly is one in NLTK/Gensim.

Still, it seems quite simple to implement on your own:

import math

from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer, _document_frequency

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
    'Is this the second cow?, why is it blue?',
]

count_vec = CountVectorizer(binary=False)
count_df = count_vec.fit_transform(corpus)
transformer = TfidfTransformer(use_idf=True, smooth_idf=False)
X1 = transformer.fit_transform(count_df)
posts_cnt = len(corpus)

# calculating tf-idf per word on all documents - using sklearn _document_frequency
# (a private helper; it counts the documents in which each term is non-zero)
vals = [math.log(x) * math.log(posts_cnt / float(y)) for x, y in
        zip(count_df.sum(axis=0).tolist()[0], _document_frequency(X1))]
# mapping tf-idf vals to original words
word_tfidf = {k: vals[v] for k, v in count_vec.vocabulary_.items()}
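To make the formula in the code above concrete, here is a hand check for the single term "document" that does not need sklearn. It is only a sketch: it assumes CountVectorizer's default tokenization (lowercasing, tokens of two or more word characters), and the variable names are chosen for the example.

```python
import math
import re

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
    'Is this the second cow?, why is it blue?',
]

term = "document"

# Lowercase and keep tokens of two or more word characters,
# mimicking CountVectorizer's default settings.
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in corpus]

total_count = sum(tokens.count(term) for tokens in tokenized)  # 1 + 2 + 0 + 1 + 0 = 4
doc_freq = sum(term in tokens for tokens in tokenized)          # 3 documents contain it

# Same expression as in the list comprehension above:
# log(total count) * log(number of documents / document frequency)
score = math.log(total_count) * math.log(len(corpus) / doc_freq)
print(round(score, 3))  # 0.708
```

Note that a term which occurs only once in the whole corpus, such as "cow", gets a score of 0 with this formula because log(1) = 0.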

Thank you very much for the answer. Yes, you are correct. But what I am not clear about is how they have calculated `tf-idf` for each word in each week. If you see the example in this link: https://www.ir-facility.org/scoring-and-ranking-techniques-tf-idf-term-weighting-and-cosine-similarity a word can have multiple tf-idf values even within the same week. Please kindly let me know your thoughts :) — EmJ, Aug 19 '19 at 12:22
I updated my question by explaining more about my current problem :) — EmJ, Aug 19 '19 at 12:38
thanks a lot for the edit. I saw it just now. I will run the code and let you know if it worked :) — EmJ, Aug 19 '19 at 23:18
Thanks for the explanation too. I understand it now. However, one quick question: are we using the tf-idf formula that they have mentioned in the paper? Or can tf-idf formulas change according to the library that we use? :) — EmJ, Aug 19 '19 at 23:20
