
I am trying to implement text summarization using the Log-Likelihood Ratio. In section 2.1 of https://www.cs.bgu.ac.il/~elhadad/nlp16/nenkova-mckeown.pdf I do not understand what they really mean by the background and foreground corpus. In the case of having just one document, does the foreground corpus mean the sentence for which I am computing the LLR, and the background corpus the rest of the document?

Franck Dernoncourt

1 Answer


Well, let's for a moment take your question as our document and analyse it. For example, it contains words like:
I, am, as, in, the
What can we conclude from these 5 words? Well, we can be pretty sure that the text is in English. But for finding the topic of the question, they aren't of much use. As long as the question is in English, a question about candy will probably contain these words as well. They only confirm what we already know (that the question is indeed in English). So we call these words the background.
Now consider words like:
corpus, LLR, document, text, summary
These words are much rarer in English texts, but they do appear in your question. That makes them much more useful in determining the topic of your question. A question about candy is very unlikely to contain the word 'corpus'. So we call these words the foreground.
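To make the split concrete, here is a toy Python sketch. The tiny stopword list and the sample sentence are made-up assumptions for illustration; they stand in for a real background corpus:

```python
from collections import Counter

# Toy stand-in for a background model: a handful of common function
# words. In a real system this would come from a large reference corpus.
BACKGROUND_WORDS = {"i", "am", "as", "in", "the", "a", "of", "to", "is", "it"}

text = ("i am trying to implement a text summary using llr "
        "as explained in the paper i do not understand the corpus")

counts = Counter(text.split())

# Background words confirm the language; foreground words hint at the topic.
background = {w: c for w, c in counts.items() if w in BACKGROUND_WORDS}
foreground = {w: c for w, c in counts.items() if w not in BACKGROUND_WORDS}
```

Here words like 'corpus' and 'llr' end up in the foreground dictionary, while 'the' and 'i' land in the background, mirroring the split described above.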

Here I am of course simplifying things. A random English text might, for example, also contain the word 'document'. However, in such a random English text the word might appear on average once per 1,000 words. Suppose now that you have a text of 15,000 words in which the word 'document' appears 40 times. From the background alone, we expect the word to appear about 15 times. So those 15 occurrences are part of the background; we assume they are there simply because it is an English text of 15,000 words. That leaves us with 25 occurrences of the word 'document' which are unexplained by the background. So we assume those are part of the foreground: they are there specifically because they relate to the topic of the text.
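This intuition is what the log-likelihood ratio formalises. Below is a minimal sketch of Dunning's LLR statistic using the counts from the example above; the background corpus size of 1,000,000 words (with 1,000 occurrences, i.e. the once-per-1,000-words rate) is an assumption made up for illustration:

```python
import math

def log_l(k, n, p):
    # Binomial log-likelihood of observing k occurrences in n words
    # when each word is the target with probability p (requires 0 < p < 1).
    return k * math.log(p) + (n - k) * math.log(1 - p)

def llr(k_fg, n_fg, k_bg, n_bg):
    """Dunning's log-likelihood ratio for one word.

    k_fg, n_fg: word count and total words in the foreground (input) text.
    k_bg, n_bg: word count and total words in the background corpus.
    A large score means the foreground rate is poorly explained by the
    background rate alone.
    """
    p_fg = k_fg / n_fg                      # rate in the input text
    p_bg = k_bg / n_bg                      # rate in the background corpus
    p_all = (k_fg + k_bg) / (n_fg + n_bg)   # pooled rate under the null hypothesis
    return 2 * (log_l(k_fg, n_fg, p_fg)
                + log_l(k_bg, n_bg, p_bg)
                - log_l(k_fg, n_fg, p_all)
                - log_l(k_bg, n_bg, p_all))

# The example from above: 'document' appears 40 times in a 15,000-word
# text, against a background rate of about 1 per 1,000 words.
score = llr(40, 15_000, 1_000, 1_000_000)
```

A word that appears at exactly the background rate (15 times in 15,000 words here) scores roughly zero, while the 40 observed occurrences produce a clearly positive score, so ranking words or sentences by this score surfaces the topic-specific vocabulary.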

Another simplification I made can be exemplified by the word 'computing'. Is this a common background word? Well, in the English language as a whole, this might be a pretty rare word (I didn't check). But in questions on Cross Validated, it might be a pretty common word. So the question then becomes: what do you take as your background? General English-language texts, or Cross Validated questions? And if you choose general English, how do you find a representative selection of English-language texts? These are all things to consider when doing topic analysis on text.

dimpol
  • Thank you for the clarification of the foreground/background terms. If we have a corpus on a specific topic, e.g. documents about automobiles or finance, then I guess applying tf-idf ranking and finding an appropriate threshold could help in differentiating between foreground and background words. But in the case of designing a standalone summarizer without any prior knowledge of the document, I guess this approach makes less sense. Probably the TextRank algorithm would be a better approach. Do you have any thoughts on that? – Tacy Nathan Oct 24 '16 at 10:36
  • Take a look at smmry: http://smmry.com/about This gives an overview of one way of summarizing text; unfortunately their summary is quite short. – dimpol Oct 24 '16 at 10:48