10

A traditional approach to feature construction for text mining is the bag-of-words approach, which can be enhanced with tf-idf to set up the feature vector characterizing a given text document. At present, I am trying to use a bigram (N-gram) language model to build the feature vector, but do not quite know how to do that. Can I just follow the bag-of-words approach, i.e., compute frequency counts in terms of bigrams instead of words, and enhance them using the tf-idf weighting scheme?

Antoine
user3125

3 Answers

4

Yes. That will generate many more features, though: it might be important to apply some cut-off (for instance, discard features such as bigrams or words that occur fewer than 5 times in your dataset) so as not to drown your classifier in too many noisy features.

ogrisel
  • Thanks. Do you mean that my general idea of computing each feature value in terms of bigrams (N-grams) is correct? In other words, there is no big difference in computing the feature values between the bag-of-words and N-gram models. Thanks for the clarification. – user3125 Apr 02 '12 at 14:44
  • Yes, you can use all bigrams + unigrams (words) together in one big bag of features (as long as you trim the least frequent with some cut-off level). – ogrisel Apr 02 '12 at 17:05
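Putting the answer and comments together, a minimal sketch of the whole pipeline using scikit-learn's TfidfVectorizer (the parameter values below are illustrative, not something prescribed in the answer): ngram_range=(1, 2) keeps both unigrams and bigrams, and min_df implements the frequency cut-off mentioned above.

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Toy corpus; substitute your own documents.
    docs = [
        "the quick brown fox jumps over the lazy dog",
        "the lazy dog sleeps all day",
        "a quick brown dog outruns a quick fox",
    ]

    # Unigrams + bigrams, tf-idf weighted. min_df trims rare features;
    # the answer suggests something like min_df=5 on a real dataset
    # (min_df=1 here only because the toy corpus is tiny).
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=1)
    X = vectorizer.fit_transform(docs)  # sparse document-term matrix

    print(X.shape)
    print(vectorizer.get_feature_names_out()[:10])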
3

The number of bigrams can be reduced by selecting only those with positive mutual information.

We did this to generate a bag-of-bigrams representation for the INEX XML Mining track; see http://www.inex.otago.ac.nz/tracks/wiki-mine/wiki-mine.asp.

What we did not try was using the mutual information between the terms to weight the bigrams. See https://en.wikipedia.org/wiki/Pointwise_mutual_information, https://www.eecis.udel.edu/~trnka/CISC889-11S/lectures/philip-pmi.pdf, and http://www.nltk.org/howto/collocations.html for a better explanation of pointwise mutual information for bigrams.
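As a rough sketch of the positive-PMI filtering step described above, using NLTK's collocation tools from the last link (the corpus and frequency cut-off are made up for illustration):

    from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

    # Toy token stream; in practice, tokenize your whole corpus.
    tokens = "the quick brown fox jumps over the lazy brown dog".split()

    finder = BigramCollocationFinder.from_words(tokens)
    finder.apply_freq_filter(1)  # hypothetical frequency cut-off

    # Keep only bigrams whose pointwise mutual information is positive.
    positive_bigrams = [
        bigram
        for bigram, score in finder.score_ngrams(BigramAssocMeasures.pmi)
        if score > 0
    ]
    print(positive_bigrams)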

See https://stackoverflow.com/questions/20018730/computing-pointwise-mutual-information-of-a-text-document-using-python and https://stackoverflow.com/questions/22118350/python-sentiment-analysis-using-pointwise-mutual-information for other questions related to this.

0

Using random projections to reduce the dimensionality of the data may prove useful for reducing the space required to store the features; see https://en.wikipedia.org/wiki/Random_projection. Random projection scales very well, and every example can be projected to a lower-dimensional space independently, without any direct optimization method such as PCA, SVD, Sammon maps, NMF, etc.
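For instance, a minimal sketch with scikit-learn's SparseRandomProjection (the matrix sizes and target dimensionality are arbitrary choices for illustration):

    import numpy as np
    from sklearn.random_projection import SparseRandomProjection

    # Stand-in for a high-dimensional bag-of-bigrams matrix:
    # 1000 documents, 50,000 features.
    rng = np.random.RandomState(0)
    X = rng.rand(1000, 50_000)

    # Project down to 500 dimensions; each example is transformed
    # independently and no iterative optimization is involved.
    projector = SparseRandomProjection(n_components=500, random_state=0)
    X_small = projector.fit_transform(X)

    print(X_small.shape)  # (1000, 500)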