10

A traditional approach to feature construction for text mining is the bag-of-words approach, which can be enhanced with tf-idf to set up the feature vector characterizing a given text document. At present, I am trying to use a bigram (N-gram) language model to build the feature vector, but do not quite know how to do that. Can I just follow the bag-of-words approach, i.e., compute frequency counts in terms of bigrams instead of words, and enhance them using the tf-idf weighting scheme?

Antoine
user3125

3 Answers

4

Yes. That will generate many more features, though: it might be important to apply some cut-off (for instance, discard features such as bigrams or words that occur fewer than 5 times in your dataset) so as not to drown your classifier in too many noisy features.

ogrisel
  • Thanks. Do you mean that my general idea of computing each feature value in terms of bigrams (N-grams) is correct? In other words, there is no big difference in computing the feature values between the bag-of-words and N-gram models. Thanks for the clarification. – user3125 Apr 02 '12 at 14:44
  • Yes, you can use all bigrams + unigrams (words) together in one big bag of features (as long as you trim the least frequent with some cut-off level). – ogrisel Apr 02 '12 at 17:05
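Putting the answer and comments together, a minimal sketch of the whole pipeline using scikit-learn's TfidfVectorizer (the parameter values below are illustrative, not something prescribed in the answer): ngram_range=(1, 2) keeps both unigrams and bigrams, and min_df implements the frequency cut-off mentioned above.

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Toy corpus; substitute your own documents.
    docs = [
        "the quick brown fox jumps over the lazy dog",
        "the lazy dog sleeps all day",
        "a quick brown dog outruns a quick fox",
    ]

    # Unigrams + bigrams, tf-idf weighted. min_df trims rare features;
    # the answer suggests something like min_df=5 on a real dataset
    # (min_df=1 here only because the toy corpus is tiny).
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=1)
    X = vectorizer.fit_transform(docs)  # sparse document-term matrix

    print(X.shape)
    print(vectorizer.get_feature_names_out()[:10])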
3

The number of bigrams can be reduced by selecting only those with positive mutual information.

We did this to generate a bag-of-bigrams representation for the INEX XML Mining track; see http://www.inex.otago.ac.nz/tracks/wiki-mine/wiki-mine.asp.

What we did not try was using the mutual information between the terms to weight the bigrams. See https://en.wikipedia.org/wiki/Pointwise_mutual_information, https://www.eecis.udel.edu/~trnka/CISC889-11S/lectures/philip-pmi.pdf, and http://www.nltk.org/howto/collocations.html for a better explanation of pointwise mutual information for bigrams.
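As a rough sketch of the positive-PMI filtering step described above, using NLTK's collocation tools from the last link (the corpus and frequency cut-off are made up for illustration):

    from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

    # Toy token stream; in practice, tokenize your whole corpus.
    tokens = "the quick brown fox jumps over the lazy brown dog".split()

    finder = BigramCollocationFinder.from_words(tokens)
    finder.apply_freq_filter(1)  # hypothetical frequency cut-off

    # Keep only bigrams whose pointwise mutual information is positive.
    positive_bigrams = [
        bigram
        for bigram, score in finder.score_ngrams(BigramAssocMeasures.pmi)
        if score > 0
    ]
    print(positive_bigrams)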

See https://stackoverflow.com/questions/20018730/computing-pointwise-mutual-information-of-a-text-document-using-python and https://stackoverflow.com/questions/22118350/python-sentiment-analysis-using-pointwise-mutual-information for other questions related to this.

0

Using random projections to reduce the dimensionality of the data may prove useful for reducing the space required to store the features; see https://en.wikipedia.org/wiki/Random_projection. Random projection scales very well, and every example can be projected to a lower-dimensional space independently, without any direct optimization method such as PCA, SVD, Sammon maps, NMF, etc.
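For instance, a minimal sketch with scikit-learn's SparseRandomProjection (the matrix sizes and target dimensionality are arbitrary choices for illustration):

    import numpy as np
    from sklearn.random_projection import SparseRandomProjection

    # Stand-in for a high-dimensional bag-of-bigrams matrix:
    # 1000 documents, 50,000 features.
    rng = np.random.RandomState(0)
    X = rng.rand(1000, 50_000)

    # Project down to 500 dimensions; each example is transformed
    # independently and no iterative optimization is involved.
    projector = SparseRandomProjection(n_components=500, random_state=0)
    X_small = projector.fit_transform(X)

    print(X_small.shape)  # (1000, 500)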