3

I have a corpus with 6,040,592 word tokens and 309,074 types (distinct words). Knowing this, is it possible to determine the optimal size of the bag-of-words vectors used to represent phrases?

I am using a data structure like this:

{'contains(The)': True, 'contains(waste)': False, 'contains(lot)': False, ...}

To represent this:

 The movie is lovely.
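
A minimal sketch of how such a 'contains(word)' dictionary can be built (the toy vocabulary and the naive tokenization below are placeholders for illustration, not the actual pipeline):

    # Build NLTK-style 'contains(word)' features over a fixed vocabulary.
    # The vocabulary here is a toy stand-in for the real list of types.
    vocabulary = ['The', 'waste', 'lot', 'movie', 'is', 'lovely']

    def contains_features(sentence, vocabulary):
        tokens = {t.strip('.,;:!?') for t in sentence.split()}  # naive tokenization
        return {'contains(%s)' % w: (w in tokens) for w in vocabulary}

    print(contains_features('The movie is lovely.', vocabulary))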

Could Zipf's law help determine how many words to include in the model?
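
For illustration, here is a rough sketch of how Zipf's law could be used: assuming an idealized Zipf exponent of 1 (real corpora deviate from this), it estimates what fraction of all tokens the k most frequent types cover.

    # Under Zipf's law with exponent 1, the frequency of the word at rank r
    # is proportional to 1/r, so the token coverage of the top k out of V
    # types is approximately H(k) / H(V), with H the harmonic number.
    def harmonic(n):
        return sum(1.0 / r for r in range(1, n + 1))

    V = 309074                        # number of types in the corpus
    H_V = harmonic(V)
    for k in (1000, 10000, 50000, 100000):
        print(k, harmonic(k) / H_V)   # estimated token coverage of the top-k types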

Tim
alemol
  • There are approximately 200k words in the English dictionary; you've got 6M... – Uri Goren Feb 26 '16 at 20:11
  • 1
    Have you removed punctuation and stop words from your corpus? Have you done any type of filtering? Where is the corpus from? Like @UriGoren said, it seems that you have too many different words. – Armen Aghajanyan Feb 26 '16 at 21:29
  • Where are we on this question? Did you receive an answer? – Uri Goren Mar 01 '16 at 20:18
  • Some extra information: the corpus is not in English; no filters were applied; my counts were approximated using basic bash commands: for tokens, $ wc corpus, and for types, $ cat corpus | tr " " "\n" | sort | uniq -c | wc -l – alemol Mar 01 '16 at 21:27
  • But the question is whether an optimal number of features can be estimated; using the entire dictionary is assumed to be the trivial solution. – alemol Mar 01 '16 at 21:37

2 Answers

1

Usually when using bag of words approaches, the entire dictionary is used.

The "Vector", which is more precisely called the Bag of words (BOW) if the counts of each word.

Note that stop-word removal, punctuation stripping, and stemming are usually applied before the BOW counting, so I would expect to see far fewer words in the dictionary.
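
A minimal sketch of that preprocessing before the counting step (the stop-word list is a toy placeholder and no real stemmer is applied):

    import string
    from collections import Counter

    STOP_WORDS = {'the', 'is', 'a', 'an', 'of'}   # toy list; use a real one

    def bow_counts(sentence):
        # lowercase, strip punctuation, drop stop words, then count
        table = str.maketrans('', '', string.punctuation)
        tokens = sentence.lower().translate(table).split()
        return Counter(t for t in tokens if t not in STOP_WORDS)

    print(bow_counts('The movie is lovely.'))   # Counter({'movie': 1, 'lovely': 1})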

The Zipf distribution basically shows us the likelihood of encountering an unseen word, given our current counts.

Unseen words can be handled with a Dirichlet prior, Lidstone smoothing, or even (God forbid) by simply filtering them out of the dictionary.
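
For example, a rough sketch of Lidstone (add-alpha) smoothing, with an arbitrary alpha:

    # Lidstone smoothing: P(w) = (count(w) + alpha) / (N + alpha * V),
    # where N is the total token count and V the vocabulary size.
    def lidstone(counts, vocab_size, alpha=0.1):
        total = sum(counts.values())
        denom = total + alpha * vocab_size
        return lambda word: (counts.get(word, 0) + alpha) / denom

    p = lidstone({'movie': 3, 'lovely': 1}, vocab_size=309074)
    print(p('movie'), p('some-unseen-word'))   # unseen words get a small nonzero probability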

Uri Goren
1

The answer is problem-specific. Imagine that you are building a classifier of something rather generic, for example, you use the bag-of-words as features for a classifier that marks a sentence as talking about food vs. something else. I guess that in such a case, you could discard all the stopwords and probably the vast majority of all the other words except the most common ones, and even if you were left with sentences such as "<unk> <unk> hot-dog <unk> great" it would be enough for something like naive Bayes to decide that it talks about food (see the sketch below).

Now imagine that you're building a different classifier, one that detects something more esoteric and context-dependent, where many rare words can make a difference in the meaning of a sentence (sarcasm, for an extreme example). In such a case, you would need to preserve many more words for it to work.
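
To make the truncation idea concrete, here is a sketch that keeps only the k most frequent words and maps everything else to an <unk> token (the corpus and the cut-off are made up):

    from collections import Counter

    def truncate_vocab(sentences, k):
        # Keep only the k most frequent words; map the rest to '<unk>'.
        counts = Counter(w for s in sentences for w in s.lower().split())
        keep = {w for w, _ in counts.most_common(k)}
        return [[w if w in keep else '<unk>' for w in s.lower().split()]
                for s in sentences]

    corpus = ['the hot-dog stand was great', 'the plot twist was great']
    print(truncate_vocab(corpus, k=3))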

The best way to decide whether your bag-of-words is big enough is to do this empirically, by verifying how changing its size impacts the model's performance (the metrics, but also edge cases: does it make the model more biased, less fair, etc.). You can start with the biggest vocabulary you can afford and check how much smaller it can be made without an unacceptable drop in performance. On the technical side, you can use techniques such as the hashing trick to shrink a large bag-of-words into a smaller cardinality.
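
One possible way to run that empirical sweep with scikit-learn (load_corpus, the naive Bayes classifier, and the candidate sizes are placeholders for illustration only):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.model_selection import cross_val_score

    texts, labels = load_corpus()         # hypothetical loader for your own data

    for size in (1000, 5000, 20000, 100000):
        X = CountVectorizer(max_features=size).fit_transform(texts)
        score = cross_val_score(MultinomialNB(), X, labels, cv=5).mean()
        print(size, score)                # watch where performance plateaus
    # HashingVectorizer(n_features=...) is an alternative that implements the hashing trick.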

Tim