I would like to compare some algorithms for performing sentiment classification (Naive Bayes, SVM, and Random Forest).
So far, I have collected about 100,000 unique opinions with the following class distribution:
- 10% negative
- 90% positive
After some pre-processing (stop-word removal, stemming, etc.) with the tm package, I obtained a document-term matrix with about 320,000 unique terms (nearly 100% sparsity). I decided to narrow it down to 99.8% sparsity, ending up with about 1,400 terms.
So by now my data set has about 100,000 rows and 1,400 columns.
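For context, the sparsity reduction was done roughly like this (the `opinions` vector and the exact threshold are illustrative, not my actual code):

```r
library(tm)

# Sketch of the pre-processing described above; 'opinions' is a
# hypothetical character vector holding the raw texts.
corpus <- VCorpus(VectorSource(opinions))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument)

dtm <- DocumentTermMatrix(corpus)     # ~320,000 terms, almost fully sparse
dtm <- removeSparseTerms(dtm, 0.998)  # keep terms present in >= 0.2% of documents
```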
To apply a Naive Bayes classifier (from the klaR package), I changed each term occurrence in a document to a simple factor, YES or NO, indicating whether the term occurred in that document.
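The conversion itself can be done in base R; here is a minimal sketch on a toy count matrix (`m` is made up for illustration):

```r
# Toy document-term count matrix (2 documents x 3 terms)
m <- matrix(c(0, 2, 1, 0, 3, 0), nrow = 2,
            dimnames = list(NULL, c("good", "bad", "great")))

# Recode counts as two-level factors; fixing the levels up front keeps
# both levels defined even for terms that never (or always) occur
to_factor <- function(counts) factor(ifelse(counts > 0, "YES", "NO"),
                                     levels = c("NO", "YES"))
binarized <- as.data.frame(lapply(as.data.frame(m), to_factor))
```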
The problem is that when I try to perform 10-fold CV, almost all of the training folds contain some terms with zero variance. As far as I understand, adjusting the Laplace correction factor should address this, but it didn't. How should I approach this problem? What are the recommended practices?
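One workaround I have considered is dropping the offending columns inside each training fold before fitting the model; a base-R sketch (the function name and toy fold are mine):

```r
# Remove predictors that are constant within a given training fold
drop_zero_var <- function(train_df) {
  keep <- vapply(train_df, function(col) length(unique(col)) > 1, logical(1))
  train_df[, keep, drop = FALSE]
}

# Toy fold: 'rare' never occurs in this fold, so it is constant and dropped
fold <- data.frame(good = factor(c("YES", "NO")),
                   rare = factor(c("NO", "NO"), levels = c("NO", "YES")))
cleaned <- drop_zero_var(fold)
```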
Solutions that come to my mind:
- Reduce the data set size (for example, 10,000 opinions in each class)
- Keep stop words as terms (their frequency lowers the probability of zero variance)
- Use bootstrapping instead of 10-fold CV (but then the algorithm complains about duplicated rows)
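The first idea could be sketched as a stratified downsample (the function and column names below are illustrative, not an existing API):

```r
set.seed(42)

# Sample at most n_per_class rows from each class of 'label_col'
downsample <- function(df, label_col, n_per_class) {
  idx <- unlist(lapply(split(seq_len(nrow(df)), df[[label_col]]),
                       function(i) sample(i, min(n_per_class, length(i)))))
  df[sort(idx), , drop = FALSE]
}

# Toy example: 6 positive and 2 negative rows -> 2 of each after downsampling
toy <- data.frame(label = c(rep("positive", 6), rep("negative", 2)),
                  x = 1:8)
balanced <- downsample(toy, "label", 2)
```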