
I would like to compare some algorithms for performing sentiment classification (Naive Bayes, SVM, and Random Forest). So far, I have collected about 100,000 unique opinions with the following distribution:

10% negative
90% positive

After some pre-processing (removing stop words, stemming, etc.) with the tm package, I obtained a document-term matrix with about 320,000 unique terms (100% sparsity). I decided to narrow it down to 99.8% sparsity, ending up with about 1,400 terms.
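The sparsity filter described above can be sketched in base R (a simplified stand-in for tm's `removeSparseTerms`; the toy `dtm` matrix and the threshold are illustrative, not the asker's actual data):

```r
# Sketch of the sparsity filter: drop every term that is absent from
# more than `max_sparsity` of the documents.
dtm <- matrix(c(1, 0, 0, 0,
                2, 1, 0, 0,
                1, 1, 1, 0), nrow = 3, byrow = TRUE,
              dimnames = list(NULL, c("good", "bad", "great", "awful")))

max_sparsity <- 0.70                 # keep terms present in >= 30% of docs
sparsity <- colMeans(dtm == 0)       # fraction of documents missing each term
dtm_dense <- dtm[, sparsity <= max_sparsity, drop = FALSE]
colnames(dtm_dense)                  # "awful" (absent everywhere) is dropped
```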

So by now my data set has about 100,000 rows and 1,400 columns.

To apply the Naive Bayes classifier (from the klaR package), I converted each term count into a simple factor, YES or NO, indicating whether the term occurs in the document.
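That binarisation step might look like the following base-R sketch (the klaR model fit itself is omitted; the small `dtm` matrix is a placeholder for the real document-term matrix):

```r
# Sketch: turn raw term counts into YES/NO factors, since klaR's NaiveBayes
# treats factor predictors as categorical.
dtm <- matrix(c(0, 2, 1,
                1, 0, 0), nrow = 2, byrow = TRUE,
              dimnames = list(NULL, c("good", "bad", "great")))

to_factor <- function(x) factor(ifelse(x > 0, "YES", "NO"),
                                levels = c("NO", "YES"))
dtm_factors <- as.data.frame(lapply(as.data.frame(dtm), to_factor))
str(dtm_factors)   # every column is now a two-level factor
```

Fixing `levels = c("NO", "YES")` up front matters: it keeps the factor levels identical across CV folds even when one level is absent from a fold.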

The problem is that when I try to perform 10-fold CV, almost every training fold contains some terms with zero variance. As far as I understand, changing the Laplace correction factor should address this (but it didn't). How should I approach this problem? What are the recommended practices?

Solutions that come to mind:

  1. Reduce the data set size (for example, 10,000 opinions in each class)
  2. Represent stop words as terms (lowering the probability of zero variance)
  3. Use bootstrapping instead of 10-fold CV (but then the algorithm complains about duplicated rows)
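A further option, not in the list above, is to drop the zero-variance terms separately inside each training fold before fitting. A minimal sketch (the tiny `X` data frame and fold indices are hypothetical):

```r
# Sketch (assumption: it is acceptable to drop constant columns per fold):
# for each CV fold, remove predictors that take a single value in the
# training part before fitting the model on that fold.
X <- data.frame(a = factor(c("YES", "YES", "YES", "NO")),
                b = factor(c("NO",  "YES", "NO",  "YES")))
train_idx <- c(1, 2, 3)    # one hypothetical training fold

is_constant <- vapply(X[train_idx, ],
                      function(col) length(unique(col)) < 2,
                      logical(1))
X_train <- X[train_idx, !is_constant, drop = FALSE]
names(X_train)             # "a" is constant in this fold, so only "b" survives
```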
Khozzy

2 Answers


Take a look at the nearZeroVar function in caret. My opinion is that you are better off getting rid of extremely sparse and unbalanced predictors prior to modeling (or using a tree or another model that is not affected by such predictors).
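For intuition, here is a base-R approximation of what `caret::nearZeroVar` flags (assuming caret's documented defaults, `freqCut = 95/5` and `uniqueCut = 10`; the helper name and toy vectors are illustrative):

```r
# A predictor is flagged when its most common value heavily dominates the
# second most common one AND it has very few distinct values overall.
near_zero_var <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  tab <- sort(table(x), decreasing = TRUE)
  freq_ratio <- if (length(tab) > 1) tab[[1]] / tab[[2]] else Inf
  pct_unique <- 100 * length(tab) / length(x)
  freq_ratio > freq_cut && pct_unique < unique_cut
}

x_rare   <- c(rep(0, 99), 1)   # term occurs in 1 of 100 documents
x_common <- rep(c(0, 1), 50)   # balanced term
near_zero_var(x_rare)          # TRUE  -> candidate for removal
near_zero_var(x_common)        # FALSE -> kept
```

In practice you would call `caret::nearZeroVar(dtm)` on the full matrix and drop the returned column indices before cross-validation.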

Max

topepo

Another option would be to apply PCA to the data before running your algorithms. This has many nice properties, including making all of your predictors orthogonal and relatively dense.
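A minimal sketch with `stats::prcomp` (applied to the numeric count matrix, i.e. before any YES/NO binarisation; the random toy matrix stands in for the real data):

```r
# PCA rotates the predictors into orthogonal components; downstream models
# then see dense, uncorrelated inputs instead of sparse raw term counts.
set.seed(1)
X <- matrix(rnorm(100 * 5), nrow = 100)   # 100 documents, 5 toy features
pca <- prcomp(X, center = TRUE, scale. = TRUE)
scores <- pca$x                           # documents in component space
cor(scores[, 1], scores[, 2])             # ~0: components are uncorrelated
```

You could also keep only the first few components (e.g. enough to explain 95% of the variance, via `summary(pca)`) to shrink the 1,400 columns further.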

Zach