
I would like to compare some algorithms for performing sentiment classification (Naive Bayes, SVM, and Random Forest). So far, I have collected about 100,000 unique opinions with the following distribution:

10% negative
90% positive

After some pre-processing (removing stop words, stemming, etc.) with the tm package, I obtained a document-term matrix with about 320,000 unique terms (100% sparsity). I decided to narrow it down to 99.8% sparsity, ending up with about 1,400 terms.
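The sparsity filter described above can be sketched in base R (a simplified stand-in for tm's `removeSparseTerms`; the toy `dtm` matrix and the threshold are illustrative, not the asker's actual data):

```r
# Sketch of the sparsity filter: drop every term that is absent from
# more than `max_sparsity` of the documents.
dtm <- matrix(c(1, 0, 0, 0,
                2, 1, 0, 0,
                1, 1, 1, 0), nrow = 3, byrow = TRUE,
              dimnames = list(NULL, c("good", "bad", "great", "awful")))

max_sparsity <- 0.70                 # keep terms present in >= 30% of docs
sparsity <- colMeans(dtm == 0)       # fraction of documents missing each term
dtm_dense <- dtm[, sparsity <= max_sparsity, drop = FALSE]
colnames(dtm_dense)                  # "awful" (absent everywhere) is dropped
```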

So by now my data set has about 100,000 rows and 1,400 columns.

To apply the Naive Bayes classifier (from the klaR package), I converted each term count into a simple factor, YES or NO, indicating whether the term occurs in the document.
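That binarisation step might look like the following base-R sketch (the klaR model fit itself is omitted; the small `dtm` matrix is a placeholder for the real document-term matrix):

```r
# Sketch: turn raw term counts into YES/NO factors, since klaR's NaiveBayes
# treats factor predictors as categorical.
dtm <- matrix(c(0, 2, 1,
                1, 0, 0), nrow = 2, byrow = TRUE,
              dimnames = list(NULL, c("good", "bad", "great")))

to_factor <- function(x) factor(ifelse(x > 0, "YES", "NO"),
                                levels = c("NO", "YES"))
dtm_factors <- as.data.frame(lapply(as.data.frame(dtm), to_factor))
str(dtm_factors)   # every column is now a two-level factor
```

Fixing `levels = c("NO", "YES")` up front matters: it keeps the factor levels identical across CV folds even when one level is absent from a fold.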

The problem is that when I try to perform 10-fold CV, almost every training fold contains some terms with zero variance. As far as I understand, changing the Laplace correction factor should address this (but it didn't). How should I approach this problem? What are the recommended practices?

Solutions that come to mind:

  1. Reduce the data set size (for example, 10,000 opinions in each class)
  2. Represent stop words as terms (lowering the probability of zero variance)
  3. Use bootstrapping instead of 10-fold CV (but then the algorithm complains about duplicated rows)
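A further option, not in the list above, is to drop the zero-variance terms separately inside each training fold before fitting. A minimal sketch (the tiny `X` data frame and fold indices are hypothetical):

```r
# Sketch (assumption: it is acceptable to drop constant columns per fold):
# for each CV fold, remove predictors that take a single value in the
# training part before fitting the model on that fold.
X <- data.frame(a = factor(c("YES", "YES", "YES", "NO")),
                b = factor(c("NO",  "YES", "NO",  "YES")))
train_idx <- c(1, 2, 3)    # one hypothetical training fold

is_constant <- vapply(X[train_idx, ],
                      function(col) length(unique(col)) < 2,
                      logical(1))
X_train <- X[train_idx, !is_constant, drop = FALSE]
names(X_train)             # "a" is constant in this fold, so only "b" survives
```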
Khozzy

2 Answers


Take a look at the nearZeroVar function in caret. My opinion is that you are better off getting rid of extremely sparse and unbalanced predictors prior to modeling (or using a tree or another model that is not affected by such predictors).
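For intuition, here is a base-R approximation of what `caret::nearZeroVar` flags (assuming caret's documented defaults, `freqCut = 95/5` and `uniqueCut = 10`; the helper name and toy vectors are illustrative):

```r
# A predictor is flagged when its most common value heavily dominates the
# second most common one AND it has very few distinct values overall.
near_zero_var <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  tab <- sort(table(x), decreasing = TRUE)
  freq_ratio <- if (length(tab) > 1) tab[[1]] / tab[[2]] else Inf
  pct_unique <- 100 * length(tab) / length(x)
  freq_ratio > freq_cut && pct_unique < unique_cut
}

x_rare   <- c(rep(0, 99), 1)   # term occurs in 1 of 100 documents
x_common <- rep(c(0, 1), 50)   # balanced term
near_zero_var(x_rare)          # TRUE  -> candidate for removal
near_zero_var(x_common)        # FALSE -> kept
```

In practice you would call `caret::nearZeroVar(dtm)` on the full matrix and drop the returned column indices before cross-validation.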

Max

topepo

Another option would be to apply PCA to the data before running your algorithms. This has many nice properties, including making all of your predictors orthogonal and relatively dense.
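A minimal sketch with `stats::prcomp` (applied to the numeric count matrix, i.e. before any YES/NO binarisation; the random toy matrix stands in for the real data):

```r
# PCA rotates the predictors into orthogonal components; downstream models
# then see dense, uncorrelated inputs instead of sparse raw term counts.
set.seed(1)
X <- matrix(rnorm(100 * 5), nrow = 100)   # 100 documents, 5 toy features
pca <- prcomp(X, center = TRUE, scale. = TRUE)
scores <- pca$x                           # documents in component space
cor(scores[, 1], scores[, 2])             # ~0: components are uncorrelated
```

You could also keep only the first few components (e.g. enough to explain 95% of the variance, via `summary(pca)`) to shrink the 1,400 columns further.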

Zach