I am working on a movie review sentiment analysis project and am comparing several classifiers on the same dataset. The two classes are balanced, so I'm using accuracy under 3-fold cross-validation as the basic measure of performance.
How can I check whether one classifier is better than another with statistical significance? Is this a test I can run directly on the overall accuracy values, or do I need multiple accuracy values (one per fold), or even the individual classifications for each instance? Is a paired test (such as a paired t-test) applicable here?
Details:
Dataset: 1000 positive, 1000 negative reviews. Bag of unigrams (words).
Classifiers: Naive Bayes, SMO and LogReg.
Evaluation: Single accuracy percentage at the end of 3-fold stratified cross validation for each of the classifiers.
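To make the "individual classifications" option concrete, here is a minimal sketch of one paired test that works on per-instance predictions: an exact McNemar test. It is only an illustration of what such a test would need as input, not a claim that it is the right test for this setup. The function name `mcnemar_exact` and the toy labels are hypothetical; in practice the two prediction lists would be the held-out predictions of two classifiers collected across the folds.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar test on paired per-instance predictions.

    Returns (b, c, p_value), where b counts instances that only
    classifier A got right and c counts instances that only B got right.
    """
    b = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a == t and p != t)
    c = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a != t and p == t)
    n = b + c
    if n == 0:
        # The classifiers never disagree in correctness: no evidence either way.
        return b, c, 1.0
    # Under the null hypothesis, the discordant instances split as
    # Binomial(n, 0.5); sum the smaller tail and double it.
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy data (made up for illustration): A is always right, B misses 6 of 10.
y_true = [1] * 10
pred_a = [1] * 10
pred_b = [0] * 6 + [1] * 4
b, c, p_value = mcnemar_exact(y_true, pred_a, pred_b)
print(b, c, p_value)  # 6 0 0.03125
```

Only the instances on which the two classifiers disagree in correctness enter the test, which is why a single accuracy figure per classifier would not be enough input for it.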