
I'm performing text classification experiments with scikit-learn on a small dataset (100 labelled texts written by patients and controls). I tested two supervised machine learning methods: an SVM (RBF kernel) and L2-regularized logistic regression.

The feature selection method is always the same: SelectKBest (k=200) with mutual information. I experimented with two feature sets: one including character n-grams and one without them. Feature extraction, selection and classification are done in a pipeline, so that all steps are cross-validated.
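Roughly, my setup looks like the sketch below. The TF-IDF vectorizers, the character n-gram range and the 5-fold CV are illustrative placeholders rather than my exact settings; the pipeline, SelectKBest with mutual information and the two classifiers are as described above.

    # Sketch of the setup; vectorizer settings and CV scheme are placeholders.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_selection import SelectKBest, mutual_info_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import FeatureUnion, Pipeline
    from sklearn.svm import SVC

    # Word features plus character n-grams; dropping the "chars" entry gives
    # the feature set without character n-grams.
    features = FeatureUnion([
        ("words", TfidfVectorizer(analyzer="word")),
        ("chars", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))),
    ])

    pipeline = Pipeline([
        ("features", features),
        ("select", SelectKBest(mutual_info_classif, k=200)),
        ("classify", SVC(kernel="rbf")),  # or LogisticRegression(penalty="l2")
    ])

    # texts: list of 100 raw documents; labels: 0 = control, 1 = patient (assumed encoding)
    # scores = cross_val_score(pipeline, texts, labels, cv=5, scoring="accuracy")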

My results showed that:

  • the SVM classifier performs better than the logistic regression on the feature set with the character n-grams.

  • the SVM classifier performs worse than the logistic regression on the feature set without the character n-grams.

I'm a beginner in supervised machine learning and I'm struggling with all the different possibilities. I can't find an explanation for these results and I can't wrap my head around it. Any suggestions?

  • It will help to know why you care which one is doing better or worse. Are you trying to get an even better model? Are you trying to learn something about the dataset by investigating this result? – jds Jul 28 '17 at 17:22
  • @jds I'm trying to compare both models, to see which model in combination with which feature set performs best on my dataset. – Bambi Jul 29 '17 at 10:25
