I'm running a basic binary language classification task. There are two classes (0/1), and they are roughly evenly balanced (689/776). So far I've only built basic unigram language models and used those as the features. The document-term matrix has 125k terms before any reduction; I've cut it down to ~1,250 terms that occur in more than 20% of all documents.
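For concreteness, the reduction is just a document-frequency cutoff; a minimal sketch (assuming the raw counts sit in a plain document-by-term count matrix, here called dtm_counts, which is a placeholder name) would be:

# keep only terms that appear (count > 0) in more than 20% of documents
doc_freq    <- colSums(dtm_counts > 0) / nrow(dtm_counts)
dtm_reduced <- dtm_counts[, doc_freq > 0.20]
dim(dtm_reduced)  # roughly n_docs x 1250 in my case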

Training on this dataset gives me my best-performing model to date:

library(e1071)

# hold out roughly a third of the rows as a test set
index     <- 1:nrow(df.dtm)
testindex <- sample(index, trunc(length(index) / 3))
testset   <- df.dtm[testindex, ]
trainset  <- df.dtm[-testindex, ]

# inverse-frequency class weights, named by class level (labs is a factor)
wts <- 100 / table(trainset$labs)

# grid search over cost and gamma (RBF kernel by default)
tune.out <- tune(svm, labs ~ ., data = trainset, class.weights = wts,
                 ranges = list(cost  = c(0.001, 0.01, 0.1, 1, 5, 10, 100),
                               gamma = c(0.005, 0.01, 0.015, 0.02, 0.03, 0.04, 0.05)))

bestmod <- tune.out$best.model
ypred   <- predict(bestmod, testset)
table(predicted = ypred, truth = testset$labs)

         truth
predicted   0   1
        0  36  29
        1 200 223
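Putting numbers on it (just base R on the confusion table above): per-class recall is the diagonal over the column sums, roughly 0.15 for class 0 and 0.89 for class 1.

cm <- table(predicted = ypred, truth = testset$labs)
diag(cm) / colSums(cm)   # recall by true class: 36/236 and 223/252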

As you can see, performance is not good, but at least it is predicting some documents in the 0 class! In most of the models I've run so far, performance looks quite a bit worse than this. For instance, here is the exact same setup, but with tf-idf weights instead of raw term frequencies:

         truth
predicted   0   1
        0   1   0
        1 236 251

This is more typical of the models I've run, and I've seen the same results in Python using scikit-learn.
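(For reference, a tf-idf weighted DTM can be built with the tm package roughly like this; corp is a placeholder corpus name and this is only a sketch, not necessarily how mine was built.)

library(tm)
dtm_tf    <- DocumentTermMatrix(corp)   # raw term frequencies
dtm_tfidf <- DocumentTermMatrix(corp,
               control = list(weighting = function(x) weightTfIdf(x, normalize = TRUE)))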

I thought maybe there was something fishy with some of the features, so I tried taking random subsets of the features and fitting models to those (one way to do the subsetting is sketched just after the table). Here's what happens when I select a random 10% of the features and run the same model:

         truth
predicted   0   1
        0 116 123
        1 106 143
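The subsetting itself is nothing fancy; roughly this, keeping labs plus a random 10% of the term columns and then rerunning the same tune()/predict() steps:

set.seed(1)  # just so the subset is reproducible
term_cols <- setdiff(names(df.dtm), "labs")
keep      <- sample(term_cols, round(0.10 * length(term_cols)))
df.sub    <- df.dtm[, c("labs", keep)]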

So okay, performance isn't great, but at least I'm getting some predictions in the 0 class. Why are the predictions so strongly skewed toward one class when I include all of the features?

Is this expected behavior given the poor (really, nonexistent) feature selection? I would have expected classification to look more like a coin flip in that case, not a strong bias toward a single class...


2 Answers


Interesting. It's hard to answer the question directly, but two things I would try in order to diagnose it:

1) How do logistic regression and random forest fare?

2) By "fare" I mean looking at the calibration of the classifiers: bin the predicted class-1 probabilities and compare the average prediction in each bin with the observed fraction of 1s. Binarized (hard 0/1) posterior class probabilities on their own will not be very helpful. A rough sketch of both checks is below.
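A minimal sketch of what I mean, assuming the trainset/testset objects from the question and that labs is a factor with levels "0" and "1" (randomForest comes from the randomForest package; with ~1250 predictors and fewer rows, the plain glm fit may warn or be unstable, so treat it as a rough check):

library(randomForest)

# logistic regression: predicted P(labs == "1")
lr   <- glm(labs ~ ., data = trainset, family = binomial)
p_lr <- predict(lr, newdata = testset, type = "response")

# random forest: predicted P(labs == "1")
rf   <- randomForest(labs ~ ., data = trainset)
p_rf <- predict(rf, newdata = testset, type = "prob")[, "1"]

# crude calibration table: mean predicted probability vs observed rate per bin
calib <- function(p, y, k = 10) {
  bins <- cut(p, breaks = seq(0, 1, length.out = k + 1), include.lowest = TRUE)
  data.frame(mean_pred = tapply(p, bins, mean),
             obs_rate  = tapply(as.numeric(y == "1"), bins, mean),
             n         = as.vector(table(bins)))
}
calib(p_lr, testset$labs)
calib(p_rf, testset$labs)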


I'm not sure, but I would suggest adding more values to the cost and gamma grids when tuning.

The reason I say that is that gamma controls a sort of "smoothness" of the classifier. A very small gamma makes the kernel very wide, so even fairly distant points are treated as similar, the decision function becomes nearly flat, and almost everything ends up in one class.
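For reference, the RBF kernel that e1071's svm uses by default is $K(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)$, so as $\gamma \to 0$ every pair of points gets a kernel value near 1 and the fitted function barely varies across the input space.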

While it is great that you used a roughly logarithmic scale for your tuning, the gamma grid usually goes well beyond 0.05; I typically range from $2^{-15}$ to $2^5$. Try adding $0.5$ and $5$, for example.
(PS: you could use 10^(-10:5), which I find easier to write in R.)

Not so long ago I had a classifier whose best parameters were $C = 100$ and $\gamma = 10$, and it gave very poor results for $\gamma < 10$.

I hope this helps. If it doesn't, could you post your results for a linear-kernel SVM? Both suggestions are sketched below.
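Something along these lines, reusing the trainset and wts objects from the question (the exact ranges are just a habit of mine, not magic numbers, and a grid this size will take a while to cross-validate):

library(e1071)

# wider, purely exponential grids for cost and gamma
tune.wide <- tune(svm, labs ~ ., data = trainset, class.weights = wts,
                  ranges = list(cost  = 2^(-5:15),
                                gamma = 2^(-15:5)))
tune.wide$best.parameters
tune.wide$best.performance

# linear-kernel baseline for comparison (only cost to tune)
tune.lin <- tune(svm, labs ~ ., data = trainset, kernel = "linear",
                 class.weights = wts,
                 ranges = list(cost = 10^(-3:2)))
tune.lin$best.performance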

  • I think you would want to reduce C in this case. That makes the decision boundary smoother and less tightly fit to the training data. – B Seven Feb 11 '19 at 04:53