
I would like to classify a mail (spam = 1 / ham = 0) using logistic regression. My implementation is similar to this implementation and uses TensorFlow.

A mail is represented as a bag-of-words vector, where each entry counts how often a term appears in the mail. The idea is to multiply that vector with a parameter vector, pass the result through the sigmoid, and threshold the output to turn regression into classification:
$$y_{\text{predicted}} = \sigma(x_i^T\theta), \quad \text{with } \sigma(z) = \frac{1}{1 + e^{-z}}.$$
To calculate the loss, I am using the L2 loss (squared loss). Since I have a lot of training data, regularization does not seem necessary (training and testing accuracy are always very close). Still, I only reach a maximum accuracy of about 90% (both training and testing). How can I improve this?
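For reference, here is a minimal sketch of the setup described above. It assumes a TensorFlow 1.x-style graph API, and the vocabulary size, learning rate, and variable names are illustrative placeholders rather than my actual code:

```python
import tensorflow as tf

n_features = 10000  # hypothetical vocabulary size of the bag-of-words representation

X = tf.placeholder(tf.float32, [None, n_features])  # term counts per mail
y = tf.placeholder(tf.float32, [None, 1])           # 1 = spam, 0 = ham

theta = tf.Variable(tf.zeros([n_features, 1]))
bias = tf.Variable(tf.zeros([1]))

logits = tf.matmul(X, theta) + bias
y_pred = tf.sigmoid(logits)                         # sigma(x_i^T theta)

# squared (L2) loss on the sigmoid output, as described above
loss = tf.reduce_mean(tf.square(y - y_pred))
train_op = tf.train.GradientDescentOptimizer(learning_rate=0.1).minimize(loss)

# classification: predict spam when the sigmoid output exceeds 0.5
prediction = tf.cast(y_pred > 0.5, tf.float32)
```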

I already tried the following:

  • Use regularization (L1 and L2) with different strengths (does not seem necessary)

  • Use different learning rates

  • Use gradient descent, stochastic gradient descent, and mini-batch gradient descent (the hope is to avoid local minima in the loss function by introducing more variance with stochastic/mini-batch gradient descent)

  • Create more training data using SMOTE, since the classes were imbalanced (80/20 spam/ham); a sketch of the oversampling step follows this list
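
The oversampling step in the last bullet might look roughly like this, assuming the imbalanced-learn package; the data here is synthetic and only illustrates the 80/20 imbalance:

```python
import numpy as np
from imblearn.over_sampling import SMOTE

# illustrative imbalanced data: 80/20 spam/ham, as in my setting
X_train = np.random.rand(1000, 50)         # placeholder bag-of-words features
y_train = np.array([1] * 800 + [0] * 200)  # 1 = spam, 0 = ham

# SMOTE synthesizes new minority-class samples until the classes are balanced
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X_train, y_train)
```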

Things that I could still try:

  • Use a different loss function

Any other suggestions?


1 Answer


L2 loss for logistic regression is not convex, but the cross entropy loss is. I’d recommend making the switch because convexity is a really nice property to have during optimization. Convexity implies that you don’t have to worry about local minima because they don’t exist by definition.
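
In TensorFlow this switch might look roughly like the following sketch; it assumes a TF 1.x-style graph with the same bag-of-words setup as in the question, and the names are illustrative:

```python
import tensorflow as tf

n_features = 10000  # illustrative vocabulary size
X = tf.placeholder(tf.float32, [None, n_features])
y = tf.placeholder(tf.float32, [None, 1])
theta = tf.Variable(tf.zeros([n_features, 1]))
bias = tf.Variable(tf.zeros([1]))
logits = tf.matmul(X, theta) + bias

# cross-entropy (log) loss on the raw logits instead of squared error
# on the sigmoid output; this objective is convex in theta
loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=logits))
```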

A nice discussion of the mathematics comparing the convexity of log loss to the non-convexity of L2 loss can be found here: What is happening here, when I use squared loss in logistic regression setting?

The textbook way to estimate logistic regression coefficients is called Newton-Raphson updating, but I don't believe that it is implemented in TensorFlow since second-order methods are not generally used for neural networks. However, you might improve the rate of convergence if you use SGD + classical momentum or SGD + Nesterov momentum. Nesterov momentum is especially appealing in this case: since your problem is convex, the problem is more-or-less locally quadratic, and that is the use case where Nesterov momentum really shines.
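
As a sketch, the optimizer swap in TF 1.x-style code could look like this, applied to the cross-entropy loss above; the learning rate and momentum values are placeholders, not tuned settings:

```python
# SGD with Nesterov momentum; `loss` is the cross-entropy loss defined above
train_op = tf.train.MomentumOptimizer(
    learning_rate=0.01, momentum=0.9, use_nesterov=True).minimize(loss)
```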

  • Thank you very much for the suggestion. I will have a look into it and then report how good a result it gave me – User12547645 Aug 18 '18 at 15:12
  • Thank you again! I am now at more than 98% accuracy for training and testing, with training still going – User12547645 Aug 18 '18 at 20:58
  • That sounds like a pretty nice improvement, though. Almost 10%! -- in your post, you said you were getting 90% accuracy. – Sycorax Aug 18 '18 at 21:08
  • Yes, it is very impressive indeed! And it seems as though I still do not need any regularization, since training and testing accuracy are fairly close together – User12547645 Aug 19 '18 at 09:52