
I would like to classify a mail (spam = 1 / ham = 0) using logistic regression. My implementation is similar to this implementation and uses TensorFlow.

A mail is represented as a bag-of-words vector, where each entry counts how often a term appears in the mail. The idea is to multiply that vector with a parameter vector, pass the result through the sigmoid, and threshold the output to turn regression into classification:
$$y_{\text{predicted}} = \sigma(x_i^T\theta), \quad \text{with } \sigma(z) = \frac{1}{1 + e^{-z}}.$$
To calculate the loss, I am using the L2 loss (squared loss). Since I have a lot of training data, regularization does not seem necessary (training and testing accuracy are always very close). Still, I only reach a maximum accuracy of about 90% (both training and testing). How can I improve this?
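For reference, here is a minimal sketch of the setup described above. It assumes a TensorFlow 1.x-style graph API, and the vocabulary size, learning rate, and variable names are illustrative placeholders rather than my actual code:

```python
import tensorflow as tf

n_features = 10000  # hypothetical vocabulary size of the bag-of-words representation

X = tf.placeholder(tf.float32, [None, n_features])  # term counts per mail
y = tf.placeholder(tf.float32, [None, 1])           # 1 = spam, 0 = ham

theta = tf.Variable(tf.zeros([n_features, 1]))
bias = tf.Variable(tf.zeros([1]))

logits = tf.matmul(X, theta) + bias
y_pred = tf.sigmoid(logits)                         # sigma(x_i^T theta)

# squared (L2) loss on the sigmoid output, as described above
loss = tf.reduce_mean(tf.square(y - y_pred))
train_op = tf.train.GradientDescentOptimizer(learning_rate=0.1).minimize(loss)

# classification: predict spam when the sigmoid output exceeds 0.5
prediction = tf.cast(y_pred > 0.5, tf.float32)
```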

I already tried the following:

  • Use regularization (L1 and L2) with different strengths (does not seem necessary)

  • Use different learning rates

  • Use gradient descent, stochastic gradient descent, and mini-batch gradient descent (the hope is to avoid local minima in the loss function by introducing more variance with stochastic/mini-batch gradient descent)

  • Create more training data using SMOTE, since the classes were imbalanced (80/20 spam/ham); a sketch of the oversampling step follows this list
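
The oversampling step in the last bullet might look roughly like this, assuming the imbalanced-learn package; the data here is synthetic and only illustrates the 80/20 imbalance:

```python
import numpy as np
from imblearn.over_sampling import SMOTE

# illustrative imbalanced data: 80/20 spam/ham, as in my setting
X_train = np.random.rand(1000, 50)         # placeholder bag-of-words features
y_train = np.array([1] * 800 + [0] * 200)  # 1 = spam, 0 = ham

# SMOTE synthesizes new minority-class samples until the classes are balanced
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X_train, y_train)
```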

Things that I could still try:

  • Use a different loss function

Any other suggestions?


1 Answer


L2 loss for logistic regression is not convex, but the cross entropy loss is. I’d recommend making the switch because convexity is a really nice property to have during optimization. Convexity implies that you don’t have to worry about local minima because they don’t exist by definition.
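
In TensorFlow this switch might look roughly like the following sketch; it assumes a TF 1.x-style graph with the same bag-of-words setup as in the question, and the names are illustrative:

```python
import tensorflow as tf

n_features = 10000  # illustrative vocabulary size
X = tf.placeholder(tf.float32, [None, n_features])
y = tf.placeholder(tf.float32, [None, 1])
theta = tf.Variable(tf.zeros([n_features, 1]))
bias = tf.Variable(tf.zeros([1]))
logits = tf.matmul(X, theta) + bias

# cross-entropy (log) loss on the raw logits instead of squared error
# on the sigmoid output; this objective is convex in theta
loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=logits))
```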

A nice discussion of the mathematics comparing the convexity of log loss to the non-convexity of L2 loss can be found here: What is happening here, when I use squared loss in logistic regression setting?

The textbook way to estimate logistic regression coefficients is called Newton-Raphson updating, but I don't believe that it is implemented in TensorFlow since second-order methods are not generally used for neural networks. However, you might improve the rate of convergence if you use SGD + classical momentum or SGD + Nesterov momentum. Nesterov momentum is especially appealing in this case: since your problem is convex, the problem is more-or-less locally quadratic, and that is the use case where Nesterov momentum really shines.
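
As a sketch, the optimizer swap in TF 1.x-style code could look like this, applied to the cross-entropy loss above; the learning rate and momentum values are placeholders, not tuned settings:

```python
# SGD with Nesterov momentum; `loss` is the cross-entropy loss defined above
train_op = tf.train.MomentumOptimizer(
    learning_rate=0.01, momentum=0.9, use_nesterov=True).minimize(loss)
```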

  • Thank you very much for the suggestion. I will have a look into it and then report how good a result it gave me – User12547645 Aug 18 '18 at 15:12
  • Thank you again! I am now at more than 98% accuracy for training and testing, with training still going – User12547645 Aug 18 '18 at 20:58
  • That sounds like a pretty nice improvement, though. Almost 10%! -- in your post, you said you were getting 90% accuracy. – Sycorax Aug 18 '18 at 21:08
  • Yes, it is very impressive indeed! And it seems as though I still do not need any regularization, since training and testing accuracy are fairly close together – User12547645 Aug 19 '18 at 09:52