
I am using an LR (logistic regression) model and it's got 80% prediction accuracy on test data. For the 20% where it has predicted wrongly, I know the right answer, of course. I wonder if there is some optimisation method where I can take the trained LR model and iterate the weights inside the model until the 20% failures become, say, 10%. I could maybe use an Evolutionary Strategy. Has anybody done that with success, or would this be a bad idea because it would lead to an overfit model?


I tried what I suggested and got a small improvement in accuracy, but not much: less than 0.5 percent. This was using hill climbing to improve the weights from the model. I am trying genetic algorithms next. I am doing binary classification with a balanced dataset, so a threshold of 0.5 should be fine, as far as I know.
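Roughly what I did, as a sketch (synthetic data stands in for my own here just to make the snippet self-contained; the step size and iteration count are arbitrary):

```python
# Hill climbing on the fitted model's weights, keeping a perturbation only if
# it improves test accuracy. Note this tunes directly against the test set,
# which is exactly the overfitting worry in my question.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

rng = np.random.default_rng(0)
best_coef = model.coef_.copy()
best_acc = accuracy_score(y_test, model.predict(X_test))

for _ in range(1000):
    # Perturb the coefficients slightly (the intercept is left alone here).
    model.coef_ = best_coef + rng.normal(scale=0.01, size=best_coef.shape)
    acc = accuracy_score(y_test, model.predict(X_test))
    if acc > best_acc:
        best_acc, best_coef = acc, model.coef_.copy()

model.coef_ = best_coef
print(f"hill-climbed test accuracy: {best_acc:.3f}")
```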

My data is balanced, so accuracy is a good measure of model quality, as far as I know.

I am using the Python sklearn implementation of LR, by the way. I have already optimised the two parameters "C" and "penalty".
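For reference, this is roughly how I tuned those two parameters (a sketch reusing `X_train` and `y_train` from the snippet above; the grid values are illustrative):

```python
# Grid search over C and penalty; liblinear is used here because it supports
# both l1 and l2 penalties for binary problems (lbfgs only supports l2).
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {
    "C": [0.01, 0.1, 1, 10, 100],
    "penalty": ["l1", "l2"],
}
search = GridSearchCV(
    LogisticRegression(solver="liblinear", max_iter=1000),
    param_grid,
    scoring="accuracy",
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```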

brownie74
  • Logistic regression *is* optimized, with respect to likelihood, but not with respect to accuracy. [Accuracy is not the best way to measure model quality.](https://stats.stackexchange.com/questions/312780/why-is-accuracy-not-the-best-measure-for-assessing-classification-models) The proposed procedure will result in a bogus model. It sounds like OP has two unspoken questions: (1) how to measure model quality and (2) how to improve a logistic regression model. – Sycorax Jan 12 '22 at 13:48
  • Having selected that model (inputs etc.), you could retrain the model with train and test data combined. (You would no longer have any data to assess how accurate the model was on unseen data.) – seanv507 Jan 12 '22 at 13:52
  • What is being suggested is not based on good statistical principles. The gold standard optimality criterion is the likelihood; no two-step procedure required. But a lot of confusion has been caused by applying classification error measures to non-classification problems. Probabilities are not "correct" or "incorrect". – Frank Harrell Jan 12 '22 at 14:23
  • What you're suggesting is evocative of, if not equivalent to, [boosting](https://en.wikipedia.org/wiki/Boosting_(machine_learning)), where you train again with weights applied to give more weight to bad misses; I have added the [tag:boosting] tag. While I do not have expertise in boosting methods to expand on this comment, it seems that you can apply this to the proper scoring rules discussed in the other comments, such as the likelihood (equivalent to cross-entropy loss and negative log likelihood loss), not just with hard classifications. – Dave Jan 12 '22 at 14:37
  • You might want to try a non-linear version of LR, such as Kernel Logistic Regression, if it is a non-linear problem. I very much like Frank Harrell's 'Probabilities are not "correct" or "incorrect".' - nicely put! There are circumstances where you *may* be interested in accuracy (see many applications of SVMs), but if you are using LR you probably want probabilities, so you should use a criterion that measures the quality of the estimates of probability, not accuracy. – Dikran Marsupial Jan 12 '22 at 16:07
  • Regularised logistic regression may also be worth a try (may be an over-fitting issue?). – Dikran Marsupial Jan 12 '22 at 16:08
  • OK, I will try an ROC curve to gauge quality. I'll report back if it helps. Incidentally, in my problem the cost of a false positive is the same as the cost of a false negative. – brownie74 Jan 12 '22 at 16:43
  • AUC = 0.906, seems OK, right? Class balance is 936:928 ... Test Accuracy was 83% by the way. – brownie74 Jan 12 '22 at 17:03
  • Keep in mind that by default sklearn is applying an arbitrarily chosen l2 regularization parameter in LogisticRegression models, you should either turn that off or tune it. – Jonny Lomond Jan 12 '22 at 17:19
  • @JonnyLomond When you say "arbitrarily chosen l2 regularization parameter", you mean it chooses "L2" or "L1", right? Not the actual lasso regularisation number. I use LBFGS and L2 by default. You cannot set the L2 regularisation value, only turn it on or off. It's better on, right? I am doing binary classification. – brownie74 Jan 12 '22 at 17:34
  • @JonnyLomond I have played with all the settings and am confident the model is optimized. I have tried many different values. – brownie74 Jan 12 '22 at 17:45
  • In sklearn you can set the l2 penalty, it's called "C" in the model params. Edit: my mistake I didn't see your update – Jonny Lomond Jan 12 '22 at 17:53
  • I am now using AUC to gauge the quality of the model, and by increasing the size of the training set I have increased AUC from 0.9 to 0.93, and test accuracy has gone to 85%. I'm happy with that for this evening's efforts. Thanks everyone for your inputs. – brownie74 Jan 12 '22 at 17:53

1 Answer

  • As noted by others, accuracy is not the best metric for judging the quality of the model.
  • Logistic regression predicts probabilities, so to calculate accuracy you must have used some threshold for making hard classifications. If you used the "default" $p>0.5$, it is not necessarily an optimal choice. There are many methods for picking the threshold; this is something you could tune (see the sketch below this list).
  • As for the parameters of the model, you have already found the optimal parameters given the data, model, and hyperparameters, so there is nothing to improve further. If the results are not satisfactory, you can tune the hyperparameters or try a different model (either another model type altogether, or a further refinement of logistic regression, perhaps with a richer set of features).
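A rough illustration of the second point (not the only way to do it; `model`, `X_val`, and `y_val` are placeholders for a fitted model and a held-out validation set):

```python
# Judge the probability estimates directly with threshold-free metrics, and,
# if a hard classification is really needed, tune the threshold on held-out
# data instead of assuming 0.5.
import numpy as np
from sklearn.metrics import accuracy_score, log_loss, roc_auc_score

proba = model.predict_proba(X_val)[:, 1]  # predicted P(y = 1)

print("log loss:", log_loss(y_val, proba))
print("ROC AUC :", roc_auc_score(y_val, proba))

# Pick the threshold that optimises the relevant cost on validation data
# (plain accuracy here, since the question says false positives and false
# negatives cost the same).
thresholds = np.linspace(0.05, 0.95, 91)
accs = [accuracy_score(y_val, (proba >= t).astype(int)) for t in thresholds]
best_t = thresholds[int(np.argmax(accs))]
print("best threshold:", best_t, "accuracy at that threshold:", max(accs))
```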
Tim