
I ran a logistic regression on a data set of 3,700 patients. I have 9 predictor variables, and my outcome is the presence or absence of a disease. I obtained the regression coefficients and predicted probabilities. When I apply this model to another data set, no matter what I do, the area under the ROC curve does not go above 56%.

I am assuming that my model is underfitting. How can I improve it and reduce the high bias? Is there a way to calculate the bias in software? How can I fix this underfitting in software?

Thank you very much to anyone who provides a solution.

Faiz_Yusufi

1 Answer


From what you describe, it is hard to say whether the model is under-fitting. It could even be over-fitting. I would suggest plotting a learning curve to diagnose the problem:

How to know if a learning curve from SVM model suffers from bias or variance?
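
As a rough illustration of the idea, here is a minimal sketch in Python with scikit-learn (my choice of software, not something the question specifies). The data are synthetic stand-ins generated with `make_classification`, and the number of training sizes and folds are arbitrary assumptions. The usual reading is: if the training and validation AUC are both low and close together, that points to high bias (under-fitting); if the training AUC is much higher than the validation AUC, that points to high variance (over-fitting).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the patient data (assumption): 3700 rows, 9 predictors.
X, y = make_classification(
    n_samples=3700, n_features=9, n_informative=5, random_state=0
)

model = LogisticRegression(max_iter=1000)

# Fit the model on growing subsets of the training folds and score each fit
# on both the subset it was trained on and the held-out fold.
train_sizes, train_scores, val_scores = learning_curve(
    model,
    X,
    y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
    scoring="roc_auc",
    shuffle=True,
    random_state=0,
)

# Average over the cross-validation folds for each training size.
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)

for n, tr, va in zip(train_sizes, train_mean, val_mean):
    print(f"n={int(n):5d}  train AUC={tr:.3f}  validation AUC={va:.3f}")
```

Plotting `train_mean` and `val_mean` against `train_sizes` (for example with matplotlib) gives the learning curve itself. The training and validation sets do not need to be the same size for this to work.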

Suppose you have verified that it is under-fitting. Basis expansion can then be used to make the model more flexible, which reduces bias at the cost of higher variance. The expansion can be a polynomial expansion or a spline expansion. Details and examples can be found in

Why are there large coefficents for higher-order polynomial
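
To make the basis-expansion idea concrete, here is a small sketch, again assuming Python with scikit-learn. The data are simulated so that the outcome depends on a squared term and an interaction (an assumption chosen purely to show the effect, not a claim about the patient data). A plain logistic regression is compared with one fitted on a degree-2 polynomial expansion of the same 9 predictors.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Simulated data (assumption): the outcome depends on x0^2 and x1*x2,
# which a model that is linear in the raw predictors cannot represent.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 9))
signal = X[:, 0] ** 2 + X[:, 1] * X[:, 2]
y = (signal + rng.normal(scale=0.5, size=2000) > 1.0).astype(int)

# Baseline: logistic regression on the raw (standardized) predictors.
linear = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))

# Basis expansion: add all squares and pairwise products before fitting.
expanded = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=2, include_bias=False),
    LogisticRegression(max_iter=2000),
)

# Compare the two models by 5-fold cross-validated AUC.
auc_linear = cross_val_score(linear, X, y, cv=5, scoring="roc_auc").mean()
auc_expanded = cross_val_score(expanded, X, y, cv=5, scoring="roc_auc").mean()

print(f"raw predictors:     AUC = {auc_linear:.3f}")
print(f"degree-2 expansion: AUC = {auc_expanded:.3f}")
```

On real data the gain depends on whether the true relationship is actually nonlinear, so the expanded model should be checked on held-out data. A spline expansion could be tried the same way by swapping `PolynomialFeatures` for `SplineTransformer` from `sklearn.preprocessing`.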

Haitao Du
  • Thank you very much Sir. Can we still apply learning curves for training and test data when the sample size is different for both? – Faiz_Yusufi Nov 16 '16 at 13:17
  • @Faiz_Yusufi Yes. I think the training and testing sets do not need to have the same size. – Haitao Du Nov 16 '16 at 14:05
  • Ok Sir. I will learn about the learning curve and apply it to my data. Can I please have some way to contact you to ask for more assistance and inform you about my results? – Faiz_Yusufi Nov 17 '16 at 19:44
  • @Faiz_Yusufi I do not think that is the way Stack Exchange works. People may be busy with other issues and may not want to leave contact info. – Haitao Du Nov 17 '16 at 19:46
  • No problem Sir. I will work on my problem and update the progress here. Thank you so much. – Faiz_Yusufi Nov 17 '16 at 20:26
  • I think my model is overfitting, from what the two links explain. So, how do I tackle this problem now? Can I transform my regression coefficients to some suitable value? What is the ideal method to overcome overfitting, and in which software? – Faiz_Yusufi Nov 17 '16 at 20:30
  • @Faiz_Yusufi This post (another answer of mine) explains how to fix overfitting in logistic regression. http://stats.stackexchange.com/questions/228763/regularization-methods-for-logistic-regression/228785#228785 – Haitao Du Nov 17 '16 at 21:22
  • Again thank you very much Sir. I will update my progress here once I learn and apply it. – Faiz_Yusufi Nov 18 '16 at 15:08
  • In my data I have 8 variables and a binary outcome variable for the two data sets. The predicted probabilities for the training data after logistic regression are properly distributed between 0 and 1, while for the test data all the probabilities are either exactly 0 or 1. Can you please explain how, and with which variables and software, I can plot the learning curve? I do not know the procedure. Thank you. – Faiz_Yusufi Nov 19 '16 at 15:28
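
On the over-fitting follow-up in the comments, a regularized logistic regression might be sketched as below. This assumes Python with scikit-learn and uses synthetic data with 8 predictors as a stand-in. The L2 penalty strength is picked by cross-validation, and the test AUC is computed from predicted probabilities (`predict_proba`) rather than from hard 0/1 class labels. The split ratio and the grid of 10 candidate penalties are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 8 predictors, binary outcome.
X, y = make_classification(
    n_samples=3700, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# L2-penalized logistic regression (the default penalty); the inverse
# penalty strength C is chosen from 10 candidates by 5-fold CV on AUC.
model = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(Cs=10, cv=5, scoring="roc_auc", max_iter=1000),
)
model.fit(X_train, y_train)

# Score the held-out set using probabilities of the positive class.
test_prob = model.predict_proba(X_test)[:, 1]
test_auc = roc_auc_score(y_test, test_prob)

chosen_C = float(np.ravel(model.named_steps["logisticregressioncv"].C_)[0])
print(f"chosen C = {chosen_C:.4g}, test AUC = {test_auc:.3f}")
```

A smaller chosen `C` means stronger shrinkage of the coefficients. The scaler is fitted on the training data only and then reused on the test data, so the two sets are put on the same scale.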