For the logistic model, I randomly split the dataset of 360 observations into a training sample (70%) and a test sample (the remaining 30%). I then built a logistic regression model on the training sample; the area under the ROC curve (AUC) comes out around 82% and the Hosmer-Lemeshow (HL) test also looks good (H0 not rejected, p-value = 0.675) with 11 variables selected out of 67. But when I scored the same model on the test sample, the AUC dropped to around 71%, although the HL test still came out fine. Could you please tell me what the reason for the AUC drop from 82% to 71% might be, and what the solution would be?
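To make the setup concrete, here is a minimal sketch of the 70/30 split, the logistic fit, and the train-vs-test AUC and Hosmer-Lemeshow comparison. It assumes the data sit in a pandas DataFrame named `df` with a binary target column named `y` (both names are placeholders, not from the question), and it hand-rolls the HL test since scikit-learn does not provide one.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def hosmer_lemeshow(y_true, y_prob, groups=10):
    """Hosmer-Lemeshow goodness-of-fit test: bin observations by predicted
    probability and compare observed vs. expected event counts per bin."""
    d = pd.DataFrame({"y": np.asarray(y_true), "p": np.asarray(y_prob)})
    d["bin"] = pd.qcut(d["p"], q=groups, duplicates="drop")
    g = d.groupby("bin", observed=True)
    obs, exp, n = g["y"].sum(), g["p"].sum(), g.size()
    stat = (((obs - exp) ** 2) / (exp * (1 - exp / n))).sum()
    dof = len(n) - 2
    return stat, chi2.sf(stat, dof)

# df: DataFrame with the 67 predictors and a 0/1 target column "y" (placeholder names)
X, y = df.drop(columns="y"), df["y"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

for name, Xs, ys in [("train", X_train, y_train), ("test", X_test, y_test)]:
    p = model.predict_proba(Xs)[:, 1]
    hl_stat, hl_p = hosmer_lemeshow(ys, p)
    print(f"{name}: AUC={roc_auc_score(ys, p):.3f}  HL p-value={hl_p:.3f}")
```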
- It might be overfitting your training data, because you have relatively few samples compared to the number of variables. Try adding some L2 regularization and use cross-validation to find the right amount [sketched after the comments]. – stmax Oct 01 '15 at 06:40
- An AUC of 71% is pretty good. The drop is most likely due to overfitting. It could also be caused by some unique features of the validation dataset, such as outliers. Have you examined the distributions of the covariates between the training and validation datasets as well? It is possible that there are systematic differences between the two datasets by chance [one such check is sketched after the comments]. – StatsStudent Oct 01 '15 at 07:20
- Just to clarify, whether an AUC of 71% is good depends on the subject matter. In the social sciences it can be pretty good, but in chemistry, physics, etc. it could be fair to poor. – StatsStudent Oct 01 '15 at 08:05
- @StatsStudent, I have already taken care of outliers and applied outlier treatments. For the overfitting problem, I have revisited the model at least 5 times and the present model looks fine. Could you please suggest what else I can do about overfitting and to increase the AUC on the test sample? – user43247 Oct 05 '15 at 09:52
- Here is something you could try. After you have split your data into a training dataset (TD) and a validation dataset (VD), use a bootstrap aggregating ("bagging") procedure for variable selection. With this method you generate, say, 10K replicated datasets by drawing samples with replacement from your TD so that each replicate is the same size as the original TD. Perform an automated variable selection method on each replicated dataset, keeping track of which variables are deemed significant in the process. After iterating through the 10K samples, (continued) . . . – StatsStudent Oct 06 '15 at 02:54
- (continued from above) . . . determine which variables/coefficients have been selected most of the time (say 90%). Then include those in your final model. Validate the model against your validation dataset. You can even bootstrap your validation data as well and obtain the average AUC. See how well your model performs doing this [a sketch of this procedure appears after the comments]. – StatsStudent Oct 06 '15 at 02:56
- Another approach is to find additional variables that may be more predictive. It's quite possible that, with the variables you have, 71% is the best you can do. There is no guarantee that you will achieve an AUC of 80% or so with your validation data just because you did so with your training data. It's possible your predictors just aren't very predictive. – StatsStudent Oct 06 '15 at 03:05
- With so few data you should not use split-data validation [a resampling-based alternative is sketched after the comments]; see https://stats.stackexchange.com/questions/50609/validation-data-splitting-into-training-vs-test-datasets – kjetil b halvorsen May 03 '21 at 20:02
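On stmax's suggestion of L2 regularization with the strength chosen by cross-validation, a minimal sketch using scikit-learn's `LogisticRegressionCV`; `X_train`, `y_train`, `X_test`, `y_test` are assumed to come from a split like the one sketched under the question.

```python
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Search a grid of L2 penalty strengths with 10-fold CV, scoring by AUC.
# Standardizing first matters because the penalty is scale-sensitive.
clf = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(Cs=20, cv=10, penalty="l2",
                         scoring="roc_auc", max_iter=5000),
)
clf.fit(X_train, y_train)
print("chosen C:", clf.named_steps["logisticregressioncv"].C_)
print("test AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
```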
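On StatsStudent's question about whether the covariates are distributed differently in the training and validation sets, one simple screen (my own choice of concrete check, not something prescribed in the thread) is a two-sample Kolmogorov-Smirnov test per numeric predictor:

```python
from scipy.stats import ks_2samp

# Flag numeric predictors whose train/test distributions differ noticeably.
# With 67 predictors, expect a few small p-values by chance alone.
for col in X_train.select_dtypes(include="number").columns:
    stat, p = ks_2samp(X_train[col], X_test[col])
    if p < 0.05:
        print(f"{col}: KS statistic={stat:.3f}, p={p:.3f}")
```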
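A sketch of the bootstrap variable-selection procedure StatsStudent outlines: resample the training data with replacement, run a selection step on each replicate, and keep the variables selected in at least 90% of replicates. The L1-penalized fit used as the per-replicate selection step is my stand-in for "an automated variable selection method", and the replicate count is reduced from the suggested 10K for speed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_boot = 500                      # StatsStudent suggests ~10K; fewer here for speed
counts = np.zeros(X_train.shape[1])

for _ in range(n_boot):
    # Bootstrap replicate of the training data, same size as the original.
    idx = rng.choice(len(X_train), size=len(X_train), replace=True)
    Xb, yb = X_train.iloc[idx], y_train.iloc[idx]
    # L1-penalized fit as the per-replicate selection step (an assumption;
    # any automated selection method could be substituted here).
    sel = make_pipeline(
        StandardScaler(),
        LogisticRegression(penalty="l1", solver="liblinear", C=0.1, max_iter=1000),
    ).fit(Xb, yb)
    coefs = sel.named_steps["logisticregression"].coef_.ravel()
    counts += (coefs != 0)

freq = counts / n_boot
selected = X_train.columns[freq >= 0.9]   # keep variables chosen >= 90% of the time
print("selected variables:", list(selected))

final = LogisticRegression(max_iter=1000).fit(X_train[selected], y_train)
```

The penalty strength C=0.1 and the 90% threshold are illustrative and would need tuning in practice.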
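Finally, on kjetil b halvorsen's point that a single 70/30 split is unreliable with only 360 observations, one resampling-based alternative is a repeated stratified cross-validation estimate of the AUC (the linked thread also discusses bootstrap validation; this is just one possible sketch, reusing `X` and `y` from the setup sketch above):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# 10-fold CV repeated 20 times gives a more stable AUC estimate than one split.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=20, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         scoring="roc_auc", cv=cv)
print(f"AUC: {scores.mean():.3f} +/- {scores.std():.3f} over {len(scores)} folds")
```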