I am reading Cawley and Talbot and came across a post on implementing nested CV (NCV), as well as numerous posts with good answers on the topic (general NCV, training on the full dataset after NCV, how to choose a model after NCV).
I wrote a script that performs NCV on a 'visible' set after splitting the data into 'visible' and 'hidden' parts. The model is then refit on the entire visible set, which is usually where the modeling process ends; here, though, the refit model is also tested on the 'hidden' set, which plays the role of truly unseen data. The error on the hidden set should roughly match the generalization error estimated by NCV on the visible set.
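For concreteness, here is a minimal sketch of that procedure in scikit-learn. It is not my actual script: the estimator (ridge regression), the search grid, and the fold counts are placeholders.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import (GridSearchCV, KFold, cross_val_score,
                                     train_test_split)

X, y = load_diabetes(return_X_y=True)

# Split into a 'visible' set (used for NCV) and a 'hidden' set
# (stands in for truly unseen data).
X_vis, X_hid, y_vis, y_hid = train_test_split(X, y, test_size=0.3,
                                              random_state=0)

param_grid = {"alpha": np.logspace(-3, 3, 13)}  # assumed search space
inner_cv = KFold(n_splits=5, shuffle=True, random_state=1)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=2)

# Inner loop tunes hyperparameters; outer loop estimates generalization error.
tuned = GridSearchCV(Ridge(), param_grid, cv=inner_cv,
                     scoring="neg_mean_squared_error")
ncv_scores = -cross_val_score(tuned, X_vis, y_vis, cv=outer_cv,
                              scoring="neg_mean_squared_error")
print("Est  MSE: %.3f +/- %.3f" % (ncv_scores.mean(), ncv_scores.std()))

# Refit on the whole visible set, then score once on the hidden set.
tuned.fit(X_vis, y_vis)
true_mse = mean_squared_error(y_hid, tuned.predict(X_hid))
print("True MSE: %.3f" % true_mse)
```

The same skeleton should carry over to the classification experiment further down; only the estimator and the scoring change.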
There are many knobs to turn that affect the estimation accuracy: e.g., the number of inner folds, the number of outer folds, the number of iterations (of the full process), or the choice of metric. We can vary them here and see what happens, but what should one do on a real dataset? Can I use this script on a subset of the real dataset to find settings that give accurate estimates, and expect those settings to carry over to unseen data?
Output:

MSE:
  Est  3044.6858062 +/- 69.0466700133
  True 3051.32410389 +/- 427.344654478
MAE:
  Est  39.822042199 +/- 0.621534888251
  True 39.2236773233 +/- 6.39557352798
R2:
  Est  0.479608389929 +/- 0.0169472816234
  True 0.463311879751 +/- 0.113470834353
Update:
The results above are from the diabetes dataset, which has a continuous response variable. I repeated the experiment on the breast cancer dataset, which has a binary response:
pr (area under the PR curve):
  Est  0.986205196083 +/- 0.00241897261673
  True 0.989708379454 +/- 0.0107812698088
ll (negative log loss):
  Est  0.173785240219 +/- 0.00888222771163
  True 0.164129744679 +/- 0.0802308030923
roc (area under the ROC curve):
  Est  0.978565008631 +/- 0.00248749514519
  True 0.984685331718 +/- 0.0161458827103
bri (Brier score):
  Est  0.0526782015835 +/- 0.00296152695846
  True 0.0505493562734 +/- 0.0290771904303
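For reference, the four metrics above map onto built-in scikit-learn scorers. A minimal sketch showing just the scorer names (plain CV rather than the full nested procedure; the pipeline and estimator are placeholders, not my actual script):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Built-in scorer names matching the four metrics reported above.
scoring = {
    "pr": "average_precision",  # average precision, a stand-in for PR AUC
    "ll": "neg_log_loss",       # log loss (sign-flipped by sklearn)
    "roc": "roc_auc",           # area under the ROC curve
    "bri": "neg_brier_score",   # Brier score (sign-flipped by sklearn)
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
res = cross_validate(clf, X, y, cv=cv, scoring=scoring)
for name in scoring:
    scores = res["test_" + name]
    print("%-4s %.4f +/- %.4f" % (name, scores.mean(), scores.std()))
```

Note that sklearn negates loss-type scorers ("neg_log_loss", "neg_brier_score") so that larger is always better; flip the sign back when reporting the raw losses.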