First post! If you don't want to read the background you can skip to the Problem heading below.
Background
Hello everyone, I'm a physics student doing physics education research. My professor wants to build a model to classify students who are at risk of failing a 15-week course. Getting the data and putting it into the right format was the tedious part, but we now have data for N = 10000 students with both in-class (HW 1 Score, Quiz 1 Score, Lab 1 Score, all the way up to Midterm 1) and out-of-class (gender, age, high school scores, SAT scores, ACT scores, current college GPA, Major, calculus scores) features.
Usually we perform ad-hoc analysis in Excel (I know, education research still uses ancient technology for analysis), but since I've used Python for loads of physics-related simulations, my professor gave this to me as a project for the summer.
From my fair share of reading books, online posts, and papers, here are some conclusions I've drawn:
- Cross-validation is used to estimate the generalization performance of a model.
- Nested cross-validation contains two cross-validation loops: the inner CV loop is used for the grid search (e.g. K-Fold CV for every candidate model) and the outer loop is used to measure model performance. Reference
Both CV and nested CV test the generalization of a model, but nested CV is more robust because it also gives insight into model stability (i.e. the variance of the generalization estimate as the training data is sliced differently), which could be a factor in algorithmic selection.
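To check my own understanding of the mechanics, here is a minimal sketch of nested CV in scikit-learn. The SVC, its parameter grid, the AUC scoring, and the toy data are placeholders for illustration, not my actual setup:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Toy data standing in for the real student features and pass/fail labels.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # hyperparameter search
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)  # generalization estimate

# Inner loop: GridSearchCV tunes the hyperparameters on each outer-training fold.
grid = GridSearchCV(SVC(),
                    param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.01]},
                    cv=inner_cv, scoring="roc_auc")

# Outer loop: every outer fold scores a freshly tuned model on data it never saw,
# so the spread of outer_scores is one view of model stability.
outer_scores = cross_val_score(grid, X, y, cv=outer_cv, scoring="roc_auc")
print(outer_scores.mean(), outer_scores.std())
```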
Given scikit-learn's nested CV example and this implementation of nested CV here, the following is what I currently have for my workflow.
Problem
Task: Create an early student-failure detection algorithm that, by week five of the course, predicts the probability that a student will fail.
Data: N = 10000 students with both in-class (HW 1 Score, Quiz 1 Score, Lab 1 Score, all the way up to Midterm 1) and out-of-class (gender, age, high school scores, SAT scores, ACT scores, current college GPA, Major, calculus scores) features.
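Since the features mix numeric scores with categoricals like gender and Major, here is a rough sketch of the kind of preprocessing I have in mind; the column names are placeholders standing in for the real ones:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names mirroring the feature list above.
numeric_cols = ["hw1_score", "quiz1_score", "lab1_score", "midterm1", "age",
                "hs_score", "sat", "act", "college_gpa", "calculus_score"]
categorical_cols = ["gender", "major"]

preprocess = ColumnTransformer([
    # Impute missing numeric values, then put all the scores on a common scale.
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    # One-hot encode the categorical features, ignoring unseen categories.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
```

The reason I'd wrap this in a transformer rather than preprocessing the whole dataset up front is so it can sit inside a Pipeline and be refit on each CV fold.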
Steps:
- Split data between training and testing sets (80% training, 20% testing)
- Algorithmic Selection (SVC, Random Forest, NN, Logistic Regression)
- Nested CV on the training data (see the nested-CV sketch after this list):
  - Stratified K-Fold CV on the training data to get inner_training and inner_validation sets (outer CV layer)
  - Preprocess the data (scaling, dealing with NA values, dealing with categorical features, etc.)
  - GridSearch the model on the inner_training data (inner CV layer)
  - Evaluate the best-performing GridSearch model on the inner_validation data
  - Get the unbiased prediction error for this model
- Repeat for all algorithms
- Select the best algorithm based on the unbiased prediction error (let's assume Random Forest is the best algorithm)
- Run a loop over the hyperparameters of your chosen model (Random Forest) (see the final-tuning sketch after this list):
  - Split the training data using some cross-validation
  - Preprocess the data
  - Fit your model with the parameters for this iteration
  - Find the best model parameters
- Now that you've selected the model with the best hyperparameters, retrain it on your entire training dataset and finally predict on the test set. This gives you an overall indicator of your model's performance, and because of the previous steps you can trust that you've built the most robust and desirable model, one that will give you the best generalization on any future data.
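Here is the nested-CV / algorithm-selection sketch referenced in the steps above. `X` and `y` stand for the student feature DataFrame and a 0/1 fail label, `preprocess` is the ColumnTransformer from the earlier sketch, and the parameter grids and AUC scoring are placeholders rather than settled choices:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     cross_val_score, train_test_split)
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# 80/20 split, stratified so both sets keep the same fail rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Candidate algorithms with (placeholder) hyperparameter grids.
candidates = {
    "logreg": (LogisticRegression(max_iter=1000), {"clf__C": [0.1, 1, 10]}),
    "svc": (SVC(), {"clf__C": [0.1, 1, 10]}),
    "rf": (RandomForestClassifier(), {"clf__n_estimators": [100, 300],
                                      "clf__max_depth": [None, 5]}),
}

for name, (clf, grid) in candidates.items():
    # Preprocessing lives inside the pipeline, so it is refit on every
    # inner-training fold and never sees that fold's validation data.
    pipe = Pipeline([("prep", preprocess), ("clf", clf)])
    search = GridSearchCV(pipe, grid, cv=inner_cv, scoring="roc_auc")
    scores = cross_val_score(search, X_train, y_train,
                             cv=outer_cv, scoring="roc_auc")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```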
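And here is the final-tuning sketch: once an algorithm is chosen (say Random Forest), tune it once more on the full training set, let GridSearchCV refit the best combination on all of it, and take a single look at the held-out test set:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

final_pipe = Pipeline([("prep", preprocess),
                       ("clf", RandomForestClassifier(random_state=0))])
param_grid = {"clf__n_estimators": [100, 300], "clf__max_depth": [None, 5, 10]}

# refit=True (the default) retrains the best parameter combination
# on the entire training set after the search finishes.
final_search = GridSearchCV(final_pipe, param_grid,
                            cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=2),
                            scoring="roc_auc")
final_search.fit(X_train, y_train)

# Single, final evaluation: predicted failure probabilities on the test set.
test_proba = final_search.predict_proba(X_test)[:, 1]
print("best params:", final_search.best_params_)
print("test AUC:", roc_auc_score(y_test, test_proba))
```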
And you've done it. You've ended world hunger. The world rejoices. Kids prefer you to come down the chimney instead of Santa. Your dreams are no longer dreams but a reality that you've molded together.
There is also a paper on my topic specifically here if you want a little bit more context. This paper is why my professor wanted me to do research on this subject.
To be a bit more specific here are the questions I have:
- Does this workflow correctly implement cross-validation and the process of
- algorithm selection
- model tuning
- data preprocessing
- final model selection
- If I wanted to test changes in the way I preprocess my data (e.g. 5-NN imputation, 10-NN imputation, dropping null values), does that mean I need another cross-validation layer to loop through those options, or can I treat them as different hyperparameters? (There is a sketch of the hyperparameter version below.)
- When testing for model stability, do I report the average of the standard deviations from each outer fold, or is it better to concatenate the predictions of each inner fold and take the standard deviation of those? (I've tried to make both options concrete in a second sketch below.)
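For the preprocessing question, the version I have sketched treats the imputer as just another pipeline step that GridSearchCV is allowed to swap out, so the imputation choice gets tuned inside the same CV loops as the model hyperparameters (I left out the drop-null-rows option because I don't see how a row-dropping step would fit inside a scikit-learn Pipeline). The step names and grids are placeholders:

```python
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

pipe = Pipeline([("impute", KNNImputer()),   # placeholder step; the grid swaps it out
                 ("clf", LogisticRegression(max_iter=1000))])

# Each preprocessing variant becomes one more dimension of the search.
param_grid = [
    {"impute": [KNNImputer()], "impute__n_neighbors": [5, 10],
     "clf__C": [0.1, 1, 10]},
    {"impute": [SimpleImputer(strategy="median")], "clf__C": [0.1, 1, 10]},
]

search = GridSearchCV(pipe, param_grid,
                      cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
                      scoring="roc_auc")
# search.fit(X_train_numeric, y_train)  # numeric-only feature matrix (placeholder name)
```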
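To make the stability question more concrete, here is roughly how I would compute the two quantities I'm asking about, reusing `preprocess`, `X_train`, and `y_train` from the sketches above (I may well be mixing up the inner and outer folds here, which is part of my confusion):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     cross_val_predict, cross_validate)
from sklearn.pipeline import Pipeline

# Same kind of tuned pipeline as in the sketches above (placeholder grid).
search = GridSearchCV(
    Pipeline([("prep", preprocess), ("clf", RandomForestClassifier(random_state=0))]),
    {"clf__n_estimators": [100, 300]},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="roc_auc")

outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Keep the fitted GridSearchCV from every outer fold so we can look inside it.
res = cross_validate(search, X_train, y_train, cv=outer_cv,
                     scoring="roc_auc", return_estimator=True)

# Option 1: each outer fold's GridSearchCV reports an inner-CV standard
# deviation for the parameters it picked; average those over the outer folds.
inner_stds = [est.cv_results_["std_test_score"][est.best_index_]
              for est in res["estimator"]]
print("mean of inner-CV stds:", np.mean(inner_stds))
print("spread of the outer-fold scores:", res["test_score"].std())

# Option 2: concatenate the held-out predicted probabilities from each fold
# into one vector and take the standard deviation of that pooled set.
oof_proba = cross_val_predict(search, X_train, y_train, cv=outer_cv,
                              method="predict_proba")[:, 1]
print("std of pooled predictions:", oof_proba.std())
```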
Any and all thoughts are appreciated. Since I come from a physics background and not a statistical one, I assume there are huge flaws in my reasoning. I also like reading papers, if you have any on the right way to do cross-validation.