First post! If you don't want to read the background you can skip to the Problem heading below.
Background
Hello everyone, I'm a physics student doing physics education research. My professor wants to build a model to classify students who are at risk of failing a 15-week course. Getting the data and putting it into the right format was the tedious part, but we now have data for N = 10000 students with both in-class (HW 1 Score, Quiz 1 Score, Lab 1 Score, all the way up to Midterm 1) and out-of-class (gender, age, high school scores, SAT scores, ACT scores, current college GPA, Major, calculus scores) features.
Usually we perform ad-hoc analysis in Excel (I know, education research still uses ancient technology for analysis), but since I've used Python for loads of physics-related simulations, my professor gave this to me as a project for the summer.
From my fair share of reading books, online posts, and papers, here are some conclusions I've drawn:
- Cross-validation is used to estimate the generalization performance of a model.
- Nested cross-validation contains two cross-validation loops: the inner CV loop is used for the grid search (e.g. K-Fold CV for every candidate model) and the outer loop is used to measure model performance. Reference
Both CV and nested CV test the generalization of a model, but nested CV is more robust because it also gives insight into model stability (i.e. the variance of the generalization estimate as the training data is sliced differently), which could be a factor in algorithmic selection.
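To check my own understanding of the mechanics, here is a minimal sketch of nested CV in scikit-learn. The SVC, its parameter grid, the AUC scoring, and the toy data are placeholders for illustration, not my actual setup:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Toy data standing in for the real student features and pass/fail labels.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # hyperparameter search
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)  # generalization estimate

# Inner loop: GridSearchCV tunes the hyperparameters on each outer-training fold.
grid = GridSearchCV(SVC(),
                    param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.01]},
                    cv=inner_cv, scoring="roc_auc")

# Outer loop: every outer fold scores a freshly tuned model on data it never saw,
# so the spread of outer_scores is one view of model stability.
outer_scores = cross_val_score(grid, X, y, cv=outer_cv, scoring="roc_auc")
print(outer_scores.mean(), outer_scores.std())
```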
Given scikit-learn's nested CV example and this implementation of nested CV here, the following is what I currently have for my workflow.
Problem
Task: Create an early student-failure detection algorithm that, by week five of the course, predicts the probability that a student will fail.
Data: N = 10000 students with both in-class (HW 1 Score, Quiz 1 Score, Lab 1 Score, all the way up to Midterm 1) and out-of-class (gender, age, high school scores, SAT scores, ACT scores, current college GPA, Major, calculus scores) features.
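Since the features mix numeric scores with categoricals like gender and Major, here is a rough sketch of the kind of preprocessing I have in mind; the column names are placeholders standing in for the real ones:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names mirroring the feature list above.
numeric_cols = ["hw1_score", "quiz1_score", "lab1_score", "midterm1", "age",
                "hs_score", "sat", "act", "college_gpa", "calculus_score"]
categorical_cols = ["gender", "major"]

preprocess = ColumnTransformer([
    # Impute missing numeric values, then put all the scores on a common scale.
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    # One-hot encode the categorical features, ignoring unseen categories.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
```

The reason I'd wrap this in a transformer rather than preprocessing the whole dataset up front is so it can sit inside a Pipeline and be refit on each CV fold.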
Steps:
- Split data between training and testing sets (80% training, 20% testing)
- Algorithmic Selection (SVC, Random Forest, NN, Logistic Regression)
- Nested CV on the training data (see the nested-CV sketch after this list):
  - Stratified K-Fold CV on the training data to get inner_training and inner_validation sets (outer CV layer)
  - Preprocess the data (scaling, dealing with NA values, dealing with categorical features, etc.)
  - GridSearch the model on the inner_training data (inner CV layer)
  - Evaluate the best-performing GridSearch model on the inner_validation data
  - Get the unbiased prediction error for this model
- Repeat for all algorithms
- Select the best algorithm based on the unbiased prediction error (let's assume Random Forest is the best algorithm)
- Run a loop over the hyperparameters of your chosen model (Random Forest) (see the final-tuning sketch after this list):
  - Split the training data using some cross-validation
  - Preprocess the data
  - Fit your model with the parameters for this iteration
  - Find the best model parameters
- Now that you've selected the model with the best hyperparameters, retrain it on your entire training dataset and finally predict on the test set. This gives you an overall indicator of your model's performance, and because of the previous steps you can trust that you've built the most robust and desirable model, one that will give you the best generalization on any future data.
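Here is the nested-CV / algorithm-selection sketch referenced in the steps above. `X` and `y` stand for the student feature DataFrame and a 0/1 fail label, `preprocess` is the ColumnTransformer from the earlier sketch, and the parameter grids and AUC scoring are placeholders rather than settled choices:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     cross_val_score, train_test_split)
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# 80/20 split, stratified so both sets keep the same fail rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Candidate algorithms with (placeholder) hyperparameter grids.
candidates = {
    "logreg": (LogisticRegression(max_iter=1000), {"clf__C": [0.1, 1, 10]}),
    "svc": (SVC(), {"clf__C": [0.1, 1, 10]}),
    "rf": (RandomForestClassifier(), {"clf__n_estimators": [100, 300],
                                      "clf__max_depth": [None, 5]}),
}

for name, (clf, grid) in candidates.items():
    # Preprocessing lives inside the pipeline, so it is refit on every
    # inner-training fold and never sees that fold's validation data.
    pipe = Pipeline([("prep", preprocess), ("clf", clf)])
    search = GridSearchCV(pipe, grid, cv=inner_cv, scoring="roc_auc")
    scores = cross_val_score(search, X_train, y_train,
                             cv=outer_cv, scoring="roc_auc")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```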
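And here is the final-tuning sketch: once an algorithm is chosen (say Random Forest), tune it once more on the full training set, let GridSearchCV refit the best combination on all of it, and take a single look at the held-out test set:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

final_pipe = Pipeline([("prep", preprocess),
                       ("clf", RandomForestClassifier(random_state=0))])
param_grid = {"clf__n_estimators": [100, 300], "clf__max_depth": [None, 5, 10]}

# refit=True (the default) retrains the best parameter combination
# on the entire training set after the search finishes.
final_search = GridSearchCV(final_pipe, param_grid,
                            cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=2),
                            scoring="roc_auc")
final_search.fit(X_train, y_train)

# Single, final evaluation: predicted failure probabilities on the test set.
test_proba = final_search.predict_proba(X_test)[:, 1]
print("best params:", final_search.best_params_)
print("test AUC:", roc_auc_score(y_test, test_proba))
```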
And you've done it. You've ended world hunger. The world rejoices. Kids prefer you to come down the chimney instead of Santa. Your dreams are no longer dreams but a reality that you've molded together.
There is also a paper on my topic specifically here if you want a little bit more context. This paper is why my professor wanted me to do research on this subject.
To be a bit more specific here are the questions I have:
- Does this workflow correctly implement cross-validation and the process of
- algorithm selection
- model tuning
- data preprocessing
- final model selection
- If I wanted to test changes in the way I preprocess my data (e.g. 5-NN imputation, 10-NN imputation, dropping null values), does that mean I need another cross-validation layer to loop through those options, or can I treat them as different hyperparameters? (There is a sketch of the hyperparameter version below.)
- When testing for model stability, do I report the average of the standard deviations from each outer fold, or is it better to concatenate the predictions of each inner fold and take the standard deviation of those? (I've tried to make both options concrete in a second sketch below.)
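For the preprocessing question, the version I have sketched treats the imputer as just another pipeline step that GridSearchCV is allowed to swap out, so the imputation choice gets tuned inside the same CV loops as the model hyperparameters (I left out the drop-null-rows option because I don't see how a row-dropping step would fit inside a scikit-learn Pipeline). The step names and grids are placeholders:

```python
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

pipe = Pipeline([("impute", KNNImputer()),   # placeholder step; the grid swaps it out
                 ("clf", LogisticRegression(max_iter=1000))])

# Each preprocessing variant becomes one more dimension of the search.
param_grid = [
    {"impute": [KNNImputer()], "impute__n_neighbors": [5, 10],
     "clf__C": [0.1, 1, 10]},
    {"impute": [SimpleImputer(strategy="median")], "clf__C": [0.1, 1, 10]},
]

search = GridSearchCV(pipe, param_grid,
                      cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
                      scoring="roc_auc")
# search.fit(X_train_numeric, y_train)  # numeric-only feature matrix (placeholder name)
```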
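To make the stability question more concrete, here is roughly how I would compute the two quantities I'm asking about, reusing `preprocess`, `X_train`, and `y_train` from the sketches above (I may well be mixing up the inner and outer folds here, which is part of my confusion):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     cross_val_predict, cross_validate)
from sklearn.pipeline import Pipeline

# Same kind of tuned pipeline as in the sketches above (placeholder grid).
search = GridSearchCV(
    Pipeline([("prep", preprocess), ("clf", RandomForestClassifier(random_state=0))]),
    {"clf__n_estimators": [100, 300]},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="roc_auc")

outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Keep the fitted GridSearchCV from every outer fold so we can look inside it.
res = cross_validate(search, X_train, y_train, cv=outer_cv,
                     scoring="roc_auc", return_estimator=True)

# Option 1: each outer fold's GridSearchCV reports an inner-CV standard
# deviation for the parameters it picked; average those over the outer folds.
inner_stds = [est.cv_results_["std_test_score"][est.best_index_]
              for est in res["estimator"]]
print("mean of inner-CV stds:", np.mean(inner_stds))
print("spread of the outer-fold scores:", res["test_score"].std())

# Option 2: concatenate the held-out predicted probabilities from each fold
# into one vector and take the standard deviation of that pooled set.
oof_proba = cross_val_predict(search, X_train, y_train, cv=outer_cv,
                              method="predict_proba")[:, 1]
print("std of pooled predictions:", oof_proba.std())
```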
Any and all thoughts are appreciated. Since I come from a physics background and not a statistical one, I assume there are huge flaws in my reasoning. I also like reading papers, if you have any on the right way to do cross-validation.