
My scenario.

I have datasets containing between 5,000 and 400,000 predictors (i.e., columns) and between 3,000 and 14,000 cases (i.e., rows) without any strata/subgroups. I perform a nested, n-times-repeated k-fold cross-validation (nrkfold CV, for short), in which I fit an L2-regularized multiple regression that predicts non-binary, interval-scaled values. Each individual outer CV run contains an outer training and an outer test data set (no validation sets ever). Using the given outer training set, an inner rkfold CV is performed to establish the best hyperparameter. This hyperparameter is then used to fit the L2-regularized multiple regression to the outer training data. Finally, predictions on the outer test data set are derived. So far, so good.
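For concreteness, here is a minimal sketch of that procedure using scikit-learn. The function name `nested_rkfold_cv`, the default values, and the use of `GridSearchCV` with mean squared error for the inner loop are purely illustrative choices, not part of my actual pipeline:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import RepeatedKFold, GridSearchCV

def nested_rkfold_cv(X, y, n_repeats=5, k=10, alphas=np.logspace(-3, 3, 13), seed=0):
    """Return a list of (y_pred, y_test, test_idx) triples, one per outer CV run."""
    outer = RepeatedKFold(n_splits=k, n_repeats=n_repeats, random_state=seed)
    fold_results = []
    for train_idx, test_idx in outer.split(X):
        # Inner repeated k-fold CV on the outer training set selects the ridge penalty
        inner = RepeatedKFold(n_splits=k, n_repeats=n_repeats, random_state=seed + 1)
        search = GridSearchCV(Ridge(), {"alpha": alphas}, cv=inner,
                              scoring="neg_mean_squared_error")
        search.fit(X[train_idx], y[train_idx])
        # Refit with the selected hyperparameter on the full outer training set,
        # then predict the outer test set
        model = Ridge(alpha=search.best_params_["alpha"])
        model.fit(X[train_idx], y[train_idx])
        fold_results.append((model.predict(X[test_idx]), y[test_idx], test_idx))
    return fold_results
```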

Each outer CV run thus generates a vector of predictions, which has a corresponding vector of test values. For simplicity, let's assume that these vectors always have the same length, m, in every CV run.

The predictions shall be scored by correlating them with the test values. By default, Pearson's r is used.

Three approaches to analyzing the model's performance across folds.

1. Score each fold, average the resulting correlations.

This is kind of the standard thing to do, I guess. Here, one correlates each fold's prediction vector with the corresponding test vector. Each vector is of length m. Since I conduct n*k outer cross-validations, I obtain n*k correlations, which are averaged (after Fisher z-transforming them).
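As a sketch (assuming the `fold_results` list from the nested-CV snippet above):

```python
import numpy as np
from scipy.stats import pearsonr

def per_fold_average_r(fold_results):
    # Correlate each fold's predictions with its test values, Fisher z-transform,
    # average the z values, and transform the mean back to an r
    rs = np.array([pearsonr(y_pred, y_test)[0] for y_pred, y_test, _ in fold_results])
    return np.tanh(np.mean(np.arctanh(rs)))
```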

2. Concatenate each fold's prediction and test vector.

Since n*k outer cross-validations are conducted, I obtain n*k prediction vectors and test vectors. For each vector type, I could concatenate the values, resulting in a prediction vector and a test vector of length (n*k)*m. Then I could correlate these two long vectors.
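Again as a sketch, using the same illustrative `fold_results` structure:

```python
import numpy as np
from scipy.stats import pearsonr

def pooled_r(fold_results):
    # Stack all n*k prediction and test vectors into two vectors of length
    # (n*k)*m, then compute a single correlation
    y_pred_all = np.concatenate([y_pred for y_pred, _, _ in fold_results])
    y_test_all = np.concatenate([y_test for _, y_test, _ in fold_results])
    return pearsonr(y_pred_all, y_test_all)[0]
```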

3. Average predictions across folds, then correlate the prediction vector and the test vector.

Finally, I could average the predictions across outer CVs; here, I would have to take care of the fact that different cases are predicted a different number of times. Assuming this is handled well, I end up with a prediction vector of length > m (the exact length depends on the number of unique predicted cases; its maximal length is the total number of cases). Note that the corresponding test values don't need to be averaged across CVs, since they stay constant. Hence, I could again correlate the prediction vector with the test vector.
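One possible way to handle the unequal prediction counts (again only a sketch, assuming each fold stores the test indices as in the snippets above):

```python
import numpy as np
from scipy.stats import pearsonr

def case_averaged_r(fold_results, y):
    # Average each case's predictions over the outer CV runs in which it appeared;
    # the test values stay constant, so they are not averaged
    pred_sum = np.zeros(len(y))
    pred_count = np.zeros(len(y))
    for y_pred, _, test_idx in fold_results:
        pred_sum[test_idx] += y_pred
        pred_count[test_idx] += 1
    seen = pred_count > 0  # only cases that were predicted at least once
    return pearsonr(pred_sum[seen] / pred_count[seen], y[seen])[0]
```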

My thoughts on each approach.

Regarding 1.

  • Seems to be the standard approach, which in itself is not an argument for its correctness, though.
  • It seems reasonable to judge each outer CV run separately, since each one outputs a different statistical model (i.e., different multiple-regression betas, based on a possibly unique hyperparameter), which generates idiosyncratic predictions.
  • Also, it might be comparable to Forman & Scholz's conclusion about the AUC: they show that it is best to compute the AUC for each fold separately and then average the AUCs over folds. One would have to show, though, that correlations behave similarly to AUCs, and I don't see how to do that.

Regarding 2.

  • One might argue that the logic of the paper by Forman & Scholz applies to the correlation between predicted and test values as well. In essence, Forman & Scholz show that, for the F-measure, it is best to "total the number of true positives and false positives over the folds, then compute F-measure" (p. 51). I am not sure, though, whether this is closer to concatenating (i.e., approach 2) or averaging (i.e., approach 3) the predicted values; they sum up true positives and false positives across CVs, but don't average them (which makes sense in their binary classification scheme).

Regarding 3.

  • Cases are predicted a different number of times, and might thus be predicted (after averaging) with different accuracy. This might unduly influence the correlation with the test vector.
  • Also, it doesn't feel quite right to average predictions for the same cases that were derived from different statistical models.
  • This approach is widely used in my research field, which is why I include it.

From my perspective, it boils down to the question of which approach has the smallest bias when it comes to establishing the correlation between predicted and test values. I am not quite sure whether the variance of each approach is that important: while I could establish the variance by repeating the whole nrkfold CV and seeing how variable each approach is, I fail to imagine how I could check the bias of each approach. That would require setting up a valid simulation in which the ground truth is known; a rough idea of what I have in mind is sketched below.
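Something like the following, where the population correlation between the noiseless linear signal and the outcome is fixed by construction. All function names (`nested_rkfold_cv`, `per_fold_average_r`, `pooled_r`, `case_averaged_r`) refer to the illustrative sketches earlier in this post, and the target is of course only an upper bound on what the fitted models can reach (estimation error pushes the achievable correlation below it), so it is not obvious to me that this constitutes a *valid* bias check:

```python
import numpy as np

rng = np.random.default_rng(0)
n_cases, n_predictors, target_r = 2000, 500, 0.5

# Noiseless linear signal with unit variance, plus noise scaled so that the
# population correlation between signal and outcome equals target_r
X = rng.standard_normal((n_cases, n_predictors))
beta = rng.standard_normal(n_predictors)
signal = X @ beta
signal /= signal.std()
noise_sd = np.sqrt(1.0 / target_r**2 - 1.0)
y = signal + noise_sd * rng.standard_normal(n_cases)

# Run the nested CV once, then compare the three scoring schemes to the target
fold_results = nested_rkfold_cv(X, y, n_repeats=2, k=5)
print("target r:                     ", target_r)
print("approach 1 (per-fold average):", per_fold_average_r(fold_results))
print("approach 2 (pooled):          ", pooled_r(fold_results))
print("approach 3 (case-averaged):   ", case_averaged_r(fold_results, y))
```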

This question has already been posted in several flavors, but so far none quite asks what I am after, or it is not answered with a mathematical or empirical argument (a simulation would also count). I have also already cited the paper by Forman & Scholz, which investigates this issue; however, they use different performance measures (namely, AUC and F-measure) than the one I want to use.

I would hugely welcome anyone with a mathematical or empirical argument for or against any of these three approaches! :)
