I am trying to understand the Combinatorial Purged Cross-Validation (CPCV) method of Marcos Lopez de Prado's "Advances in Financial Machine Learning" book. There are a few things that I do not fully understand.

  1. The book says that walk-forward testing and cross-validation each provide only one backtest path. For walk-forward testing I understand why there is one path, but why does CV give only one backtest path?
  2. In CPCV there are different combinations of data splits, and each split has a training set and a test set. Is the test set a validation set, i.e. in-sample? Or is it out-of-sample, i.e. cross-validation is applied to the training set only, the best model is selected there, and that model is then evaluated on the "test set"?

I have read a similar post, but I am still confused.

user1769197

1 Answer

"2". In a normal K-fold CV, you estimate AUC/RMSE/… using only a small test set from every fold and it's ok for “normal” data where you can split data randomly. But in finance, you usually split data by date period and you cannot expect similar metrics values on the different data periods. Thus, it is better to combine all your test splits from every fold and calculate 1 metric for combined test sets. In CPCV you combine it in the 6 different ways, but test set usage is pretty the same.

"1". In the standard k-fold CV there is only one path, because every test part of data is used only once. For example, let's look at a 6 fold CV and the example provided by your link (data can be split into 6 groups): firstly, you keep G1 as a test set and train model using G2-G6. At the next step - G2 as the test and G1, G3-G6 for train and so on. At the end, you estimate sharpe/returns/or_something_else using predictions combined G1 from first training, G2 from the second, ... G6 from the 6th split in CV. In total, you have only 1 path here. As for the CPCV, every Group will be used as the test set 6 times here, so you can get 6 different sharpe/return/... calculated on the full data (G1-G6 groups).

Regards, Mark

markmipt