
I am using cross_val_predict to generate cross-validated estimates using Ridge Regression:

from sklearn import linear_model
from sklearn.model_selection import cross_val_predict
reg = linear_model.Ridge(alpha=0.5)
pred_r = cross_val_predict(reg, X, y, cv=None)

Based on this, the correlation between the predicted y and the real y is (0.114601783602, 0.00312638915351).

However, when I use RidgeCV instead:

reg = linear_model.RidgeCV(alphas=[0.1, 1.0, 10.0], cv=10, fit_intercept=True, scoring=None, normalize=False)
reg.fit(X, y)  # the model must be fitted before predict(); note this fits and predicts on the same data
pred_r = reg.predict(X)

I get a much higher correlation: (0.330446577353, 2.3472470222e-18)

Why do I get such different results? I thought these two analyses should generate the same output. Any ideas? Is the way I use RidgeCV correct and valid? Also, since I have around 600 samples, I believe it would be reasonable not to divide the data into training and test sets, and just do CV. I am double-checking this since the results might be published in a journal.
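The correlations above look like (Pearson r, p-value) pairs; a minimal sketch of how they might have been computed, assuming scipy.stats.pearsonr was used (the post does not show this step):

# Hypothetical reconstruction of the reported correlation values:
# pearsonr returns the (correlation coefficient, two-sided p-value) pair.
from scipy.stats import pearsonr

r, p = pearsonr(pred_r, y)  # assumes pred_r and y are 1-D arrays of equal length
print(r, p)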

renakre
  • One thing that suddenly came to mind: it might be that in the latter case you are fitting the model and predicting on the same data. Try doing cross_val_predict() on the RidgeCV model as well (assuming this is allowed). I would strongly advise against just using the better results without understanding why they are better, especially if it is going into a proper journal – einar Feb 21 '17 at 10:06
  • @transmetro thanks for the advice! You mean I should split the data? Or do you mean `_y = cross_val_predict(regRidgeCV, X, y, cv=10)`? Would that mean double cross-validation, since RidgeCV already does it? Also, I previously tried using the training data as the test set, and in that case, yes, I obtained ridiculously high performance. – renakre Feb 21 '17 at 10:12
  • I mean the latter. You are right that you would be doing cross-validation twice. In cases such as ridge, where you have a tuning parameter, you should really do what is called nested cross-validation: an outer loop for estimating model performance and one cross-validation per fold where you estimate the tuning parameters (see the sketch after these comments). – einar Feb 21 '17 at 10:16
  • You should generally avoid splitting the data, see my answer to this q: http://stats.stackexchange.com/questions/261643/do-i-need-a-global-test-set-when-using-k-fold-cross-validation-and-a-small-n-n/261879#261879 See also cbeleites' answer there – einar Feb 21 '17 at 10:18
  • @transmetro In this case, I am getting 0.130 as the correlation, which is lower than the correlation obtained with only RidgeCV. Also, when I repeat the same on the resampled data I obtain a higher correlation (around .3). I am really confused by all this :) So, normally one would do Ridge + CV separately. I thought using the RidgeCV method of sklearn should be the same. But then you recommended applying CV on RidgeCV and the results changed again. With resampled data, the results change again. So, I am really not sure what is happening. – renakre Feb 21 '17 at 10:52
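A minimal sketch of the nested cross-validation einar describes in the comments above, assuming scikit-learn's RidgeCV as the inner loop and cross_val_predict as the outer loop (X and y are the arrays from the question):

# Nested cross-validation sketch: the inner RidgeCV picks alpha within each
# outer training fold; the outer cross_val_predict estimates performance.
from sklearn import linear_model
from sklearn.model_selection import cross_val_predict

inner = linear_model.RidgeCV(alphas=[0.1, 1.0, 10.0], cv=10)
pred_nested = cross_val_predict(inner, X, y, cv=10)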

2 Answers


Cross validation is a randomized procedure. You randomly assign samples to one of $k$ folds, then estimate your error statistic in the usual $k$-fold cross validation way. Depending on which samples are in which fold, you will get different parameter estimates and hence different predictions.
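To see this effect directly, one can pass an explicitly shuffled fold assignment to cross_val_predict and vary the seed; a minimal sketch, assuming the X and y from the question:

# Different random fold assignments give different cross-validated predictions.
from sklearn import linear_model
from sklearn.model_selection import KFold, cross_val_predict

reg = linear_model.Ridge(alpha=0.5)
for seed in (0, 1, 2):
    folds = KFold(n_splits=10, shuffle=True, random_state=seed)
    pred = cross_val_predict(reg, X, y, cv=folds)
    # pred differs from seed to seed because the fold membership differs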

There might be differences in the calls to the cross-validation functions as well, though I'm no expert in the implementation of either. In the first case it looks as though you have specified a single penalty parameter, while the second seems to cross-validate over a (small) grid of them. This, too, will affect the results of the cross-validation procedure: the second one has the opportunity to pick the best of three possible penalty parameters. You should read the documentation for both functions.
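In scikit-learn terms, the first call fixes the penalty while the second searches a small grid and keeps the best value; a minimal sketch of inspecting that choice, assuming the X and y from the question:

# RidgeCV fits the model for each candidate alpha and keeps the best one;
# the selected value is exposed as alpha_ after fitting.
from sklearn import linear_model

reg_cv = linear_model.RidgeCV(alphas=[0.1, 1.0, 10.0], cv=10)
reg_cv.fit(X, y)
print(reg_cv.alpha_)  # the penalty chosen by cross-validation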

In summary: there is randomness inherent in cross-validation, so you can't expect exactly the same results. Also, you might be using the two functions differently, which is another opportunity for the results to differ.

einar

In addition to the correct answer, I want to point out that you are using different $\alpha$ parameters.

Whereas

  • in linear_model.Ridge() you use $\alpha = 0.5$,
  • in linear_model.RidgeCV() you allow $\alpha$ to be chosen from $[0.1, 1, 10]$.

Further, since you define cv=None, cross_val_predict() uses 3-fold cross-validation, as you can read in the documentation. On the other hand, you use 10-fold cross-validation in linear_model.RidgeCV(), since you define cv=10. So besides the randomness introduced by cross-validation, as stated in the previous answer, your choice of parameters harms the comparability of the results.
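To make the two results comparable, one could use the same penalty and the same number of folds in both calls; a minimal sketch, reusing the X and y from the question:

# Same penalty candidate and same fold count for both procedures.
from sklearn import linear_model
from sklearn.model_selection import cross_val_predict

pred_fixed = cross_val_predict(linear_model.Ridge(alpha=0.5), X, y, cv=10)
pred_tuned = cross_val_predict(linear_model.RidgeCV(alphas=[0.5], cv=10), X, y, cv=10)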

Nikolas Rieble
  • I have run the same experiment with an α value of **0.5** in both cases, with `cv` set to 10, but I am still getting almost the same results. I am not sure that randomness can create such a difference. – renakre Feb 20 '17 at 14:36
  • Did you set alpha to [0.5] for linear_model.RidgeCV(), or did you just add 0.5, as in [0.1, 0.5, 1, 10]? Further, if you want to understand/evaluate the influence of the randomness due to CV, you might want to have a look at reg.coef_, where you will probably already see differences in the model. – Nikolas Rieble Feb 20 '17 at 14:48
  • No, I changed it to [0.5]. I repeated this on several chunks of the same data and obtained the same trend: `RidgeCV` provides better results. If everything seems fine in my approach, then I will report it as it is. I will also check `reg.coef_` as you suggested, thanks! – renakre Feb 20 '17 at 15:03
  • Further, you might consider plotting your results as in http://scikit-learn.org/stable/auto_examples/plot_cv_predict.html, which might help both of us understand what is going on (a minimal sketch follows below). Finally, what about the accuracy? Could you provide the accuracy that you achieve in both cases? – Nikolas Rieble Feb 20 '17 at 15:03
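A minimal sketch of the predicted-versus-actual plot suggested in the last comment, in the spirit of the linked scikit-learn example (assuming pred_r holds the cross-validated predictions and y the true values from the question):

# Scatter the cross-validated predictions against the true values.
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.scatter(y, pred_r)
ax.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=2)  # identity line
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
plt.show()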