
I am fitting a regression model using k-fold cross-validation on a dataset with ~200 observations, and I noticed that my $R^2$ score on the training data is positive (average 0.7) while my $R^2$ score on the test data is negative. What does it mean, in general, when $R^2$ is positive on the training set but negative on the test set? I don't understand how the model can capture the pattern in the training data but not in the test data.

Fuustack

1 Answer


You have encountered a very common phenomenon called overfitting! Pretty much every statistical model fits its training data better than new, unseen data, because its parameters were chosen specifically to fit the training data as well as possible. Even if there is a real relationship between the inputs and the response, the relationship you estimated from the training data will not match it exactly. So you should expect a lower $R^2$ on test data than on training data.
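For reference, the standard coefficient-of-determination definition of $R^2$ (the one used by, e.g., scikit-learn's `r2_score`) makes this concrete. It compares the model's squared errors to those of a baseline that always predicts the mean $\bar{y}$ of the observed responses:

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}.$$

So $R^2 = 1$ means perfect prediction, $R^2 = 0$ means the model is no better than always predicting the mean, and $R^2 < 0$ means it is doing worse than the mean.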

The same thing happens, more severely, when there is actually no true relationship between the inputs and the response. In that case, the fitting procedure overfits to whatever apparent pattern it finds in the random noise of the training data. When you then apply the model to test data, whose noise is different, the model can easily predict worse than simply predicting the mean of the new responses, and by the formula above that is exactly when $R^2$ on the test data is negative.
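To see this concretely, here is a minimal sketch with synthetic data (assuming a scikit-learn workflow with a plain linear regression; the response is pure noise, and the feature count is deliberately generous relative to the ~200 samples so the effect is visible):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)

# ~200 observations; the response is pure noise, so there is no true
# relationship for the model to find.
X = rng.normal(size=(200, 50))
y = rng.normal(size=200)

train_scores, test_scores = [], []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    train_scores.append(r2_score(y[train_idx], model.predict(X[train_idx])))
    test_scores.append(r2_score(y[test_idx], model.predict(X[test_idx])))

# The model "explains" some of the training noise (positive R^2) but
# transfers nothing to the held-out folds (negative R^2).
print(f"mean train R^2: {np.mean(train_scores):+.3f}")
print(f"mean test  R^2: {np.mean(test_scores):+.3f}")
```

On a run like this you should see a clearly positive training $R^2$ and a negative test $R^2$, mirroring what you observed.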

Wesley