Why does the K-fold cross-validation estimate of the MSE of a polynomial regression decrease monotonically as the polynomial degree increases?

Question

From the book "An Introduction to Statistical Learning" I learned that the CV estimate of the test MSE typically has a U shape: increasing the polynomial degree initially fits the data better, but past some point the model starts to overfit the training data, which increases the estimated test MSE and produces the U shape.

In my case, the 10-fold CV test MSE estimate is monotonically decreasing:

I am modeling transit speed across a border and I have data from several months. A scatter plot of speed vs time of the day is like this:

where minutesDay is the number of minutes since midnight. One can see that there is a jam between 7am and 10am.

Why does the test MSE estimate decrease monotonically?

Is it because the dataset contains a large fraction of near-duplicate observations (every day the observations follow the same pattern), so there is a high chance of getting a test partition that is very similar to the training partitions?

Maybe you haven't allowed enough degrees to see the U behavior. — Cagdas Ozgenc, Sep 11 '17 at 12:19
You are right. I did the test for 20 degrees and after degree 10 it starts to go slightly up. Pity that I cannot add another chart showing it. If you add your comment as an answer I will be glad to accept it. — Fernando González, Sep 11 '17 at 13:08

score 1 · Answer 1 · answered Sep 11 '17 at 13:20

Two possibilities:

1. Your models have not yet reached the complexity at which variance error dominates and increases the RMSE (good).
2. (Temporal) structure in your data is not taken into account by the splitting procedure (bad: you do not guard against an important contribution to overfitting). For example, today is easier to predict if yesterday and tomorrow are in the training data.

This differs from the temporal structure you suspect. If there is a general structure with respect to hour of day, day of week, etc., and the model will also have this information for the unknown cases it must predict in earnest, then it is good if the model recognizes this structure. That is not the same as, e.g., predicting by taking into account what happened after the case in question, which is information that will not be available for real predictions.
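
One way to respect the day-level structure described here is to assign whole days to folds, so that no day contributes to both the training and the test partition. The following is a minimal sketch using NumPy on synthetic data; the names `random_folds`, `day_folds`, `poly_cv_mse` and the simulated speeds are hypothetical stand-ins, not the asker's data or code.

```python
import numpy as np
from numpy.polynomial import Polynomial

def random_folds(n_rows, k, rng):
    """Ordinary K-fold: each row gets a fold label regardless of its day."""
    return rng.permutation(n_rows) % k

def day_folds(day, k, rng):
    """Grouped K-fold: all rows of a given day share one fold label."""
    unique_days, day_index = np.unique(day, return_inverse=True)
    fold_of_day = rng.permutation(len(unique_days)) % k
    return fold_of_day[day_index]

def poly_cv_mse(x, y, folds, degree):
    """Average held-out MSE of a polynomial fit over the given folds."""
    fold_mse = []
    for f in np.unique(folds):
        test = folds == f
        # Polynomial.fit rescales x internally, which keeps the fit stable.
        p = Polynomial.fit(x[~test], y[~test], degree)
        fold_mse.append(np.mean((y[test] - p(x[test])) ** 2))
    return float(np.mean(fold_mse))

# Synthetic stand-in for the border data (an assumption, not the real data):
# a slowdown around 08:30, a random offset per day, and observation noise.
rng = np.random.default_rng(0)
n_days, per_day = 40, 48
day = np.repeat(np.arange(n_days), per_day)
minutes_day = np.tile(np.linspace(0, 1439, per_day), n_days)
speed = (60 - 25 * np.exp(-((minutes_day - 510) / 90) ** 2)
         + rng.normal(0, 3, n_days)[day] + rng.normal(0, 2, day.size))

folds_random = random_folds(day.size, 10, rng)
folds_by_day = day_folds(day, 10, rng)
for degree in (2, 6, 10):
    print(degree,
          poly_cv_mse(minutes_day, speed, folds_random, degree),
          poly_cv_mse(minutes_day, speed, folds_by_day, degree))
```

With time of day as the only predictor the two estimates may turn out close; the grouped split is likely to matter more once the model includes features that can pick up day-specific information.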

I think it's both. The point in temporal structure led me to this other question [https://stats.stackexchange.com/questions/14099/using-k-fold-cross-validation-for-time-series-model-selection] where it explains how to make cross validation on time-series. — Fernando González, Sep 12 '17 at 20:13

score 0 · Accepted Answer · answered Sep 12 '17 at 07:19

When there are several data points for each value on the X axis, polynomial fitting is not as sensitive as when there is a single data point per value. One can increase the polynomial degree without causing severe fluctuations along the Y axis.

Having said that, as you increase the number of parameters, the variance of the parameter estimates starts to increase. But in your scenario, adding higher-degree terms does not cause big problems, because these terms usually end up with small coefficients, so the fit effectively behaves like a lower-degree polynomial.

If you extend the above graph to include more degrees, you will eventually see performance deteriorate, albeit slowly.
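
As a sketch of that check, the loop below runs 10-fold CV for degrees 1 to 20 and reports the degree with the lowest estimate. It assumes NumPy and synthetic data with many observations spread over the day; `cv_mse_by_degree` and the simulated speeds are hypothetical, so the degree at which the curve turns upward will differ on the real data.

```python
import numpy as np
from numpy.polynomial import Polynomial

def cv_mse_by_degree(x, y, degrees, k=10, seed=0):
    """Return {degree: K-fold CV estimate of the test MSE}."""
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(x)) % k  # the same folds are reused for every degree
    estimates = {}
    for degree in degrees:
        fold_mse = []
        for f in range(k):
            test = folds == f
            # Fit on the other k-1 folds, score on the held-out fold.
            p = Polynomial.fit(x[~test], y[~test], degree)
            fold_mse.append(np.mean((y[test] - p(x[test])) ** 2))
        estimates[degree] = float(np.mean(fold_mse))
    return estimates

# Synthetic speeds with a morning slowdown (an assumption for illustration).
rng = np.random.default_rng(1)
minutes_day = rng.uniform(0, 1440, 600)
speed = (60 - 25 * np.exp(-((minutes_day - 510) / 90) ** 2)
         + rng.normal(0, 3, 600))

estimates = cv_mse_by_degree(minutes_day, speed, range(1, 21))
best_degree = min(estimates, key=estimates.get)
for degree, mse in estimates.items():
    print(f"degree {degree:2d}: CV MSE = {mse:.2f}")
print("lowest CV MSE at degree", best_degree)
```

`Polynomial.fit` is used instead of a raw power basis because it maps the inputs to a standard interval before fitting, which helps numerically at degrees as high as 20.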

Also consider the suggestion in the other answer and add some predictive variables, such as day of week, holiday flags, or the value from the previous week on the same day of week (i.e., if today is Friday, what happened last Friday). This way you may end up with better predictions using fewer explanatory variables, instead of raising the polynomial degree to high values. The model will be more stable and more predictive for the future.
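
To illustrate the idea of adding calendar predictors, here is a minimal sketch that combines a low-degree polynomial in time of day with day-of-week indicator columns and fits the result by ordinary least squares. It assumes NumPy; the helpers `design_matrix` and `train_mse`, the choice of degree 3, and the synthetic data with a weekend effect are illustrative assumptions, not taken from the question.

```python
import numpy as np

def design_matrix(minutes_day, day_of_week, degree=3, use_dow=True):
    """Columns: 1, t, ..., t^degree, then six day-of-week indicators.

    Day 0 is the baseline, so its indicator is omitted.
    """
    t = minutes_day / 720.0 - 1.0  # rescale minutes to roughly [-1, 1]
    columns = [t ** p for p in range(degree + 1)]
    if use_dow:
        columns += [(day_of_week == d).astype(float) for d in range(1, 7)]
    return np.column_stack(columns)

def train_mse(X, y):
    """In-sample MSE of the ordinary least-squares fit of y on X."""
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.mean((y - X @ coefs) ** 2))

# Synthetic data (an assumption): eight weeks of hourly observations,
# a morning slowdown, and faster traffic on days 5 and 6.
rng = np.random.default_rng(2)
n_days = 7 * 8
n = n_days * 24
minutes_day = np.tile(np.arange(24) * 60.0, n_days)
day_of_week = np.repeat(np.arange(n_days) % 7, 24)
weekend_bonus = np.where(day_of_week >= 5, 10.0, 0.0)
speed = (60 - 25 * np.exp(-((minutes_day - 510) / 90) ** 2)
         + weekend_bonus + rng.normal(0, 3, n))

X_time = design_matrix(minutes_day, day_of_week, use_dow=False)
X_full = design_matrix(minutes_day, day_of_week)
mse_time_only = train_mse(X_time, speed)
mse_with_dow = train_mse(X_full, speed)
print(mse_time_only, mse_with_dow)
```

The in-sample error cannot increase when columns are added, so the comparison that matters is the cross-validated error of the two designs, not the training MSE printed here.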
