
I'm building a ridge regression model in scikit-learn and trying to find the optimal polynomial degree to use. The data I'm working with is a fairly predictable time series of hourly traffic volumes, and I'm predicting those volumes from the date, hour, and day of the week. R-squared values increase for both my train and test sets as I generate higher-degree polynomial features, but they suddenly drop from .91 to -1.4 when I go from degree 8 to degree 9, meaning the degree-9 model fits worse than simply predicting the mean (a 0-order model).

Any idea why this happens?
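
For concreteness, a minimal sketch of the kind of setup being described (the question includes no code; `X` and `y` are placeholder names for a feature matrix of date/hour/day-of-week columns and the target volumes, and the ridge penalty is arbitrary):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# X: date (as an ordinal number), hour, day of week; y: hourly traffic volumes
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=False)

for degree in range(1, 10):
    model = make_pipeline(PolynomialFeatures(degree), Ridge(alpha=1.0))
    model.fit(X_train, y_train)
    print(degree,
          model.score(X_train, y_train),  # train R^2
          model.score(X_test, y_test))    # test R^2
```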

kjetil b halvorsen
Kevin Zhang
  • Quick guess: I don't use these packages, but there can be numerical issues with polynomial curve fitting. As $n$ gets bigger, $x^n$ can get absurdly large and you may get an ill-conditioned design matrix. Something to try (if the package doesn't do it for you... which it may) is to center the data at zero first (i.e. subtract off the mean of $x$) and do your curve fitting on that (see the sketch after these comments). – Matthew Gunn Jul 27 '16 at 15:56
  • That's a nice idea. Yeah, I've noticed that a lot of the coefficients the high-order model generates are minuscule (1e-8 or smaller). So I might have found the breaking point for the optimization function; maybe the learning rate is too large, or it can't handle all the dimensions. – Kevin Zhang Jul 27 '16 at 16:33
  • Regarding the possibility that the high-order polynomial is ill-conditioned, I wrote [this answer](http://stats.stackexchange.com/questions/143324/what-is-the-significance-of-a-linear-dependency-in-a-polynomial-regression/143326#143326). However, I think it's more likely that a high-order polynomial is ill-advised because, well, it's generally a bad model. Especially for time-series data, there are better ways to account for time-varying trends like this, such as [time-series analysis](http://stats.stackexchange.com/questions/tagged/time-series). – Sycorax Jul 28 '16 at 03:08
  • @GeneralAbrial On first glance, which time-series analysis tool would you look into? I read a bit into exponential smoothing and that seems to fit the bill. – Kevin Zhang Jul 28 '16 at 15:48
  • @KevinZhang The data you have sounds like it has trends that occur within days, within weeks, and within years. Gelman tackles a similar data set in *Bayesian Data Analysis*, 3rd ed., using Gaussian process regression. But I'm not an expert on time-series analysis; it's a vast field. – Sycorax Jul 28 '16 at 17:10
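
A hedged sketch of the centering/scaling idea from the first comment: in scikit-learn you could standardize the raw inputs before expanding them into polynomial terms. This is illustrative, not the asker's code; `X_train` and friends are placeholders.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge

# Center and scale the raw inputs first so that x**n stays numerically tame,
# then expand to polynomial terms and fit the ridge model.
model = make_pipeline(StandardScaler(),
                      PolynomialFeatures(degree=8),
                      Ridge(alpha=1.0))
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # test R^2
```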

2 Answers


Fitting polynomials to time series data (without prior theory) is, in my opinion, never a good idea and is definitely anachronistic (i.e. an outdated approach). See "Does the p-value in the incremental F-test determine how many trials I expect to get correct?" for @whuber's wise reflection. Perhaps you have latent structure reflecting level shifts/multiple time trends, or memory structure that calls for lags of one or more series, or anomalous data points (pulses), or even non-constant error variance. Remedying these issues can often lead to reasonable models, as @Sycorax and [kjetil b halvorsen](https://stats.stackexchange.com/users/11887/kjetil-b-halvorsen) pointed out.
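
This is not the answerer's code, but a rough illustration of what "lags of the series" and a "level shift" could look like as regression features in pandas; the series name, lag choices, and break date are all made up for the example:

```python
import pandas as pd

# y: hourly traffic volumes with a DatetimeIndex (hypothetical)
df = pd.DataFrame({"volume": y})
df["lag_1"]   = df["volume"].shift(1)    # previous hour
df["lag_24"]  = df["volume"].shift(24)   # same hour yesterday
df["lag_168"] = df["volume"].shift(168)  # same hour last week
df["level_shift"] = (df.index >= "2016-03-01").astype(int)  # hypothetical break date
df = df.dropna()  # rows with full lag history, ready for a linear model
```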

IrishStat

I think this is happening due to overfitting. A model with a very high polynomial degree sometimes performs worse than the original, lower-degree model.

This is because the high-degree model doesn't generalise well to data it hasn't seen.

An overfit model always reduces the training error. It is natural for the training error to keep dropping as you increase the degree of your model. In fact, at some point your training error will reach 0.
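
A tiny demonstration of that last point on made-up data: a degree $n-1$ polynomial can pass through $n$ points exactly, so the training error collapses to zero even when the "signal" is pure noise:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = rng.normal(size=10)              # pure noise, no real signal

coefs = np.polyfit(x, y, deg=9)      # degree = number of points - 1
residuals = y - np.polyval(coefs, x)
print(np.abs(residuals).max())       # ~0: the training data is interpolated exactly
```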

[Figure: the relation between model complexity and training and test errors]

Near