
I'm building a ridge regression model in scikit-learn and trying to find the optimal polynomial degree to use. The data I'm working with is a fairly predictable time series of hourly traffic volumes, and I'm predicting those volumes from the date, hour, and day of the week. R-squared values increase for both my train and test sets as I generate higher-degree polynomial features, but they suddenly drop from .91 to -1.4 when I go from degree 8 to degree 9, meaning the degree-9 model fits worse than simply predicting the mean (a 0-order model).

Any idea why this happens?
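
For concreteness, a minimal sketch of the kind of setup being described (the question includes no code; `X` and `y` are placeholder names for a feature matrix of date/hour/day-of-week columns and the target volumes, and the ridge penalty is arbitrary):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# X: date (as an ordinal number), hour, day of week; y: hourly traffic volumes
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=False)

for degree in range(1, 10):
    model = make_pipeline(PolynomialFeatures(degree), Ridge(alpha=1.0))
    model.fit(X_train, y_train)
    print(degree,
          model.score(X_train, y_train),  # train R^2
          model.score(X_test, y_test))    # test R^2
```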

kjetil b halvorsen
Kevin Zhang
  • Quick guess: I don't use these packages, but there can be numerical issues with polynomial curve fitting. As $n$ gets bigger, $x^n$ can get absurdly large and you may get an ill-conditioned design matrix. Something to try (if the package doesn't do it for you... which it may) is to center the data at zero first (i.e. subtract off the mean of $x$) and do your curve fitting on that (see the sketch after these comments). – Matthew Gunn Jul 27 '16 at 15:56
  • That's a nice idea. Yeah, I've noticed that a lot of the coefficients the high-order model generates are minuscule (1e-8 or smaller). So I might have found the breaking point for the optimization function; maybe the learning rate is too large, or it can't handle all the dimensions. – Kevin Zhang Jul 27 '16 at 16:33
  • Regarding the possibility that the high-order polynomial is ill-conditioned, I wrote [this answer](http://stats.stackexchange.com/questions/143324/what-is-the-significance-of-a-linear-dependency-in-a-polynomial-regression/143326#143326). However, I think it's more likely that a high-order polynomial is ill-advised because, well, it's generally a bad model. Especially for time-series data, there are better ways to account for time-varying trends like this, such as [time-series analysis](http://stats.stackexchange.com/questions/tagged/time-series). – Sycorax Jul 28 '16 at 03:08
  • @GeneralAbrial On first glance, which time-series analysis tool would you look into? I read a bit into exponential smoothing and that seems to fit the bill. – Kevin Zhang Jul 28 '16 at 15:48
  • @KevinZhang The data you have sounds like it has trends that occur within days, within weeks, and within years. Gelman tackles a similar data set in *Bayesian Data Analysis*, 3rd ed., using Gaussian process regression. But I'm not an expert on time-series analysis; it's a vast field. – Sycorax Jul 28 '16 at 17:10
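
A hedged sketch of the centering/scaling idea from the first comment: in scikit-learn you could standardize the raw inputs before expanding them into polynomial terms. This is illustrative, not the asker's code; `X_train` and friends are placeholders.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge

# Center and scale the raw inputs first so that x**n stays numerically tame,
# then expand to polynomial terms and fit the ridge model.
model = make_pipeline(StandardScaler(),
                      PolynomialFeatures(degree=8),
                      Ridge(alpha=1.0))
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # test R^2
```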

2 Answers


Fitting polynomials to time series data (without prior theory) is, in my opinion, never a good idea and is definitely anachronistic (i.e. an outdated approach). See "Does the p-value in the incremental F-test determine how many trials I expect to get correct?" for @whuber's wise reflection. Perhaps you have latent structure reflecting level shifts/multiple time trends, or memory structure that calls for lags of one or more series, or anomalous data points (pulses), or even non-constant error variance. Remedying these issues can often lead to reasonable models, as @Sycorax and [kjetil b halvorsen](https://stats.stackexchange.com/users/11887/kjetil-b-halvorsen) pointed out.
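
This is not the answerer's code, but a rough illustration of what "lags of the series" and a "level shift" could look like as regression features in pandas; the series name, lag choices, and break date are all made up for the example:

```python
import pandas as pd

# y: hourly traffic volumes with a DatetimeIndex (hypothetical)
df = pd.DataFrame({"volume": y})
df["lag_1"]   = df["volume"].shift(1)    # previous hour
df["lag_24"]  = df["volume"].shift(24)   # same hour yesterday
df["lag_168"] = df["volume"].shift(168)  # same hour last week
df["level_shift"] = (df.index >= "2016-03-01").astype(int)  # hypothetical break date
df = df.dropna()  # rows with full lag history, ready for a linear model
```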

IrishStat

I think this is happening due to overfitting. A model with a very high polynomial degree sometimes performs worse than the original, lower-degree model.

This is because the high-degree model doesn't generalise well to data it hasn't seen.

An overfit model always reduces the training error. It is natural for the training error to keep dropping as you increase the degree of your model. In fact, at some point your training error will reach 0.
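
A tiny demonstration of that last point on made-up data: a degree $n-1$ polynomial can pass through $n$ points exactly, so the training error collapses to zero even when the "signal" is pure noise:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = rng.normal(size=10)              # pure noise, no real signal

coefs = np.polyfit(x, y, deg=9)      # degree = number of points - 1
residuals = y - np.polyval(coefs, x)
print(np.abs(residuals).max())       # ~0: the training data is interpolated exactly
```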

[Figure: the relation between model complexity and training and test errors]

Near