Higher-order polynomial fits do not match training data

Question

I am fitting a high-order polynomial (order 15+) to some simulated training data. I know that the features become collinear as I increase the order of the polynomial, but I do not understand why my fits are so far off! Even with collinearity, the fits should be reasonable. The issue is not related to the size of the training data; for example, see the figures below, which use many training samples.

The code is below:

import numpy as np
import matplotlib.pyplot as plt
# %matplotlib inline  (Jupyter magic; uncomment when running in a notebook)

def gen_data(num_train):
    np.random.seed(100)
    trainX =  np.float64(np.linspace(1,1.5,num_train))
    train_noise =  np.float64(np.random.normal(0, 0.1, num_train)) 
    trainY = np.sin(10*trainX) + train_noise
    return trainX, trainY

def polynomial_feature(X,order):
    for i in range(2,order+1):
        X = np.column_stack((X,X[:,0]**i))
    return X

def append_bias(X):
    num_ins, num_feas = X.shape[0], X.shape[1]
    Xb = np.ones((num_ins, num_feas+1))
    Xb[:, 1:] = X
    return Xb

def l2_closed_form(Xb, y):
    # Normal equations with an explicit inverse: theta = (Xb^T Xb)^{-1} Xb^T y
    return np.dot(np.linalg.inv(np.dot(Xb.T, Xb)), np.dot(Xb.T, y))

def plotXY(trainX, trainY, plotX, plotY,title=None):
    plt.scatter(trainX[:, 1], trainY, s=  2)
    plt.plot(plotX[:, 1], plotY, color = 'k', linewidth=2)
    plt.xlabel('x')
    plt.ylabel('y')
    plt.show()

def q1(trainX, trainY):
    theta = l2_closed_form(trainX, trainY)
    plotX, _ = gen_data(num_train=1000)
    plotX = polynomial_feature(plotX.reshape(-1,1),polynomalOrder)
    plotX = append_bias(plotX)
    plotY = np.dot(plotX, theta)
    plotXY(trainX, trainY, plotX, plotY)



def load_data():
    trainX, trainY= gen_data(num_train=15)
    trainX = polynomial_feature(trainX.reshape(-1,1),polynomalOrder)
    trainX = append_bias(trainX)
    return trainX, trainY

polynomalOrder = 15  # module-level name, read by q1() and load_data()
trainX, trainY = load_data()
q1(trainX, trainY)

You seem to have 20 points and you fit a 12th order polynomial. Nonsensical results are almost a certainty. Numerical Linear Algebra has its limit... :) Probably any fit above 5th order will be a shot in the dark. — usεr11852, Mar 12 '17 at 20:55
Check the condition number of $X^TX$. It's probably enormous. You're also computing the OLS parameter estimates in a very numerically-unstable way, cf a more stable algorithm like QR decomposition. But even then, you might have issues because of the collinearity that you acknowledge. Using a different basis (e.g. B-splines) would be better. — Sycorax, Mar 12 '17 at 20:57
I was expecting the higher-order model to pass through all the training data... why doesn't it? A linear fit should be unaffected by collinearity: the model parameters should settle on something reasonable (passing through the training data), although the fit itself and the standard errors of the estimates may be inflated. I suspect numerical instability and saturation, which I tried to fix by making the x-range small and declaring the parameters as float64. I do not get it. — pemfir, Mar 12 '17 at 21:09
The issue is not related to the size of the data, @usεr11852; please see the additional figures I added. — pemfir, Mar 12 '17 at 21:13
Check @Sycorax's suggestion; it is clearly an issue with your numerics. Originally, when you had 20 points, the issue was extremely blunt, while now you will have to check $X^TX$ to convince yourself. If anything, you are using matrix inversion instead of QR. The higher-order model is not guaranteed to pass through the data. Check the threads [here](http://stats.stackexchange.com/questions/154485) and [here](http://stats.stackexchange.com/questions/160007); I think they will help you a lot. — usεr11852, Mar 12 '17 at 21:26
Fundamentally this is a question about numerical stability and basis expansion. You'll need to be familiar with finite-precision arithmetic, even if you're using doubles. Also your new plots have introduced a new wrinkle, which is that it appears the data are not on a line that can be matched by a 12th-order polynomial, so even if everything else were ok, you'd still have a problem. — Sycorax, Mar 12 '17 at 21:48
Also this answer is helpful re: algorithms http://stats.stackexchange.com/questions/160179/do-we-need-gradient-descent-to-find-the-coefficients-of-a-linear-regression-mode/164164#164164 — Sycorax, Mar 12 '17 at 22:04

score 1 · Answer 1 · answered Aug 21 '18 at 10:57

Check the condition number of $X^TX$. It's probably enormous. You're also computing the OLS parameter estimates in a very numerically-unstable way, cf a more stable algorithm like QR decomposition. But even then, you might have issues because of the collinearity that you acknowledge. Using a different basis (e.g. B-splines) would be better. – Sycorax
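
To make the condition-number check concrete, here is a minimal sketch (not part of the original comments). It rebuilds the question's design matrix (bias column plus $x, x^2, \dots, x^{\text{order}}$) with `np.vander`; the helper name `monomial_design` and the list of orders are illustrative assumptions. Note also that with `num_train=15` and order 15 the matrix has 16 columns but only 15 rows, so $X^TX$ is singular even in exact arithmetic.

```python
import numpy as np

def monomial_design(x, order):
    """Columns 1, x, x**2, ..., x**order (hypothetical helper that mirrors
    append_bias(polynomial_feature(...)) from the question)."""
    return np.vander(x, order + 1, increasing=True)

x = np.linspace(1, 1.5, 15)  # same grid as gen_data(num_train=15)

for order in (3, 6, 9, 12, 15):
    X = monomial_design(x, order)
    gram = X.T @ X  # the matrix that l2_closed_form inverts
    print(f"order {order:2d}: cond(X) = {np.linalg.cond(X):.2e}, "
          f"cond(X^T X) = {np.linalg.cond(gram):.2e}, "
          f"rank(X) = {np.linalg.matrix_rank(X)} of {X.shape[1]} columns")
```

Double precision carries roughly 16 significant digits, so once the condition number of $X^TX$ approaches $10^{16}$ the explicit inverse should not be expected to carry any reliable digits.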

Check @Sycorax's suggestion; it is clearly an issue with your numerics. Originally, when you had 20 points, the issue was extremely blunt, while now you will have to check $X^TX$ to convince yourself. If anything, you are using matrix inversion instead of QR. The higher-order model is not guaranteed to pass through the data. Check the threads [here](http://stats.stackexchange.com/questions/154485) and [here](http://stats.stackexchange.com/questions/160007); I think they will help you a lot. – usεr11852
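
As a rough illustration of the point about matrix inversion (again, not from the original comments), the sketch below fits the same order-15 monomial model twice: once with the explicit inverse of $X^TX$ as in the question, and once with `np.linalg.lstsq`, which works on $X$ directly. Note that `lstsq` is SVD-based rather than QR-based, so it stands in here for "a more stable algorithm", not for QR specifically. The sample size of 200, the seed, and the helper names are assumptions, and the noise draws differ from those in the question.

```python
import numpy as np

def make_data(n, seed=100):
    """Simulated data in the spirit of gen_data(); a different RNG stream is used."""
    rng = np.random.default_rng(seed)
    x = np.linspace(1, 1.5, n)
    y = np.sin(10 * x) + rng.normal(0, 0.1, n)
    return x, y

def rms_residual(X, y, theta):
    """Root-mean-square training residual of the fit X @ theta."""
    return float(np.sqrt(np.mean((X @ theta - y) ** 2)))

x, y = make_data(200)
X = np.vander(x, 16, increasing=True)  # bias column plus x, ..., x**15

# The question's approach: explicitly invert X^T X.
try:
    theta_inv = np.linalg.inv(X.T @ X) @ (X.T @ y)
    print("normal equations RMS:", rms_residual(X, y, theta_inv))
except np.linalg.LinAlgError:
    # inv() raises only when the factorization hits an exactly zero pivot;
    # more often it returns an inaccurate result without complaint.
    print("normal equations: matrix reported as singular")

# Least squares on X itself, without ever forming X^T X.
theta_ls, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print("lstsq RMS:", rms_residual(X, y, theta_ls), "numerical rank:", rank)
```

The reported numerical rank is worth a look: if it is below 16, some of the monomial columns are indistinguishable at double precision, which is the collinearity problem in concrete form.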

Fundamentally this is a question about numerical stability and basis expansion. You'll need to be familiar with finite-precision arithmetic, even if you're using doubles. Also your new plots have introduced a new wrinkle, which is that it appears the data are not on a line that can be matched by a 12th-order polynomial, so even if everything else were ok, you'd still have a problem. – Sycorax
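
On the basis-expansion point, one possible sketch is below. It uses a Chebyshev basis on $x$ rescaled to $[-1, 1]$ instead of the B-splines suggested above, simply because NumPy ships `numpy.polynomial.chebyshev.chebvander`; that substitution, the sample size, and the seed are assumptions rather than anything from the thread. The fitted model spans the same space of degree-15 polynomials; only the coordinates used to represent it change.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

rng = np.random.default_rng(100)
n, order = 200, 15
x = np.linspace(1, 1.5, n)
y = np.sin(10 * x) + rng.normal(0, 0.1, n)

# Map x from [1, 1.5] onto [-1, 1], the natural domain of Chebyshev polynomials.
t = 2 * (x - x.min()) / (x.max() - x.min()) - 1

X_mono = np.vander(x, order + 1, increasing=True)  # 1, x, ..., x**15
X_cheb = C.chebvander(t, order)                    # T_0(t), ..., T_15(t)

print("cond, monomial basis :", np.linalg.cond(X_mono))
print("cond, Chebyshev basis:", np.linalg.cond(X_cheb))

# Least-squares fit in the better-conditioned basis.
coef, *_ = np.linalg.lstsq(X_cheb, y, rcond=None)
rms = float(np.sqrt(np.mean((X_cheb @ coef - y) ** 2)))
print("training RMS residual:", rms)
```

A fit like this should leave a training residual near the noise level of 0.1. It still will not pass through every training point, since there are far more points than coefficients.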

I've copied these comments by @Sycorax and usεr11852 as a community wiki answer because the comments are, more or less, answers to this question. We have a dramatic gap between answers and questions. At least part of the problem is that some questions are answered in comments: if comments which answered the question were answers instead, we would have fewer unanswered questions. — mkt, Aug 21 '18 at 10:59
Also h/t to @usεr11852 (couldn't tag both in the same comment) — mkt, Aug 21 '18 at 10:59
