I'm trying to use a batch gradient descent algorithm to do linear regression on a large dataset: I load as much data as my computer can handle, do a partial fit, print some diagnostics to a CSV, and repeat.
Python pseudocode:

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split

m_scaler = StandardScaler()
m_data = <3600*100*24*9 samples, with 15 X channels and one y channel>
# max_iter is ignored by partial_fit; each partial_fit call is one pass over the chunk
sgdR_01 = SGDRegressor(max_iter=m_iter, alpha=1e-3)

for i in range(100):
    df = <select_360000_samples from m_data>
    df = <preprocess>
    # update the scaler's running statistics, then scale this chunk
    m_scaler.partial_fit(df[x_channels])
    df[x_channels] = m_scaler.transform(df[x_channels])
    X_train, X_test, y_train, y_test = train_test_split(df[x_channels], df[y_channel])
    # one epoch of SGD over this chunk
    sgdR_01.partial_fit(X_train, y_train)
    <track train/test score, train/test MSE, and coefficients for sgdR_01>
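The tracking step writes one diagnostics row per iteration; it's roughly the following (a minimal sketch, not my exact code; `diag_rows` and the column names are illustrative):

import pandas as pd
from sklearn.metrics import mean_squared_error

diag_rows = []  # illustrative: initialized before the loop

# inside the loop, after partial_fit:
diag_rows.append({
    "iter": i,
    "train_r2": sgdR_01.score(X_train, y_train),  # R^2 on this chunk's train split
    "test_r2": sgdR_01.score(X_test, y_test),     # R^2 on this chunk's test split
    "train_mse": mean_squared_error(y_train, sgdR_01.predict(X_train)),
    "test_mse": mean_squared_error(y_test, sgdR_01.predict(X_test)),
    **{f"coef_{j}": c for j, c in enumerate(sgdR_01.coef_)},
})

# after the loop:
pd.DataFrame(diag_rows).to_csv("sgd_diagnostics.csv", index=False)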
Preprocessing steps (sketched below):
- add polynomial combinations of certain channels
- oversample so y has a 'flat' histogram
- randomly select ~200,000 samples so my computer can handle the data
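Concretely, the <preprocess> step looks something like this (a minimal sketch, not my exact code; `poly_channels`, `y_col`, `n_bins`, and `n_keep` are illustrative names, and I'm using pairwise products as the polynomial combinations):

import itertools
import numpy as np
import pandas as pd

def preprocess(df, poly_channels, y_col="y", n_bins=20, n_keep=200_000, seed=0):
    """Hypothetical stand-in for the <preprocess> step above."""
    df = df.copy()
    # 1. Add polynomial combinations of selected channels (e.g. pairwise products)
    for a, b in itertools.combinations(poly_channels, 2):
        df[f"{a}*{b}"] = df[a] * df[b]
    # 2. Weight each sample inversely to the population of its y-bin,
    #    so resampling gives a roughly flat histogram over y
    bin_ids = pd.cut(df[y_col], bins=n_bins, labels=False).to_numpy()
    counts = np.bincount(bin_ids, minlength=n_bins)
    weights = 1.0 / counts[bin_ids]
    # 3. Randomly select ~n_keep samples using those weights
    return df.sample(n=n_keep, replace=True, weights=weights, random_state=seed)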
Right now, I'm hovering around 0.65 for my R^2 score, but every few iterations the score will drop to something like -50 or -900, and at the next iteration it's back around 0.65. What's going on when that happens? Why is SGDRegressor so erratic?