I am doing multivariate time series classification on the TUH seizure corpus dataset.
I have built this model with Keras, using LSTM layers:
model = Sequential()
model.add(LSTM(50, return_sequences=True, input_shape=(look_back, trainX.shape[2])))
model.add(LSTM(50))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(trainX, trainY, validation_split=0.3, epochs=50, batch_size=1000, verbose=1)
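One detail worth noting about the fit call above: Keras's validation_split takes the last fraction of the arrays, before any shuffling, so with time-ordered windows the validation set is a contiguous block from the end. A minimal sketch of the equivalent manual split (with toy stand-ins for trainX/trainY, not the real EEG arrays):

```python
import numpy as np

# Toy stand-ins for trainX / trainY; labels deliberately ordered to show the issue.
trainX = np.arange(20, dtype=float).reshape(10, 2)
trainY = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

val_frac = 0.3
split = int(len(trainX) * (1 - val_frac))  # Keras cuts at this index, no shuffle
X_tr, X_val = trainX[:split], trainX[split:]
y_tr, y_val = trainY[:split], trainY[split:]

print(y_val)  # [1 1 1] -- the tail of the data; badly imbalanced if labels are ordered
```

If the positive windows cluster toward the end of the training arrays, the validation metrics during fit may not reflect the balanced class distribution at all.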
and the results are surprising... When I compute the confusion matrix like this:
trainPredict = model.predict(trainX)
testPredict = model.predict(testX)
print(confusion_matrix(trainY, trainPredict.round()))
print(confusion_matrix(testY, testPredict.round()))
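To see the per-class breakdown more directly than with the raw confusion matrix, sklearn's classification_report can be printed alongside it. A sketch with hypothetical stand-in labels (mimicking a model that predicts everything as one class, as in the test matrix above):

```python
import numpy as np
from sklearn.metrics import classification_report

# Hypothetical stand-ins for testY and testPredict.round().
y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([1, 1, 1, 1, 1, 1])  # degenerate model: everything in one class

# Recall for the ignored class drops to 0, which the report makes obvious.
print(classification_report(y_true, y_pred, digits=3, zero_division=0))
```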
I respectively get:
[[129261      0]
 [   172 129138]]
and
[[10822     0]
 [10871     0]]
In other words, my training confusion matrix is quite good, while my testing confusion matrix puts every sample in the same class. What is surprising is that the classes are almost perfectly balanced, in both the training and the testing set...
Why do I get this?
EDIT:
My preprocessing code, based on Jason Brownlee's tutorial, looks like this: I "reshape" the data so that each sample consists of the look_back measurements preceding the target to be predicted, each measurement consisting of 22 signals corresponding to the EEG channels:
def create_dataset(feat, targ, look_back=1):
    dataX, dataY = [], []
    print(len(targ) - look_back - 1)
    for i in range(len(targ) - look_back - 1):
        a = feat[i:(i + look_back), :]
        dataX.append(a)
        dataY.append(targ.iloc[i + look_back])
    return np.array(dataX), np.array(dataY)
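As a sanity check, the windowing can be exercised on toy data (made-up numbers, not the real EEG arrays) to confirm the output shapes and to see that consecutive windows overlap in all but one row:

```python
import numpy as np
import pandas as pd

def create_dataset(feat, targ, look_back=1):
    # Same sliding-window logic as above.
    dataX, dataY = [], []
    for i in range(len(targ) - look_back - 1):
        dataX.append(feat[i:(i + look_back), :])
        dataY.append(targ.iloc[i + look_back])
    return np.array(dataX), np.array(dataY)

feat = np.arange(30).reshape(10, 3)   # 10 time steps, 3 fake channels
targ = pd.Series(np.arange(10))
dataX, dataY = create_dataset(feat, targ, look_back=4)

print(dataX.shape)  # (5, 4, 3): samples, look_back, channels
# Consecutive windows share look_back - 1 rows, so samples are heavily correlated:
print(np.array_equal(dataX[0][1:], dataX[1][:-1]))  # True
```

This overlap is why a random or in-sequence train/validation split can leak information: nearly identical windows end up on both sides of the split.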
and then
look_back = 50
trainX, trainY = create_dataset(X_train_resampled, Y_train_resampled, look_back)
print("loopback1 done")
testX, testY = create_dataset(X_test_resampled, Y_test_resampled, look_back)
trainX and testX then have dimensions (#recordings, 50 (look_back), #features (22)).
I am not sure whether this way of working is adequate; maybe it is the cause of the error.
Thanks
EDIT: even when I properly split the data into training, validation and test sets before fitting the model, like this:
validX, validY = create_dataset(X_valid_resampled, Y_valid_resampled, look_back)
I still get poor results: a confusion matrix with far too many false positives and very few true negatives.
My doubts: should I increase my time-steps window (i.e. the look_back parameter)? By the way, is there a sound method to tune this parameter for a given context?
Maybe the usage of create_dataset is not appropriate (although it comes from a well-known tutorial)? Indeed, it seems to me that it introduces redundancy, and thus correlation, between sequences...
I hope someone can help.