Neural network training better without MinMaxScaler()?

Question

Forum,

I have a multivariate time series problem; For my master thesis I am investigating whether it is possible to forecast the movement direction of stock price with machine learning. My model looks as follows:

def sentdex_model(X_train):
    model = Sequential()
    model.add(LSTM(33, input_shape=(X_train.shape[1:]), return_sequences=True))
    model.add(LSTM(33, input_shape=(X_train.shape[1:])))
    model.add(Dense(90, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
    # model.summary()
    return model

The input data is in the form of [#samples, timesteps, features]. The features are OHLCV (Open-High-Low-Close-Volume) data of 6 different telecom companies. Im trying to predict whether a stock will rise (1) or fall (0). So it is basically a time series classification problem. I've always learned that, when data is presented on very different intervals - it is good practice to MinMaxScale the input data into the neural network.

However, when I do so, the train accuracy of the model keeps hovering around the baseline of 0.50 (There is an equal amount of 1s (price rise) and 0s (price fall). So, the model is not really learning. Now, when i dont MinMaxScale, accuracy slowly climbs to around 75% over 50 epochs.

Can anyone explain why the model without MinMaxScaling seems to learn better than the model with MinMaxScaling?

score 1 · Accepted Answer · answered Jun 08 '21 at 13:03

Training accuracy simply describes how well a model can fit the training data, but it doesn't explain how well your model will generalise to unseen data. After each epoch evaluate accuracy on a validation set which the model never trains on. You may very well find that without MinMaxScaler the validation accuracy is worse than with.

Additionally MinMaxScaler isn't always the best choice for scaling. Often standardization is preferable as it enforces that features have a mean of zero and unit variance. It looks like you're using Sklearn for the preprocessing, and the corresponding standardization scaler in Sklearn is StandardScaler.

If you discover that the validation accuracy is significantly worse than the training accuracy for all epochs it implies your model cannot fit your data well enough. This suggests you should try introducing more layers dense layers as well as BatchNormalization after all but the last dense layer.

Data scaling is always beneficial and I can't think of a case where it will worsen accuracy in a sufficiently expressive model. So to that extent I think the better accuracy on training without scaling is just an coincidental artefact of saturation in your nonlinear activations from unscaled inputs coupled with a network architecture that is not complex enough to capture your problem.

Also in a binary classification situation, an accuracy of 50% can imply that the network isn't actually learning anything.

To summarize: evaluate accuracy on the validation set, then try StandardScaler, and then try introducing more dense layers and regularization techniques like BatchNormalization.

Thanks for the answer!! I read in this question (1st answer comments) https://stats.stackexchange.com/questions/10289/whats-the-difference-between-normalization-and-standardization that minmax is actually beneficial when data isn't normally distributed. Which isn't the case in my data, so I decided to go for MinMax. Moreover, I'm doing walk-forward validation - validating on 1 year of data. This means that for every day in this data I have to retrain my model - which makes experiments last long (+- 6 hours). Therefore I decided to first try and get train acc up. (1/2) — Psychotechnopath, Jun 08 '21 at 14:54
Moreover, since I am trying to predict stock price (Which by definition could be an "unpredictable" problem) it is hard to know whether the data is the problem or the network structure. You say I should use batchnorm as regularization technique - what about bias regularizing the lstm layers? and would adding more LSTM units/layers also be a good idea? How many dense layers shall I try? — Psychotechnopath, Jun 08 '21 at 14:55
@Psychotechnopath that's actually a very good point regarding the data not being normally distributed. And considering stock price can exhibit large fluctuations I'm wondering if maybe preprocessing the data is not the way to go, but rather perform some 'live processing': try adding a BatchNorm layer *before* the first LSTM. And in regards to regularising the LSTM layers or adding more, I'm not entirely sure, LSTMs arent my area of experience here. I'd say try 'fixing' the scaling issue first before adding more layers, and when you do add layers do it 1 at a time. — Avelina, Jun 08 '21 at 15:14

Neural network training better without MinMaxScaler()?

1 Answers1