I am trying my first LSTM with keras to classify time-dependent data sets.
I have created training and testing data sets, which I have normalized:
# Compute the mean and standard deviation for each feature of the training set
train_data <- my.matrix[1:end.train,]
mean_features <- apply(train_data, 2, mean, na.rm = TRUE)
std <- apply(train_data, 2, sd, na.rm = TRUE)
# Scaling the whole data set using the mean and sd of the training set
my.matrix <- scale(my.matrix, center = mean_features, scale = std)
The categories are one-hot encoded, and the final array has dimensions [84000, 10, 22], meaning 84,000 observations, a rolling window (data.window) of ten time steps, and 22 features. My batch size is 500, but I have tried multiple values.
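The windowing itself looks roughly like this (a minimal sketch; the integer class vector labels and the alignment of each window to the class of its last observation are illustrative, not my exact code):

data.window <- 10
n.windows <- nrow(my.matrix) - data.window + 1

# One-hot encode the class attached to the last observation of each window
onehot_labels <- to_categorical(labels[data.window:nrow(my.matrix)], num_classes = 3)

# Slide a window of data.window time steps over the scaled matrix,
# giving an array of shape [samples, timesteps, features]
windows <- array(NA_real_, dim = c(n.windows, data.window, ncol(my.matrix)))
for (i in seq_len(n.windows)) {
  windows[i, , ] <- my.matrix[i:(i + data.window - 1), ]
}
my.matrix <- windows  # now [84000, 10, 22]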
Then I create the following LSTM:
model <- keras_model_sequential()
model %>%
  layer_lstm(units = 50,
             input_shape = c(data.window, dim(my.matrix)[3]),
             batch_size = batch.size,
             return_sequences = TRUE,
             recurrent_dropout = 0.2,
             stateful = TRUE) %>%
  layer_dropout(rate = 0.2) %>%
  layer_lstm(units = 50,
             recurrent_dropout = 0.2,
             return_sequences = FALSE,
             stateful = TRUE) %>%
  layer_dropout(rate = 0.2) %>%
  layer_dense(units = 3, activation = 'sigmoid')
optimizer <- optimizer_sgd(lr = 0.1)
model %>%
  compile(
    loss = 'categorical_crossentropy',
    optimizer = optimizer,
    metrics = c('accuracy')
  )
model
Here is the model summary:
Model
Model: "sequential_4"
________________________________________________________________________________________
Layer (type)                        Output Shape                   Param #
========================================================================================
lstm_4 (LSTM)                       (500, 10, 50)                  14600
________________________________________________________________________________________
dropout_4 (Dropout)                 (500, 10, 50)                  0
________________________________________________________________________________________
lstm_5 (LSTM)                       (500, 50)                      20200
________________________________________________________________________________________
dropout_5 (Dropout)                 (500, 50)                      0
________________________________________________________________________________________
dense_4 (Dense)                     (500, 3)                       153
========================================================================================
Total params: 34,953
Trainable params: 34,953
Non-trainable params: 0
________________________________________________________________________________________
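As a sanity check, the parameter counts match the usual LSTM formula of 4 × ((inputs + units) × units + units) parameters per layer: 4 × ((22 + 50) × 50 + 50) = 14,600 for the first LSTM, 4 × ((50 + 50) × 50 + 50) = 20,200 for the second, and 50 × 3 + 3 = 153 for the dense layer.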
Launching the training is disappointing: the loss drops very quickly, and then both the loss and the accuracy stay constant.
for (i in 1:2000) {
  print(paste("Training epoch:", i))
  model %>% fit(x = train_data,
                y = train_labels,
                batch_size = batch.size,
                epochs = 1,
                verbose = 1,
                shuffle = FALSE)
  model %>% reset_states()
}
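For what it's worth, the manual loop can also be written by letting fit() run all the epochs and resetting the LSTM states from a callback; a minimal equivalent sketch:

# Reset the LSTM states at the end of every epoch instead of looping manually
reset_cb <- callback_lambda(
  on_epoch_end = function(epoch, logs) {
    model %>% reset_states()
  }
)

model %>% fit(x = train_data,
              y = train_labels,
              batch_size = batch.size,
              epochs = 2000,
              verbose = 1,
              shuffle = FALSE,
              callbacks = list(reset_cb))

Either way, the states are cleared between epochs so that one epoch's final state does not leak into the next.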
This gives:
[1] "Training epoch: 1"
168/168 [==============================] - 4s 25ms/step - loss: 0.2838 - accuracy: 0.2484
[1] "Training epoch: 2"
168/168 [==============================] - 4s 27ms/step - loss: 1.1921e-07 - accuracy: 0.2020
[1] "Training epoch: 3"
168/168 [==============================] - 4s 23ms/step - loss: 1.1921e-07 - accuracy: 0.2020
[1] "Training epoch: 4"
168/168 [==============================] - 4s 25ms/step - loss: 1.1921e-07 - accuracy: 0.2020
[1] "Training epoch: 5"
168/168 [==============================] - 4s 23ms/step - loss: 1.1921e-07 - accuracy: 0.2020
[1] "Training epoch: 6"
168/168 [==============================] - 4s 25ms/step - loss: 1.1921e-07 - accuracy: 0.2020
[1] "Training epoch: 7"
168/168 [==============================] - 4s 27ms/step - loss: 1.1921e-07 - accuracy: 0.2020
[1] "Training epoch: 8"
168/168 [==============================] - 4s 25ms/step - loss: 1.1921e-07 - accuracy: 0.2020
This goes on with no notable change.
I tried playing with the optimizer, the learning rate, the dropout, the recurrent dropout, etc.
So far, nothing has changed the fact that from epoch 2 onward the loss sits at about 10⁻⁷ and the accuracy does not move significantly.
This really is a pet project I'm doing to learn deep learning, but I end up trying random things without understanding them, which is frustrating.
I'm not asking you to solve the issue for me, just to give me pointers to where I could better understand this behaviour.
Edit: As suggested in the comments, I'll clarify. My problem is not actually that the loss is small; it is that the network does not seem to learn, and I assumed the tiny loss was the cause. If the loss is that small, I have presumably reached a minimum (at least a local one), which I thought would explain why the model stops learning.
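To check whether the network has simply collapsed to a constant output, I suppose I could inspect the raw predictions; a minimal sketch (preds is just an illustrative name):

# Inspect what the trained network actually outputs on the training windows
preds <- model %>% predict(train_data, batch_size = batch.size)
head(preds)           # per-class scores for the first few windows
table(max.col(preds)) # how often each class is predicted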
The problem I'm trying to "solve" is to predict whether the price of a stock will increase, decrease, or stay range bound. I know this is a difficult problem with no real solution, but there are plenty of blog posts experimenting on the subject, so I used them as an introduction to deep learning. Once more, I'm not trying to solve this; I'm trying to learn by example.