
I have been experimenting with the Transformer model provided in the Keras.io examples, training it for classification and seq2seq tasks on several datasets, and comparing it to GRU/LSTM models with almost the same number of parameters.

In all my experiments, everything except the model is held constant as a controlled variable. I ran the tests on my desktop GPU (GTX 1070), on a Google Colab GPU, and on an AWS Tesla T4.
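
To make the "almost the same number of parameters" comparison concrete, here is a minimal sketch of how the parameter counts of two candidate models could be checked in Keras. The layer sizes below are illustrative placeholders, not the models from my experiments.

```python
# Minimal sketch: verify that the two models being compared really do
# have comparable parameter counts. The layer sizes are placeholders.
import tensorflow as tf
from tensorflow.keras import layers

vocab_size, maxlen, embed_dim = 20000, 200, 32

# Small GRU baseline.
gru_model = tf.keras.Sequential([
    layers.Input(shape=(maxlen,)),
    layers.Embedding(vocab_size, embed_dim),
    layers.GRU(64),
    layers.Dense(2, activation="softmax"),
])

# Attention-based baseline of roughly similar size.
inputs = layers.Input(shape=(maxlen,))
x = layers.Embedding(vocab_size, embed_dim)(inputs)
x = layers.MultiHeadAttention(num_heads=2, key_dim=embed_dim)(x, x)
x = layers.GlobalAveragePooling1D()(x)
outputs = layers.Dense(2, activation="softmax")(x)
transformer_model = tf.keras.Model(inputs, outputs)

print("GRU params:        ", gru_model.count_params())
print("Transformer params:", transformer_model.count_params())
```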

What I observed:

  • Training the Transformer model takes longer for the same number of epochs.
  • Both training and validation losses are much higher for the Transformer model than for the GRU/LSTM model after the same number of epochs.

I tried various values for "number_of_heads" and "number_of_transformer_blocks", but the results did not really change.
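
For reference, here is a minimal sketch of the kind of Transformer block I mean, in the spirit of the Keras.io text-classification example; the exact layer sizes and defaults below are assumptions, and num_heads / num_transformer_blocks correspond to the hyperparameters I varied.

```python
# Minimal sketch of a Transformer-based classifier in the spirit of the
# Keras.io example; sizes and defaults are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers

class TokenAndPositionEmbedding(layers.Layer):
    """Token embeddings plus learned position embeddings."""
    def __init__(self, maxlen, vocab_size, embed_dim):
        super().__init__()
        self.token_emb = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)
        self.pos_emb = layers.Embedding(input_dim=maxlen, output_dim=embed_dim)

    def call(self, x):
        positions = tf.range(start=0, limit=tf.shape(x)[-1], delta=1)
        return self.token_emb(x) + self.pos_emb(positions)

def transformer_block(x, embed_dim, num_heads, ff_dim, dropout=0.1):
    # Self-attention sub-layer with residual connection and layer norm.
    attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)(x, x)
    attn = layers.Dropout(dropout)(attn)
    x = layers.LayerNormalization(epsilon=1e-6)(x + attn)
    # Position-wise feed-forward sub-layer, also with residual + norm.
    ffn = layers.Dense(ff_dim, activation="relu")(x)
    ffn = layers.Dense(embed_dim)(ffn)
    ffn = layers.Dropout(dropout)(ffn)
    return layers.LayerNormalization(epsilon=1e-6)(x + ffn)

def build_classifier(maxlen=200, vocab_size=20000, embed_dim=32,
                     num_heads=2, ff_dim=32, num_transformer_blocks=1,
                     num_classes=2):
    inputs = layers.Input(shape=(maxlen,))
    x = TokenAndPositionEmbedding(maxlen, vocab_size, embed_dim)(inputs)
    for _ in range(num_transformer_blocks):
        x = transformer_block(x, embed_dim, num_heads, ff_dim)
    x = layers.GlobalAveragePooling1D()(x)
    x = layers.Dropout(0.1)(x)
    x = layers.Dense(20, activation="relu")(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)
```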

Transformers are said to be superior to RNNs like GRU/LSTM, but I have never observed that. So my question is: am I missing something?

  • Are you tuning the hyperparameters of the models? Which ones? How? – Sycorax Sep 10 '21 at 13:51
  • I am not tuning hyperparameters; I use the same settings whenever possible, such as the optimizer (Adam), learning rate, and embedding size. Such a huge gap for vanilla models doesn't sound right. I can provide a comparison notebook on Colab soon. – meliksahturker Sep 10 '21 at 13:56
  • Tune the hyperparameters, especially the learning rate. The question you're implicitly trying to answer is "how does the best transformer network compare to the best LSTM/GRU network?" and the only way to get the best version of a network is to tune it. – Sycorax Sep 10 '21 at 13:58
  • What you describe is not superiority. I could tune an MLP that beats an RNN on a particular sequence task and dataset, but that would not justify advertising the MLP as "superior." Having said that, I am open to advice on tuning. – meliksahturker Sep 10 '21 at 14:03
  • You're saying that one network being better than an alternative, comparably-sized network is not a demonstration that one is "superior" to the other? What does "superior" mean to you? // Anyway, there's plenty of material about tuning neural networks that you can find using a search. Here's one example to get you started: https://stats.stackexchange.com/questions/342462/hyperparameter-tuning-in-neural-networks – Sycorax Sep 10 '21 at 14:07
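
Following up on the suggestion above to tune the hyperparameters, and especially the learning rate, here is a minimal sketch of a simple learning-rate sweep. The candidate rates, the IMDB dataset, and the build_classifier function (from the sketch earlier in the question) are assumptions for illustration, not part of the original comparison.

```python
# Minimal sketch of a learning-rate sweep; the candidate rates and the
# IMDB dataset are illustrative assumptions. `build_classifier` refers
# to the hypothetical builder sketched earlier in the question.
import tensorflow as tf
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences

maxlen, vocab_size = 200, 20000
(x_train, y_train), (x_val, y_val) = imdb.load_data(num_words=vocab_size)
x_train = pad_sequences(x_train, maxlen=maxlen)
x_val = pad_sequences(x_val, maxlen=maxlen)

results = {}
for lr in [1e-2, 3e-3, 1e-3, 3e-4, 1e-4]:  # assumed candidate rates
    model = build_classifier(maxlen=maxlen, vocab_size=vocab_size)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    history = model.fit(x_train, y_train, batch_size=64, epochs=5,
                        validation_data=(x_val, y_val), verbose=0)
    results[lr] = min(history.history["val_loss"])

best_lr = min(results, key=results.get)
print("Best learning rate:", best_lr, "with val loss:", results[best_lr])
```

The same sweep can be run for the GRU/LSTM baseline, so that each architecture is compared at its own best learning rate.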
