I am currently reading this paper, and the following is the model used in it. I don't fully understand the purpose of using three tanh layers. I have read about the tanh activation function and how it can reduce training time compared to the sigmoid activation function, but I don't understand why three tanh layers are needed.
To explain what it is trying to do: the model takes two sentences, adds up the embeddings of the words in each sentence to create two 100d vectors, concatenates these two vectors to form a 200d vector, and feeds that into a stack of three tanh layers followed by a 3-way softmax, since there are three classes to classify into.
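To make the description concrete, here is a minimal NumPy sketch of the forward pass as I understand it. It is not the paper's implementation: the toy vocabulary, the random weights, the hidden width `HIDDEN_DIM = 200`, and the helper names `encode`, `init_layer`, and `forward` are all my own assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

EMBED_DIM = 100    # each sentence is summed into a 100d vector
HIDDEN_DIM = 200   # assumption: the hidden width is not stated above
NUM_CLASSES = 3    # 3-way softmax

# Hypothetical toy vocabulary with random (untrained) embeddings.
vocab = {"a": 0, "man": 1, "sleeps": 2, "person": 3, "rests": 4}
embeddings = rng.normal(scale=0.1, size=(len(vocab), EMBED_DIM))


def encode(sentence):
    """Sum the word embeddings of a sentence into one 100d vector."""
    ids = [vocab[word] for word in sentence.split()]
    return embeddings[ids].sum(axis=0)


def init_layer(n_in, n_out):
    """Return a randomly initialised (weights, bias) pair."""
    return rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out)


# Stack of three tanh layers, then a linear layer feeding the softmax.
tanh_layers = [
    init_layer(2 * EMBED_DIM, HIDDEN_DIM),
    init_layer(HIDDEN_DIM, HIDDEN_DIM),
    init_layer(HIDDEN_DIM, HIDDEN_DIM),
]
W_out, b_out = init_layer(HIDDEN_DIM, NUM_CLASSES)


def forward(sentence_1, sentence_2):
    """Return class probabilities for a pair of sentences."""
    # Concatenate the two 100d sentence vectors into one 200d input.
    h = np.concatenate([encode(sentence_1), encode(sentence_2)])
    for W, b in tanh_layers:
        h = np.tanh(h @ W + b)
    logits = h @ W_out + b_out
    exp = np.exp(logits - logits.max())  # shift for numerical stability
    return exp / exp.sum()


probs = forward("a man sleeps", "a person rests")
```

In this sketch the depth is just the length of `tanh_layers`, so it is easy to compare one, two, or three layers on the same input.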