We use two types of masks when training transformer models: one is the padding mask in the encoder, which adjusts for the length of the input sequence, and the other is the mask used by the decoder to prevent the leftward flow of information (cheating).
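To make sure we are talking about the same two masks, here is a rough sketch of what I mean (assuming a PyTorch-style implementation; the function names `padding_mask` and `causal_mask` are just mine for illustration, not from any particular library):

```python
import torch

def padding_mask(token_ids, pad_id=0):
    # True where the position holds a real token, False where it is padding,
    # so the attention scores at padded positions can be set to -inf.
    # Shape: (batch, 1, 1, seq_len), broadcastable over heads and query positions.
    return (token_ids != pad_id).unsqueeze(1).unsqueeze(2)

def causal_mask(seq_len):
    # Lower-triangular mask: position i may only attend to positions <= i,
    # which is what blocks the leftward flow of information in the decoder.
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
```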
I am confused as to whether we also need to apply the masking during the test phase (since we still have to compute the dot product between Q and K at test time). If so, is it the same mask we use during training, or a different one (my hunch is that it should be somewhat different on the decoder side)? If not, how do we handle variable-length sentences?
I have gone through the articles and questions below, but none of them addresses the masking:
- How to use the transformer for inference
- https://datascience.stackexchange.com/questions/51785/what-is-the-first-input-to-the-decoder-in-a-transformer-model?noredirect=1&lq=1
- https://datascience.stackexchange.com/questions/81727/what-would-be-the-target-input-for-transformer-decoder-during-test-phase