I have a sequence-to-sequence LSTM (encoder/decoder model) that I built following this tutorial. I'm trying to output a series of human poses (as 3D coordinates) with shape (N, 17, 3). I'm training the model on dance choreography (where the pose changes constantly), but the issue is that the output of my model is essentially the same pose repeated N times.
During the evaluation phase, I save the model output (shape: (batch_size, seq_len, 17, 3)), and when I inspect it afterwards it's essentially the same pose repeated seq_len times (and it's the same across all the batches). When I run inference on a new sample, I again get that same pose repeated, just with noise (slight shifts in the coordinates).
What's confusing is that during training the loss becomes very small (the SmoothL1Loss drops below 0.016). For the loss I just compare the output sequence against a sequence of poses extracted from an example dance. It seems like the model is finding the "average" pose that minimizes the loss over the whole sequence, when what I want is a series of different poses.
Is this behavior a symptom of how I'm performing training (I'm not sure which details are relevant, so please let me know if I've left out something important), or is it because the loss function can't enforce that poses should change by some delta from frame to frame? If it's the latter, are there any tutorials/recommendations for writing a custom loss function? I'd greatly appreciate any insight!
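To make the second option concrete, here's a rough sketch of the kind of delta-based term I have in mind (this isn't code I've tried; the class name and weight are placeholders, and the idea is just to penalize the difference between frame-to-frame motion of the prediction and of the target so that a constant pose can't score well):

```python
import torch
import torch.nn as nn

class PoseWithVelocityLoss(nn.Module):
    """SmoothL1 on the poses plus SmoothL1 on the frame-to-frame deltas."""

    def __init__(self, velocity_weight=1.0):
        super().__init__()
        self.pose_loss = nn.SmoothL1Loss()
        self.velocity_weight = velocity_weight

    def forward(self, pred, target):
        # pred, target: (batch_size, seq_len, 17, 3)
        pose_term = self.pose_loss(pred, target)

        # Frame-to-frame deltas: (batch_size, seq_len - 1, 17, 3)
        pred_delta = pred[:, 1:] - pred[:, :-1]
        target_delta = target[:, 1:] - target[:, :-1]
        velocity_term = self.pose_loss(pred_delta, target_delta)

        return pose_term + self.velocity_weight * velocity_term
```

Is something along these lines a reasonable direction, or is the real problem elsewhere in how I'm training?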
For some additional context, here's my model architecture: the Encoder and Decoder are 2-layer LSTMs with dropout, wrapped in a seq2seq class that calls encoder(input) and then decodes that output one step at a time. For training I'm using an SGD optimizer and SmoothL1Loss as the loss function.
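A simplified sketch of roughly what that setup looks like (the hidden size, dropout, learning rate, and the choice of the last input pose as the first decoder input are placeholders, not my exact code):

```python
import torch
import torch.nn as nn

POSE_DIM = 17 * 3  # each pose flattened to 51 values

class Encoder(nn.Module):
    def __init__(self, hidden_size=256):
        super().__init__()
        self.lstm = nn.LSTM(POSE_DIM, hidden_size, num_layers=2,
                            dropout=0.2, batch_first=True)

    def forward(self, x):
        # x: (batch_size, src_len, POSE_DIM)
        _, (hidden, cell) = self.lstm(x)
        return hidden, cell

class Decoder(nn.Module):
    def __init__(self, hidden_size=256):
        super().__init__()
        self.lstm = nn.LSTM(POSE_DIM, hidden_size, num_layers=2,
                            dropout=0.2, batch_first=True)
        self.out = nn.Linear(hidden_size, POSE_DIM)

    def forward(self, x, hidden, cell):
        # x: (batch_size, 1, POSE_DIM) -- decode one step at a time
        output, (hidden, cell) = self.lstm(x, (hidden, cell))
        return self.out(output), hidden, cell

class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, src, trg_len):
        # src: (batch_size, src_len, POSE_DIM)
        hidden, cell = self.encoder(src)
        step = src[:, -1:, :]  # seed the decoder with the last input pose
        outputs = []
        for _ in range(trg_len):
            step, hidden, cell = self.decoder(step, hidden, cell)
            outputs.append(step)
        return torch.cat(outputs, dim=1)  # (batch_size, trg_len, POSE_DIM)

model = Seq2Seq(Encoder(), Decoder())
criterion = nn.SmoothL1Loss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
```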