I have time sequences of 2D heat maps: for each of several people I have a sequence of heat maps over time, around 720 heat maps per person and around 50,000 heat maps in total.
Now, I would like to train an autoencoder on these heat maps to learn a meaningful low-dimensional representation. I will use Keras in Python for the implementation.
I thought about using a CNN for the encoder and decoder, but I'm unsure about the number of layers and their sizes. Another option could be a GAN-based approach such as CycleGAN or pix2pix, or taking time into account with an LSTM, but I don't know how to apply LSTMs to 2D image data. I have also heard that attention layers might be interesting.
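To make the CNN idea concrete, here is a minimal sketch of the kind of convolutional autoencoder I have in mind. The input size (64x64), the number of layers, the filter counts, and the latent dimension of 32 are all placeholders I made up, not properties of my real data:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

latent_dim = 32  # size of the low-dimensional representation (placeholder)

# Encoder: two strided convolutions, then a dense bottleneck.
encoder = keras.Sequential([
    layers.Input(shape=(64, 64, 1)),
    layers.Conv2D(16, 3, strides=2, padding="same", activation="relu"),  # -> 32x32x16
    layers.Conv2D(32, 3, strides=2, padding="same", activation="relu"),  # -> 16x16x32
    layers.Flatten(),
    layers.Dense(latent_dim),
], name="encoder")

# Decoder: mirror of the encoder using transposed convolutions.
decoder = keras.Sequential([
    layers.Input(shape=(latent_dim,)),
    layers.Dense(16 * 16 * 32, activation="relu"),
    layers.Reshape((16, 16, 32)),
    layers.Conv2DTranspose(16, 3, strides=2, padding="same", activation="relu"),   # -> 32x32x16
    layers.Conv2DTranspose(1, 3, strides=2, padding="same", activation="sigmoid"), # -> 64x64x1
], name="decoder")

autoencoder = keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")  # MSE as a first guess for the loss

# Dummy batch standing in for real heat maps (values scaled to [0, 1]).
x = np.random.rand(8, 64, 64, 1).astype("float32")
print(encoder.predict(x, verbose=0).shape)      # latent codes: (8, 32)
print(autoencoder.predict(x, verbose=0).shape)  # reconstructions: (8, 64, 64, 1)
```

After training with `autoencoder.fit(x, x, ...)`, I would use `encoder.predict` to get the low-dimensional codes. Is this roughly the right shape of architecture, or should it be deeper?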
What architectures and configurations would be reasonable here, and what loss function would you use?