
I built a small auto-encoder for greyscale images. It only exists for quick experiments, so I retrain it often, and I am seeing some strange behaviour.

On some initialisations, it does not converge: the MSE loss stays around 0.25 and never goes down, and the reconstructed images are uniform grey.

On most other initialisations, it converges to a loss around 0.13 during the first epoch. The reconstructed images are very blurry, but that is expected for such a small network. Here is the model summary:

```
----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Conv2d-1         [-1, 16, 288, 288]             160
         MaxPool2d-2           [-1, 16, 72, 72]               0
            Conv2d-3            [-1, 8, 72, 72]           1,160
         MaxPool2d-4            [-1, 8, 18, 18]               0
            Conv2d-5            [-1, 3, 18, 18]             219
         MaxPool2d-6              [-1, 3, 9, 9]               0
            Conv2d-7              [-1, 3, 9, 9]              84
          Upsample-8            [-1, 3, 18, 18]               0
            Conv2d-9            [-1, 8, 18, 18]             224
         Upsample-10            [-1, 8, 72, 72]               0
           Conv2d-11           [-1, 16, 72, 72]           1,168
         Upsample-12         [-1, 16, 288, 288]               0
           Conv2d-13          [-1, 1, 288, 288]             145
================================================================
Total params: 3,160
Trainable params: 3,160
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.32
Forward/backward pass size (MB): 22.84
Params size (MB): 0.01
Estimated Total Size (MB): 23.17
----------------------------------------------------------------
```
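
For reference, a module consistent with this summary looks roughly like the sketch below. The kernel sizes and paddings are forced by the parameter counts; the ReLU activations and the default nearest-neighbour upsampling are assumptions (the summary does not list them), so the real code may differ in those details.

```
import torch.nn as nn
import torch.nn.functional as F

class SmallAutoencoder(nn.Module):
    """Sketch reconstructed from the torchsummary output above (not the original code)."""

    def __init__(self):
        super().__init__()
        # Encoder: 1 x 288 x 288 -> 3 x 9 x 9 bottleneck
        self.conv1 = nn.Conv2d(1, 16, 3, padding=1)   # 160 params
        self.pool1 = nn.MaxPool2d(4)                  # 288 -> 72
        self.conv2 = nn.Conv2d(16, 8, 3, padding=1)   # 1,160 params
        self.pool2 = nn.MaxPool2d(4)                  # 72 -> 18
        self.conv3 = nn.Conv2d(8, 3, 3, padding=1)    # 219 params
        self.pool3 = nn.MaxPool2d(2)                  # 18 -> 9
        # Decoder: 3 x 9 x 9 -> 1 x 288 x 288
        self.conv4 = nn.Conv2d(3, 3, 3, padding=1)    # 84 params
        self.up1 = nn.Upsample(scale_factor=2)        # 9 -> 18
        self.conv5 = nn.Conv2d(3, 8, 3, padding=1)    # 224 params
        self.up2 = nn.Upsample(scale_factor=4)        # 18 -> 72
        self.conv6 = nn.Conv2d(8, 16, 3, padding=1)   # 1,168 params
        self.up3 = nn.Upsample(scale_factor=4)        # 72 -> 288
        self.conv7 = nn.Conv2d(16, 1, 3, padding=1)   # 145 params

    def forward(self, x):
        x = self.pool1(F.relu(self.conv1(x)))
        x = self.pool2(F.relu(self.conv2(x)))
        x = self.pool3(F.relu(self.conv3(x)))
        x = self.up1(F.relu(self.conv4(x)))
        x = self.up2(F.relu(self.conv5(x)))
        x = self.up3(F.relu(self.conv6(x)))
        return self.conv7(x)   # linear output, 1 x 288 x 288
```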

This is PyTorch 1.10.2 with CUDA 10.2.

The parameters are:

```
import torch.nn as nn
import torch.optim as optim
from torchvision import transforms

BATCH_SIZE = 8
EPOCHS = 10

normalize = transforms.Normalize((0.5,), (0.5,))
criterion = nn.MSELoss()
optimizer = optim.Adadelta(self.net.parameters())
```
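
For context, the transform and loader are wired up roughly like this; the dataset class (`ImageFolder`) and the path are placeholders here, not the real code:

```
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Sketch only: ImageFolder and the path stand in for the real dataset.
# The point is where `normalize` (defined above) fits into the pipeline.
transform = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),   # single-channel input
    transforms.ToTensor(),                         # pixels scaled to [0, 1]
    normalize,                                     # maps [0, 1] to [-1, 1]
])

dataset = datasets.ImageFolder("path/to/images", transform=transform)
loader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True)
```

Such a dataset yields `(image, label)` pairs, which is why the loop below reads the inputs as `batch_data[0]`.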

The training loop is quite simple:

```
for epoch in range(self.epochs):

    # Iterate over the data in batches
    for i, batch_data in enumerate(self.train_data.loader, 0):

        # get the inputs
        inputs = batch_data[0].to(self.device)

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward
        outputs = self.net(inputs)
        loss = self.criterion(outputs, inputs)

        # backward + optimize
        loss.backward()
        optimizer.step()

```
  • There is no "PyTorch" tag on this StackExchange? – Xiiryo Feb 22 '22 at 08:01
  • Without data, this is impossible to reproduce, so the only thing a reader can do is guess. The code also appears to be incomplete, because some objects are created but never used. My first guess is that the learning rate might be too large (you're just using the default, which suggests you haven't experimented with alternative values; this is poor practice), or that max pooling degrades the signal too much, or that some runs get “unlucky” sequences of batches. These and more guesses are in the duplicate thread. – Sycorax Feb 22 '22 at 13:22
  • Thanks, at least it means that there is no obvious error in the displayed code. The learning rate may not be the culprit: I tested several learning rates with Adam as well as with this adaptive optimizer. I will switch to a larger network and pay attention to the max pools. – Xiiryo Feb 22 '22 at 22:09
  • A network can still have bugs, in the sense of doing something you don't want, even if there are no error messages. If it's getting stuck at a bad loss value, I wonder if the culprit is a dying ReLU phenomenon. – Sycorax Feb 22 '22 at 22:49
  • Thanks. I will test with a leaky ReLU (sketched below) to check that. – Xiiryo Feb 22 '22 at 23:51
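
For reference, the leaky-ReLU test mentioned in the last comment would, in the reconstruction sketched above, amount to swapping the activations in the forward pass:

```
import torch.nn.functional as F

# Drop-in replacement for the forward pass of the sketched module above:
# leaky_relu keeps a small slope (0.01 by default) for negative inputs,
# so units still receive a gradient instead of "dying".
def forward(self, x):
    x = self.pool1(F.leaky_relu(self.conv1(x)))
    x = self.pool2(F.leaky_relu(self.conv2(x)))
    x = self.pool3(F.leaky_relu(self.conv3(x)))
    x = self.up1(F.leaky_relu(self.conv4(x)))
    x = self.up2(F.leaky_relu(self.conv5(x)))
    x = self.up3(F.leaky_relu(self.conv6(x)))
    return self.conv7(x)   # output layer stays linear
```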

0 Answers