I have a CNN with 3 convolutional layers, 1 max-pooling layer, and 2 fully-connected layers before the softmax classification. The CNN is trained with Adagrad and achieves quite good performance. However, I'm curious why my loss is so stochastic (see below). Over the 30,000 iterations it will actually jump above the initial loss. Despite that, the accuracy of the CNN is fairly consistent throughout training. Could this be due to the use of dropout on the convolutional and fully-connected layers? If so, where is this discussed in scientific articles, lecture notes, or tutorials?
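For reference, here is a minimal sketch of the architecture as described. The framework (PyTorch), input size (32x32 RGB), channel counts, and kernel sizes are my own assumptions, not part of the actual model:

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes=10, p_drop=0.5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Dropout2d(p_drop),                       # dropout on conv layers
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Dropout2d(p_drop),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                            # the single max-pooling layer
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, 256), nn.ReLU(),    # assumes 32x32 input -> 16x16 after pooling
            nn.Dropout(p_drop),                         # dropout on the fully-connected layer
            nn.Linear(256, num_classes),                # logits; softmax is applied inside the loss
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```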
Edit: Added the parameters below. The learning rate and weight decay were found using grid search, and different values don't change the loss much. It might be that I haven't tuned them completely correctly, though. Actually, I'm surprised how sensitive Adagrad is to these variables (increasing the LR/WD by a factor of 10 from these values actually causes my training to diverge). I'll look into adding a validation/test loss when I have time.
- Learning Rate: 0.003
- Weight Decay: 0.0005
- Dropout: 0.5
- Minibatch-size: 10
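For completeness, a sketch of how I wire these parameters into the training loop. The dataset, the `SmallCNN` class from the earlier sketch, and the logging cadence are assumptions; the relevant point is that with a minibatch of 10 and dropout active, each iteration's loss is computed on a tiny sample through a randomly thinned network, so the raw per-iteration loss is much noisier than a running average of it (or than the accuracy):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

model = SmallCNN(num_classes=10, p_drop=0.5)
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.003, weight_decay=0.0005)
criterion = nn.CrossEntropyLoss()
loader = DataLoader(train_dataset, batch_size=10, shuffle=True)  # train_dataset is assumed to exist

running = None
for step, (x, y) in enumerate(loader):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    # The raw minibatch loss jumps around; an exponential moving average is far smoother.
    running = loss.item() if running is None else 0.99 * running + 0.01 * loss.item()
    if step % 100 == 0:
        print(f"step {step}: minibatch loss {loss.item():.3f}, smoothed {running:.3f}")
```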