I've trained a CNN to do binary classification on 2D radar spectra. I've tried different dataset sizes (up to 200,000 samples per class) and always make sure the classes are equally represented. The architecture uses several regularisation techniques, from dropout to weight decay. I've tried different very basic architectures and have now reached the point where I'm only using a single convolutional layer followed by a single fully connected layer, so as not to increase the model complexity any more than necessary.

However, while the accuracy converges to above 90% on both the training and validation sets during training, testing the trained model on unseen data always gives the same result: it classifies almost all unseen samples as 1 and almost never as 0.
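For reference, here is a minimal sketch of the kind of model I mean (the input shape, layer sizes, and hyperparameters below are placeholders, not my exact values):

```python
import torch
import torch.nn as nn

# Sketch of the setup described above: one conv layer, one fully connected layer,
# dropout in the model, weight decay via the optimizer. Shapes are hypothetical.
class TinyRadarCNN(nn.Module):
    def __init__(self, in_channels=1, spectrum_size=(64, 64)):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 16, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2)
        self.drop = nn.Dropout(0.5)
        h, w = spectrum_size
        self.fc = nn.Linear(16 * (h // 2) * (w // 2), 2)  # two logits: class 0 vs class 1

    def forward(self, x):
        x = torch.relu(self.conv(x))
        x = self.pool(x)
        x = self.drop(torch.flatten(x, start_dim=1))
        return self.fc(x)

model = TinyRadarCNN()
criterion = nn.CrossEntropyLoss()
# Weight decay applied through the optimizer, as mentioned above.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
```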
I'm not sure what could cause this. Looking at the spectra, my feeling is that they look random, irrespective of the class. However, since training converges so well, there must be some useful information in the data, no?
Anyone have a hunch what might cause this huge bias on unseen data?