I've trained a CNN to do binary classification on 2D radar spectra. I've tried different dataset sizes (up to 200,000 samples per class) and always make sure the classes are equally represented. The architecture uses several regularisation techniques, from dropout to weight decay. I've tried different very basic architectures and have now reached the point where I'm only using a single convolutional layer followed by a single fully connected layer, so as not to increase the model complexity any more than necessary.

However, while the accuracy converges to above 90% on both the training and validation sets during training, testing the trained model on unseen data always gives the same result: it classifies almost all unseen samples as 1 and almost never as 0.
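For reference, here is a minimal sketch of the kind of model I mean (the input shape, layer sizes, and hyperparameters below are placeholders, not my exact values):

```python
import torch
import torch.nn as nn

# Sketch of the setup described above: one conv layer, one fully connected layer,
# dropout in the model, weight decay via the optimizer. Shapes are hypothetical.
class TinyRadarCNN(nn.Module):
    def __init__(self, in_channels=1, spectrum_size=(64, 64)):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 16, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2)
        self.drop = nn.Dropout(0.5)
        h, w = spectrum_size
        self.fc = nn.Linear(16 * (h // 2) * (w // 2), 2)  # two logits: class 0 vs class 1

    def forward(self, x):
        x = torch.relu(self.conv(x))
        x = self.pool(x)
        x = self.drop(torch.flatten(x, start_dim=1))
        return self.fc(x)

model = TinyRadarCNN()
criterion = nn.CrossEntropyLoss()
# Weight decay applied through the optimizer, as mentioned above.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
```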
I'm not sure what could cause this. Looking at the spectra, my feeling is that they look random, irrespective of the class. However, since training converges so well, there must be some useful information in the data, no?
Anyone have a hunch what might cause this huge bias on unseen data?