I have developed a deep learning model to predict whether an image is affected by a certain disease or not. Accuracies of 99.8%, 88.8%, and 89% have been achieved on the training set, testing set, and validation set respectively. I am going to publish this work in a journal, so which accuracy should I report as the accuracy of my deep model? If I say the accuracy of my model is 99.8%, is that justified?
- Don't use accuracy, precision, recall, sensitivity, specificity, or the F1 score. Every criticism in the following threads applies equally to all of these, and indeed to all evaluation metrics that rely on hard classifications: [Why is accuracy not the best measure for assessing classification models?](https://stats.stackexchange.com/q/312780/1352) [Is accuracy an improper scoring rule in a binary classification setting?](https://stats.stackexchange.com/q/359909/1352) [Classification probability threshold](https://stats.stackexchange.com/q/312119/1352) – Stephan Kolassa Oct 20 '21 at 13:27
- Instead, use probabilistic classifications, and evaluate these using [proper scoring rules](https://stats.stackexchange.com/tags/scoring-rules/info) (a minimal sketch follows). – Stephan Kolassa Oct 20 '21 at 13:27
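To make the comment's suggestion concrete, here is a minimal sketch, assuming scikit-learn (the label and probability arrays are made-up toy values), of scoring probabilistic predictions with two proper scoring rules, the Brier score and log loss:

```python
# Toy illustration: score probabilistic predictions with proper scoring
# rules instead of thresholding them into hard classes.
import numpy as np
from sklearn.metrics import brier_score_loss, log_loss

y_true = np.array([0, 0, 1, 1, 0, 1])               # true labels (toy data)
y_prob = np.array([0.1, 0.4, 0.8, 0.9, 0.3, 0.6])   # predicted P(class = 1)

print("Brier score:", brier_score_loss(y_true, y_prob))  # lower is better
print("log loss:   ", log_loss(y_true, y_prob))          # lower is better
```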
2 Answers
Let's borrow some definitions from Jason Brownlee at MachineLearningMastery:
- Your training set is the sample of data used to fit the model.
- Your validation set is the sample of data used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters.
- Your test set is the sample of data used to provide an unbiased evaluation of the final, hyperparameter-tuned model's performance on new data.
In other words, the result of your experiment is the outcome of applying your model to new data, i.e. your test data. In your case, that means you can report an accuracy of 88.8%.
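As a hedged sketch of this workflow (assuming scikit-learn; the synthetic data and logistic regression are stand-ins for your images and deep model), the three-way split and the number to report would look like this:

```python
# Minimal sketch of a train/validation/test split and which accuracy
# to report. Data and classifier are placeholders for an image model.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# Carve off the test set first, then split the remainder 75/25 into
# training and validation sets (overall 60/20/20).
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("validation accuracy:", model.score(X_val, y_val))   # for tuning only
print("test accuracy:      ", model.score(X_test, y_test)) # the number to report
```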

You must always report the metrics of the test set, i.e. the data set that has been used only once, to assess the performance of the final network. Anything else might be biased by model fitting or hyperparameter tuning. You should also report other classification metrics alongside accuracy, such as recall or the F1 score (again evaluated on the test set). For imbalanced data sets (i.e. where one class label dominates), it is also important to balance the data set, or to take the imbalance into account when computing classification scores.
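For illustration, a minimal sketch (again assuming scikit-learn; the synthetic imbalanced data and logistic regression are stand-ins for an actual disease-detection model) of reporting accuracy alongside per-class precision, recall, and F1 on the test set:

```python
# Minimal sketch: report several classification metrics on the
# held-out test set, with the class imbalance taken into account.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 90% negative, 10% positive.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# class_weight="balanced" is one way to account for the imbalance.
model = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_train, y_train)

y_pred = model.predict(X_test)
print("test accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))  # per-class precision, recall, F1
```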

- Unbalanced classes are almost certainly not a problem, and oversampling will not solve a non-problem: [Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?](https://stats.stackexchange.com/q/357466/1352) – Stephan Kolassa Oct 20 '21 at 13:26
- @StephanKolassa "Not a problem" for what? If you train a classifier on a 1/99 class split and eventually report an accuracy of 99%, then this might sound impressive, but if the recall is 0% it becomes clear that accuracy was simply misleading, as it is dominated by the dominant class label (the toy sketch after this thread illustrates this). For deep learning, training on a 1/99 data set will strongly favor the dominant class and hence might result in a model that is biased towards that class. – a_guest Oct 21 '21 at 08:26
- Have you read the link? My argument is precisely that accuracy is an extremely poor and misleading KPI ([see also here](https://stats.stackexchange.com/a/312787/1352); precision/recall/F1 suffer from the exact same issues), and that the concern with "unbalanced" classes is nothing but a consequence of this poor choice of evaluation measure. Unfortunately, instead of addressing the *cause* by scrapping the misleading KPI, people started oversampling or found other ways of addressing the *symptoms*. – Stephan Kolassa Oct 21 '21 at 08:35
- @StephanKolassa I have read the link, and since the OP asked about accuracy in the context of detecting a (likely rare) disease, I included a comment about accuracy in the context of imbalanced data sets in my answer. Nowhere did I argue that imbalanced classes pose a "problem" or should "raise concern"; there are various ways to deal with the situation. A neural network won't be fitted to maximize accuracy anyway. But an imbalanced class split might pose a challenge for training a neural network to correctly predict the underrepresented class, so it's important to be aware of that. – a_guest Oct 21 '21 at 09:05
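To make the 1/99 example from the thread above concrete, here is a toy sketch (assuming scikit-learn) in which a classifier that always predicts the majority class reaches 99% accuracy while its recall on the rare class is 0%:

```python
# Toy illustration of the 1/99 example: always predicting the majority
# class yields 99% accuracy but 0% recall on the rare class.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([1] * 1 + [0] * 99)   # 1% positive, 99% negative
y_pred = np.zeros_like(y_true)          # always predict the majority class

print("accuracy:", accuracy_score(y_true, y_pred))  # 0.99, looks impressive
print("recall:  ", recall_score(y_true, y_pred))    # 0.0, every case missed
```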